Lecture from: 21.04.2023 | Video: YT
This lecture builds upon the foundation of pipelined and multi-cycle microarchitectures, addressing their limitations and introducing advanced techniques employed in high-performance processors.
Review of In-Order Pipeline Limitations
Our discussion of pipelined processors with multi-cycle execution units revealed a key limitation. While pipelining allows instructions to be in different stages concurrently (Fetch, Decode, Execute, etc.), the typical in-order dispatch mechanism means that instructions are sent to the functional units in their original program sequence.
This in-order dispatch creates a bottleneck when a non-ready instruction reaches the dispatch stage. A non-ready instruction is one whose source operands are not yet available, typically because a preceding instruction that produces the required value has not yet completed its execution.
As reviewed, flow dependencies (Read-After-Write) represent true data dependencies that must be respected, whereas Anti (WAR) and Output (WAW) dependencies are false dependencies arising from the reuse of architectural register names. Register renaming, often implemented with structures like the Reorder Buffer as discussed in the previous lecture, effectively eliminates these false dependencies by providing unique micro-architectural “names” for different versions of a register value.
Even with renaming, if an instruction has a true dependency (RAW) and its required operand is not yet available when it reaches the dispatch stage, it must wait. In an in-order dispatch pipeline, this waiting instruction blocks the dispatch of all subsequent instructions, even if those younger instructions are independent and ready to execute.
This scenario is analogous to a blockage in a sequential processing line, where one stalled item prevents all subsequent items from moving forward, regardless of their readiness or processing path.
The inability to dispatch ready, independent instructions simply because an older, non-ready instruction is stuck in the dispatch stage significantly limits the processor’s ability to extract instruction-level parallelism and tolerate the latency of long-running operations.
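The cost of this bottleneck can be seen in a toy model. The sketch below is a minimal simulation, assuming a hypothetical 3-instruction sequence (a 6-cycle MUL, a dependent ADD, and an independent ADD) and one dispatch per cycle; the instruction format and latencies are illustrative, not from a specific machine.

```python
# Toy in-order dispatch model: a stalled instruction blocks younger,
# independent ones. Hypothetical sequence:
#   I1: MUL R3 <- R1, R2   (6-cycle latency)
#   I2: ADD R5 <- R3, R4   (RAW on R3: must wait for I1)
#   I3: ADD R8 <- R6, R7   (independent, but dispatched after I2 in order)

def in_order_dispatch(instrs):
    """Return the cycle each instruction is dispatched.
    instrs: list of (dest, sources, latency)."""
    ready_at = {}          # register -> cycle its value becomes available
    start = []
    cycle = 1
    for dest, srcs, lat in instrs:
        # In-order dispatch: stall until all sources of the OLDEST
        # undispatched instruction are ready.
        cycle = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        start.append(cycle)
        ready_at[dest] = cycle + lat
        cycle += 1         # at most one dispatch per cycle
    return start

program = [("R3", ["R1", "R2"], 6),
           ("R5", ["R3", "R4"], 1),
           ("R8", ["R6", "R7"], 1)]
print(in_order_dispatch(program))   # → [1, 7, 8]
```

The independent I3 is delayed to cycle 8 purely because I2 sits stalled ahead of it; an out-of-order machine could dispatch I3 in cycle 2.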
The Principle of Out-of-Order Execution
The fundamental principle of Out-of-Order (OoO) Execution, also known as Dynamic Instruction Scheduling, is to overcome the limitations of in-order dispatch by allowing instructions to be sent to the functional units and begin execution whenever their operands become ready, irrespective of their original program order.
This approach allows the processor to dynamically identify and exploit instruction-level parallelism that may not be apparent or safely extractable by a static compiler, particularly in the face of run-time uncertainties like cache misses or branch outcomes.
The core idea is to provide a “waiting area” or buffer where instructions can reside after being decoded and renamed, until all their source operands are available. These buffers are historically called Reservation Stations.
In this waiting area, each instruction monitors the availability of its source values. When all required values for an instruction become ready, that instruction is considered “awake” or ready to “fire” (terms borrowed from dataflow computing). The processor then selects ready instructions from the reservation stations to dispatch to the functional units.
This dynamic scheduling based on data availability, rather than static program order, is the hallmark of Out-of-Order Execution. It directly addresses the problem of dispatch stalls by allowing independent instructions to bypass older, stalled instructions, thereby tolerating the latency of long-running operations and improving overall throughput.
Enabling Out-of-Order Execution: Key Mechanisms
Implementing Out-of-Order Execution requires several interacting hardware mechanisms:
- Linking Producer to Consumer: As discussed, this is achieved through Register Renaming. Architectural register IDs are mapped to unique micro-architectural identifiers (tags) representing the locations where the values will be produced (e.g., a specific reservation station entry). This ensures that dependent instructions correctly identify where to obtain their required operands, even if multiple instructions target the same architectural register name.
- Buffering Instructions: Instructions are placed into Reservation Stations after decoding and renaming. These buffers provide the waiting area where instructions reside until they are ready to execute, allowing subsequent instructions in the program order to continue through the earlier pipeline stages (fetch, decode, rename).
- Tracking Readiness: Instructions in the reservation stations must monitor the availability of their source operands. When an instruction completes execution and produces a result, it broadcasts a tag identifying the produced value. Instructions in the reservation stations compare their source tags against the broadcast tag. If a match occurs for a source operand that was previously not ready, that operand’s readiness status is updated.
- Dispatching Ready Instructions: When all source operands for an instruction in a reservation station become ready, the instruction “wakes up.” Logic is needed to select one or more ready instructions from the reservation stations to dispatch to the available functional units in a given cycle.
The process of broadcasting a tag and value upon completion and having waiting instructions capture this is often facilitated by a Common Data Bus (CDB) or similar broadcast network. This bus disseminates the results produced by functional units and allows dependent instructions (in reservation stations) or the register file (or register map) to update their state based on the tag.
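The tag-match-and-capture behavior of a single reservation station entry can be sketched in a few lines. This is a minimal illustration, assuming a two-source entry; the field names (`src`, `capture`, `ready`) and the `('V', value)` / `('T', tag)` encoding are hypothetical, not a specific design.

```python
# Minimal sketch of CDB tag broadcast and value capture for one RS entry.
class RSEntry:
    def __init__(self, op, src1, src2):
        # Each source is either ('V', value) if ready or ('T', tag) if
        # still waiting for a producer to broadcast on the CDB.
        self.op, self.src = op, [src1, src2]

    def capture(self, tag, value):
        # Compare the broadcast tag against every waiting source;
        # on a match, latch the value and mark that source ready.
        for i, (kind, v) in enumerate(self.src):
            if kind == 'T' and v == tag:
                self.src[i] = ('V', value)

    def ready(self):
        return all(kind == 'V' for kind, _ in self.src)

# An ADD waiting on tag 'x' for source 1; source 2 already holds value 4.
entry = RSEntry('ADD', ('T', 'x'), ('V', 4))
assert not entry.ready()
entry.capture('x', 2)      # CDB broadcast: tag 'x', value 2
assert entry.ready()
```

In real hardware this comparison happens in every RS entry in parallel on every broadcast, which is what makes the CDB wiring and comparator logic expensive.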
Tomasulo’s Algorithm
The concepts of Out-of-Order Execution with register renaming and reservation stations were famously realized in Tomasulo’s Algorithm, developed by Robert Tomasulo at IBM for the IBM 360/91 Floating Point units in the 1960s.
Tomasulo’s algorithm provided a dynamic approach to scheduling that allowed floating-point operations to proceed out of order, effectively utilizing multiple functional units and tolerating the latency of floating-point computations. A key insight was using reservation stations attached to functional units and a Common Data Bus to communicate operands and results using tags, decoupling instruction execution from the strict program order and eliminating false dependencies.
While revolutionary, the original IBM 360/91 implementation lacked precise exceptions, a significant limitation for software reliability and debugging. Later research, particularly in the 1980s, demonstrated how to combine Tomasulo’s principles with precise exception handling using structures like the Reorder Buffer, paving the way for the ubiquitous adoption of OoO execution in modern processors.
Modern high-performance processors, from the Intel Pentium Pro and AMD K5 to subsequent designs such as the Alpha 21264, MIPS R10000, the IBM POWER series, AMD Zen, and Apple’s FireStorm cores, widely employ variations of Tomasulo’s algorithm and its associated concepts (renaming, reservation stations/issue queues, reorder buffers, precise exceptions).
The typical structure of a modern OoO pipeline involves decoding and renaming instructions, placing them in reservation stations (often organized by functional unit type or globally), dynamically scheduling them to functional units when ready, and finally, retiring them in program order via a reorder buffer to maintain precise state. This flow creates “two humps” or primary buffering stages in the pipeline: the reservation stations (scheduling window) and the reorder buffer (reordering/retirement window).
A high-level organization of an Out-of-Order processor illustrates these components: instruction fetch and decode, renaming logic, instruction buffers/reservation stations, functional units, and a reorder/commit unit interacting with the memory interface and register files.
Simulating Out-of-Order Dispatch
Let’s trace an example execution on a simplified Out-of-Order machine, focusing on the interaction between a Register Alias Table (RAT) for renaming and Reservation Stations (RS) for buffering and scheduling. For simplicity in this simulation, we omit the Reorder Buffer, although in a real processor, it would be essential for precise exceptions.
We will simulate a small sequence of MIPS-like instructions, assuming a limited number of reservation station entries for ADD and MUL units, and specific latencies for the functional units. The state of the RAT and RS will be tracked cycle by cycle.
The simulation begins with the RAT initialized, indicating all registers are valid and their values are the initial contents. The reservation stations are initially empty.
In Cycle 1, the first instruction (MUL R1, R2 → R3) is fetched.
In Cycle 2, the first instruction is decoded. It’s a MUL instruction, and assuming a MUL RS entry (say, ‘x’) is available, it’s allocated. The RAT is accessed for sources R1 and R2. R1 and R2 are valid (value=1, 2 respectively), so these values are copied into RS entry ‘x’. The destination register R3 is renamed to ‘x’ in the RAT; R3’s valid bit is cleared, and its tag is set to ‘x’. Since both sources for the instruction in RS ‘x’ are ready, this instruction is ready to execute in the next cycle.
In Cycle 3, the MUL instruction in RS ‘x’ begins execution in the MUL functional unit (assuming it was selected by the dispatch logic). It will take 6 cycles to complete. Concurrently, the next instruction (ADD R3, R4 → R5) is decoded. Assuming an ADD RS entry (‘a’) is available, it’s allocated. Source R3 is accessed in the RAT; it’s now invalid with tag ‘x’. This tag ‘x’ is copied into RS entry ‘a’ for source 1, indicating it’s waiting for the value from the instruction tagged ‘x’. Source R4 is valid (value=4), and its value is copied into RS entry ‘a’ for source 2. Destination R5 is renamed to ‘a’ in the RAT. The instruction in RS ‘a’ is not ready to execute because source 1 is waiting for tag ‘x’.
This process continues cycle by cycle, with new instructions being decoded, allocated into reservation stations, and renamed. Instructions in the RS monitor the broadcast tags. When an instruction finishes execution (out of order), it broadcasts its destination tag and value. Any instruction in an RS waiting for that tag captures the value and updates its readiness. Instructions become ready to execute once all their sources are valid (values captured). Ready instructions are then dispatched (selected and sent) to available functional units, potentially out of program order.
For example, in Cycle 8, the first MUL instruction completes its 6-cycle execution. It broadcasts its tag ‘x’ and value (2, from 1*2=2) on the CDB. The RAT, which has R3 tagged with ‘x’, captures the value 2 and marks R3 as valid. RS entry ‘a’, waiting for tag ‘x’, also captures the value 2 for its source 1, marking that source as valid. Now, the instruction in RS ‘a’ (ADD R3, R4 → R5) has both sources ready (source 1 value 2, source 2 value 4) and is ready to execute in the next cycle.
The simulation demonstrates how instructions become ready and execute out of their original program sequence. This is driven by the dynamic availability of operands, not the static program order. By tracking dependencies using tags and broadcasting results, the hardware effectively builds and executes a dynamic dataflow graph of the program fragment within the instruction window (reservation stations and reorder buffer).
After the simulation, the state of the RAT and RS reflects the result of decoding and renaming all instructions and the readiness status of their sources.
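The trace above can be reproduced with a compact simulator. This is a simplified sketch, following the lecture's model: decode/rename one instruction per cycle, broadcast tag and value in the completion cycle, and no Reorder Buffer (as in the simulation). The timing details and the tag names ('a', 'b', ...) are illustrative.

```python
# RAT + reservation-station simulation of the lecture's example:
#   MUL R1, R2 -> R3   (latency 6)
#   ADD R3, R4 -> R5   (latency 1)
LATENCY = {"MUL": 6, "ADD": 1}

def simulate(instrs, regs, max_cycles=20):
    """instrs: list of (op, src1, src2, dest). regs: initial register values.
    Returns the architectural register values after everything completes."""
    rat = {r: ("V", v) for r, v in regs.items()}  # ('V', value) or ('T', tag)
    rs = {}                                        # tag -> RS entry state
    next_tag = iter("abcdefgh")
    to_decode = list(instrs)
    executing = {}                                 # tag -> finish cycle
    for cycle in range(1, max_cycles + 1):
        # 1. Completion: broadcast tag + value on the CDB.
        for tag, fin in list(executing.items()):
            if cycle >= fin:
                e = rs.pop(tag)
                val = e["fn"](e["src"][0][1], e["src"][1][1])
                del executing[tag]
                # RAT and waiting RS entries capture the broadcast value.
                for r, (kind, v) in rat.items():
                    if kind == "T" and v == tag:
                        rat[r] = ("V", val)
                for other in rs.values():
                    other["src"] = [("V", val) if s == ("T", tag) else s
                                    for s in other["src"]]
        # 2. Dispatch: send ready, not-yet-executing RS entries to FUs.
        for tag, e in rs.items():
            if tag not in executing and all(k == "V" for k, _ in e["src"]):
                executing[tag] = cycle + e["lat"]
        # 3. Decode/rename one instruction per cycle.
        if to_decode:
            op, s1, s2, d = to_decode.pop(0)
            tag = next(next_tag)
            fn = (lambda a, b: a * b) if op == "MUL" else (lambda a, b: a + b)
            rs[tag] = {"src": [rat[s1], rat[s2]], "lat": LATENCY[op], "fn": fn}
            rat[d] = ("T", tag)                    # rename the destination
    return {r: v for r, (k, v) in rat.items() if k == "V"}

program = [("MUL", "R1", "R2", "R3"), ("ADD", "R3", "R4", "R5")]
final = simulate(program, {"R1": 1, "R2": 2, "R3": 3, "R4": 4, "R5": 5})
print(final["R3"], final["R5"])   # → 2 6
```

As in the trace, the ADD sits in its RS entry holding tag ‘x’ (here tag ‘a’) until the MUL's broadcast wakes it up; only then is it dispatched.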
The hardware complexity for tag broadcast and value capture is significant, requiring extensive wiring (the CDB) and comparison logic throughout the reservation stations and register maps to detect matching tags. The critical path can involve tag broadcast, value capture, and the wake-up/select logic.
Dataflow in Out-of-Order Execution
As the simulation illustrates, an out-of-order engine dynamically constructs and executes a portion of the program’s dataflow graph. Instructions wait in reservation stations until their data dependencies are met (operands become ready), then they execute. The instruction window (RS + ROB) represents the segment of the program’s dataflow graph that the hardware is currently managing.
The latency tolerance capability of an OoO processor is limited by the size of the instruction window. A larger window can buffer more instructions, allowing the processor to look further ahead in the instruction stream for independent work when stalled. However, increasing the window size adds significant hardware complexity and can lengthen critical paths (like the tag broadcast/wake-up/select logic).
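A rough back-of-the-envelope model makes the window-size limit concrete. The formula below is a toy model of my own construction, not from the lecture: it assumes some fraction of the windowed instructions are independent of the stalled operation and can be issued at the machine's issue width while the long-latency operation is in flight.

```python
# Toy model: how many cycles of a long-latency operation can be hidden by
# independent work buffered in the instruction window? (Assumed model:
# window_size * frac_independent instructions issue at issue_width per cycle.)
def hidden_cycles(window_size, frac_independent, issue_width, latency):
    independent = window_size * frac_independent
    return min(latency, independent / issue_width)

# A 192-entry window, half independent work, 4-wide issue, 300-cycle miss:
print(hidden_cycles(192, 0.5, 4, 300))   # → 24.0
```

Even under these optimistic assumptions, a 192-entry window hides only a small fraction of a 300-cycle memory miss, which is why window sizes keep growing despite their hardware cost.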
Refining the Out-of-Order Design: Centralized Value Storage
In the simulation, we observed that values were replicated in multiple locations: the RAT (initially) and the Reservation Stations; in a precise implementation, they would also be replicated in the Reorder Buffer.
Replicating values in multiple buffers is expensive in terms of area and power, especially as values are wide (e.g., 64 bits) and buffer sizes increase.
A common optimization in modern OoO processors is to centralize the storage of all speculative and non-speculative register values in a dedicated Physical Register File (PRF). The RATs (frontend and backend) and Reservation Stations store pointers or physical register IDs (tags) to the PRF, rather than storing the values themselves.
With a centralized PRF:
- The Frontend Register Map (or Rename Map, replacing the Frontend Register File/RAT from the simulation) stores the mapping from architectural register IDs to the physical register IDs in the PRF that hold the latest speculative values.
- The Architectural Register Map stores the mapping from architectural register IDs to the physical register IDs that hold the architecturally correct, retired values. This map is only updated upon instruction retirement.
- Reservation Station entries store physical register IDs for their source operands. When a value is broadcast, it is written to the corresponding physical register in the PRF, and reservation stations monitoring that physical register ID update their status.
- Functional units write their results directly to the allocated physical register in the PRF and broadcast the physical register ID (tag).
This organization minimizes value replication and centralizes value management. In this refined model, the pipeline stages involve:
- Decode/Rename: Allocate physical registers from a free list, update the frontend register map to point architectural registers to these physical registers, read source operand physical register IDs from the frontend map.
- Issue Queue/RS: Instructions buffer here, waiting for their source physical registers to become ready in the PRF.
- Execute: Instructions dispatched from RS read operands from the PRF (using physical register IDs) and execute.
- Completion: Results written to the allocated physical register in the PRF, and the physical register ID is broadcast as a tag.
- Retirement (via ROB): Oldest instruction updates the architectural register map to point architectural registers to the corresponding physical registers holding their results, and releases older, no-longer-needed physical registers back to the free list.
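The rename and retire steps above can be sketched with a free list and the two maps. This is a minimal illustration; the class and method names (`Renamer`, `rename`, `retire`) and the register file sizes are hypothetical, and the PRF here just stores plain integers.

```python
# Sketch of PRF-style renaming: values live only in the physical register
# file; the maps hold physical register IDs, not values.
class Renamer:
    def __init__(self, num_arch, num_phys):
        # Initially each architectural register Ri maps to physical Pi.
        self.frontend_map = {f"R{i}": i for i in range(num_arch)}
        self.arch_map = dict(self.frontend_map)      # retired state
        self.free_list = list(range(num_arch, num_phys))
        self.prf = [0] * num_phys                    # centralized values

    def rename(self, dest, srcs):
        """Decode/rename: read source physical IDs from the frontend map,
        allocate a fresh physical register for the destination."""
        src_ids = [self.frontend_map[s] for s in srcs]
        new_p = self.free_list.pop(0)
        old_p = self.frontend_map[dest]  # freed later, at retirement
        self.frontend_map[dest] = new_p
        return new_p, src_ids, old_p

    def retire(self, dest, new_p, old_p):
        """Retirement: commit the mapping to the architectural map and
        recycle the previous physical register."""
        self.arch_map[dest] = new_p
        self.free_list.append(old_p)

r = Renamer(num_arch=4, num_phys=8)
# Two writes to R1 get distinct physical registers: no WAW dependence.
p_a, _, old_a = r.rename("R1", ["R2", "R3"])
p_b, _, old_b = r.rename("R1", ["R0", "R2"])
assert p_a != p_b
```

Note that the old physical register cannot be freed at rename time: an older in-flight reader may still need it, which is why its release is deferred to the retirement of the overwriting instruction.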
Summary of Out-of-Order Execution Concepts
Out-of-Order Execution enables high performance by breaking the in-order dispatch constraint of simpler pipelines. The core principles include:
- Register Renaming: Eliminates false dependencies and links consumers to producers using tags (physical register IDs).
- Reservation Stations (Issue Queue): Buffers instructions waiting for operands to become ready, forming the dynamic scheduling window.
- Tag Broadcast & Value Capture: Communicates the readiness and value of produced results across the machine.
- Wakeup and Select Logic: Dynamically identifies ready instructions in the RS and dispatches them to functional units.
- Physical Register File: Centralizes value storage, accessed using physical register IDs.
- Reorder Buffer: Ensures in-order retirement and precise exceptions.
- Store Queue/Load Queue: Manages memory operations and their dependencies, including memory disambiguation.
These concepts are integrated into complex microarchitectures to maximize instruction-level parallelism and tolerate various latencies, albeit with significant hardware cost and design complexity.
Continue here: 15.5 Load-Store Handling in Out-of-Order Execution