Lecture from: 20.04.2023 | Video: YT

Building upon previous discussions of single-cycle and multi-cycle microarchitectures and the introduction to pipelining, this lecture focused on the critical challenge of maintaining precise architectural state, particularly in the context of exceptions and interrupts, within processors that may execute instructions out of their original program order.

Complications Arising from Multi-Cycle Execution

As previously discussed, processors employing multi-cycle or pipelined microarchitectures may utilize functional units with varying execution latencies.

This variability can lead to instructions completing their execution phases out of their original program order. For instance, a computationally intensive instruction like division, requiring multiple cycles in the Execute stage, might start before a simpler addition but finish long after the addition has completed.

The potential for instructions to complete and write their results to the architectural state (registers and memory) out of the program’s defined sequential order creates a fundamental challenge. While this out-of-order completion is key to improving performance by keeping hardware resources busy, it complicates the handling of unexpected events like exceptions and interrupts.

Exceptions and Interrupts: Unplanned Events

Exceptions and interrupts represent deviations from the normal instruction flow, necessitating immediate attention from the processor. They require suspending the currently executing program and transferring control to a designated handler routine.

Exceptions originate from issues encountered during the execution of a specific instruction within the program. Examples include arithmetic errors such as division by zero or overflow, attempts to execute invalid or undefined instructions, and memory access violations due to protection issues or page faults.

Interrupts, conversely, stem from events external to the current program, often related to I/O devices requiring service (like keyboard input or network data arrival), timer expirations, or hardware errors such as power failures.

Both exceptions and interrupts share common handling requirements: saving the current program’s state (including the Program Counter and registers), halting its execution, and initiating a transfer of control to the appropriate handler routine. In some cases, the program may be resumed after the event has been addressed.

Examining the nature of these events, exceptions are typically synchronous to the instruction stream and must be handled when detected (unless they arise from speculatively executed instructions that are later squashed). Interrupts, being asynchronous, can sometimes be handled with flexibility, although high-priority interrupts demand immediate service.

The Requirement for Precise Exceptions

The concept of precise exceptions is central to maintaining the integrity and predictability of program execution in complex microarchitectures. A precise exception requires that when the processor initiates an exception or interrupt handler, the architectural state observed by the handler is identical to what it would be if instructions were executed strictly in their original program order, and the exception occurred precisely after the offending instruction.

This means two conditions must be met:

  1. All instructions that precede the excepting instruction in the program’s original sequence must have completed execution and all their side effects (register updates, memory writes) must be fully reflected in the architectural state.
  2. No instruction that follows the excepting instruction in the program’s original sequence should have made any visible changes to the architectural state.

The act of an instruction permanently updating the architectural state is called retirement or commit. Ensuring precise exceptions mandates that instructions retire in program order.

A practical view of this requirement can be found in processor manuals, such as those for the x86-64 ISA.

The importance of precise exceptions cannot be overstated. It provides the programmer and the operating system with a consistent, sequential view of the program’s state at the point of an error. This is crucial for effective software debugging, enabling the programmer to understand exactly what happened and why, without being confused by out-of-order side effects. Furthermore, precise exceptions facilitate robust recovery and restartability of programs after an exception is handled and support the implementation of complex operations via software traps.

Ensuring precise exceptions is straightforward in simple microarchitectures like single-cycle and basic multi-cycle designs because instruction execution completes in strict program order.

In multi-cycle designs controlled by an FSM, specific states can be dedicated to handling exceptions, transitioning to the handler routine only after the offending instruction sequence is identified and all preceding instructions have completed.

However, in pipelined processors where instructions execute and complete out of order to maximize throughput, guaranteeing precise exceptions becomes significantly more complex.

Simply forcing all operations to take the same maximum latency to ensure in-order completion is an inefficient approach that sacrifices performance.

Therefore, specialized hardware mechanisms are required to manage the out-of-order execution and completion while preserving the in-order retirement necessary for precise exceptions.

Hardware Solutions for Precise Exceptions

Several hardware structures and techniques are employed in modern processors to support precise exceptions in the presence of out-of-order completion.

Reorder Buffer

The Reorder Buffer (ROB) is a fundamental hardware structure used to ensure in-order retirement and precise exceptions.

Core Idea

Instructions are allowed to complete their execution out of program order, but their results are temporarily stored in the ROB. The architectural state is only updated when an instruction is the oldest in the ROB and has finished without exceptions.

Mechanism

As instructions are fetched and decoded in program order, an entry is allocated for each in the ROB, maintaining the program sequence.

When an instruction finishes execution (potentially out of order), its result is written into its designated entry in the ROB.

Instructions retire from the ROB in program order. The processor continuously checks the oldest entry in the ROB. If that instruction has completed and did not cause an exception, its result is written to the architectural register file or memory, and the ROB entry is released. If the oldest instruction did cause an exception, the processor can discard all subsequent instructions in the ROB (as they haven’t updated architectural state yet) and initiate the exception handler, guaranteeing a precise state.
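The retirement logic described above can be sketched in a few lines. This is a minimal conceptual model, not a hardware description: entries are allocated in program order, completion may arrive in any order, and only the oldest completed entry may update architectural state (all class and method names here are hypothetical).

```python
from collections import deque

class ROBEntry:
    def __init__(self, dest):
        self.dest = dest        # architectural destination register
        self.value = None       # result, filled in at completion
        self.done = False       # has execution finished?
        self.exception = False  # did execution raise an exception?

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.arch_regs = {}     # architectural register file

    def allocate(self, dest):
        """Decode stage: allocate an entry in program order."""
        entry = ROBEntry(dest)
        self.entries.append(entry)
        return entry

    def complete(self, entry, value, exception=False):
        """Execute stage: record a result, possibly out of order."""
        entry.value = value
        entry.done = True
        entry.exception = exception

    def retire(self):
        """Commit completed instructions from the head, in order.
        On an exception, squash all younger instructions."""
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.exception:
                self.entries.clear()  # younger instrs never touched state
                return head           # hand off to the exception handler
            self.arch_regs[head.dest] = head.value  # architectural update
        return None

# A slow divide (allocated first) and a fast add (allocated second):
rob = ReorderBuffer()
div = rob.allocate("r1")
add = rob.allocate("r2")
rob.complete(add, 5)            # the add finishes first...
assert rob.retire() is None     # ...but cannot retire past the divide
rob.complete(div, 7)
rob.retire()
assert rob.arch_regs == {"r1": 7, "r2": 5}  # updates appear in program order
```

Note how the add, although finished, is held in the ROB until the older divide completes; if the divide had excepted instead, the add's result would simply be discarded.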

Accessing Data from the ROB

A challenge arises when an instruction needs an operand produced by an instruction that has completed and written its result to the ROB but hasn’t retired yet.

The dependent instruction needs to obtain this value from the correct ROB entry. Finding that entry requires matching the needed architectural register ID against the destination register ID fields of all ROB entries. This simultaneous search across multiple entries resembles a Content-Addressable Memory (CAM) lookup, which can be complex and slow if implemented naively.

Simplifying ROB access and register renaming

To avoid slow CAM lookups, processors use indirection. Each entry in the architectural register file is augmented to indicate whether its value is currently valid, or whether a pending (not yet retired) instruction will produce it. If pending, the entry instead stores a "tag" identifying where the result will appear (e.g., the ROB entry number). When an instruction needs an operand, it checks the register file: if the value is valid, it is read directly; if pending, the tag directly indexes the ROB (or another buffer) to retrieve the value. This eliminates the need for a full CAM search.

This mechanism forms the basis of Register Renaming. Architectural register names are mapped dynamically to physical storage locations (like ROB entries or a larger pool of physical registers). This mapping allows multiple instructions to use the same architectural register name without creating false WAR or WAW dependencies, because each instruction is allocated a unique physical storage location for its result. This increases parallelism and allows more instructions to be “in flight.”

Modern processors often use a Register Alias Table (RAT) to manage this mapping between architectural register names and physical storage locations.
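A minimal sketch of this tag-based indirection, assuming each register file entry holds either a valid value or a tag naming the ROB entry that will produce it (class and method names are hypothetical):

```python
class RenamedRegisterFile:
    def __init__(self):
        self.valid = {}   # reg -> True if the value below is current
        self.value = {}   # reg -> architectural value
        self.tag = {}     # reg -> ROB entry that will produce the value

    def rename_dest(self, reg, rob_index):
        """Decode: the newest writer of `reg` is ROB entry `rob_index`."""
        self.valid[reg] = False
        self.tag[reg] = rob_index

    def read_operand(self, reg, rob_results):
        """Return the value if valid, else follow the tag into the ROB."""
        if self.valid.get(reg, False):
            return self.value[reg]
        return rob_results.get(self.tag[reg])  # None if not yet produced

    def commit(self, reg, rob_index, val):
        """Retirement: update architectural state; revalidate only if no
        younger instruction has renamed `reg` in the meantime."""
        self.value[reg] = val
        if self.tag.get(reg) == rob_index:
            self.valid[reg] = True

rf = RenamedRegisterFile()
rf.value["r1"] = 3; rf.valid["r1"] = True
rf.rename_dest("r1", rob_index=0)       # a divide will overwrite r1
rob_results = {}                        # nothing produced yet
assert rf.read_operand("r1", rob_results) is None  # must wait for tag 0
rob_results[0] = 12                     # the divide completes
assert rf.read_operand("r1", rob_results) == 12    # tag-indexed, no CAM
rf.commit("r1", 0, 12)
assert rf.valid["r1"] and rf.value["r1"] == 12
```

The check in `commit` is where renaming avoids WAW hazards: if a younger instruction has already claimed `r1` with a newer tag, the older result updates the architectural value but does not mark the register valid for readers.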

ROB in the Pipeline

The ROB integrates into the pipeline flow. During the Decode stage, an instruction is allocated a ROB entry and its operands are obtained from the register file or forwarded from earlier pipeline stages or the ROB (using tags). Execution can then proceed potentially out of order. Upon completion, results are written to the ROB. Retirement from the ROB happens in order, updating the architectural state.

Trade-offs of the ROB

The ROB adds hardware complexity and overhead. Accessing values in the ROB can introduce latency, although techniques like tagging and indirection help. However, its conceptual simplicity for precise exceptions and its enablement of register renaming for performance are significant advantages.

Alternate Solutions

Other techniques exist, though they might be less common as the primary mechanism in modern high-performance processors compared to the ROB-centric approach.

History Buffer

Instead of buffering results before write-back, the History Buffer (HB) allows instructions to write results to the architectural register file immediately. However, the old value of the destination register is saved in the HB. On an exception, the old values are read back from the HB to restore the precise state.
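A sketch of History Buffer recovery, assuming instructions update the register file eagerly while a log records each overwritten value; on an exception, the log is replayed youngest-first to roll state back (function and variable names are hypothetical):

```python
def hb_recover(arch_regs, history, faulting_index):
    """Undo the writes of the faulting instruction and everything
    younger, walking the log from youngest to oldest."""
    for index, reg, old_value in reversed(history):
        if index >= faulting_index:
            arch_regs[reg] = old_value

regs = {"r1": 1, "r2": 2}
history = []
# Instruction 0 writes r1=10; instruction 1 writes r2=20, then excepts.
history.append((0, "r1", regs["r1"])); regs["r1"] = 10
history.append((1, "r2", regs["r2"])); regs["r2"] = 20
hb_recover(regs, history, faulting_index=1)
assert regs == {"r1": 10, "r2": 2}  # precise state after instruction 0
```

The trade-off relative to the ROB is visible here: normal operation is cheap (results go straight to the register file), but exception recovery is slow, since the undo log must be walked entry by entry.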

Future File + ROB

This approach uses two register files: a “Future File” (or frontend register file) that is updated speculatively for fast access to the latest values, and an “Architectural Register File” (or backend register file) updated only on instruction retirement for precise state recovery. The ROB is still used for managing in-order retirement.

Handling Memory State (Stores)

Maintaining precise state extends to memory as well. Store instructions, which write data to memory locations, must also appear to complete in program order for precise exceptions. This is challenging because stores, like other instructions, may complete their address and data generation out of order.

Challenge: If a younger store updates memory and an older instruction subsequently causes an exception, the memory update by the younger instruction must be undone, which is more complex than undoing a register write.

Solution: Processors typically use a Store Buffer (often integrated with the ROB) to hold the address and data for completed but not yet retired store instructions. Memory is only updated when a store instruction reaches the head of the ROB and retires. Load instructions that access memory might need to check the pending stores in the Store Buffer to ensure they get the most up-to-date value according to program order (known as store-load forwarding or memory disambiguation). This complex interaction between loads and pending stores is a key aspect of out-of-order execution memory handling.
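Store-to-load forwarding can be sketched as follows, assuming a simple list of pending (completed but not retired) stores ordered oldest to youngest; a load must take the youngest matching store, since that is the latest write in program order (function names are hypothetical):

```python
def load(addr, store_buffer, memory):
    """Check pending stores first, youngest match wins; else read memory."""
    for st_addr, st_data in reversed(store_buffer):
        if st_addr == addr:
            return st_data                 # value forwarded from the buffer
    return memory.get(addr, 0)

def retire_store(store_buffer, memory):
    """Only at retirement does the oldest store update memory."""
    addr, data = store_buffer.pop(0)
    memory[addr] = data

memory = {0x100: 1}
pending = [(0x100, 42), (0x200, 7)]        # completed, not yet retired
assert load(0x100, pending, memory) == 42  # forwarded, not the stale 1
assert load(0x300, pending, memory) == 0   # no match: read memory
retire_store(pending, memory)
assert memory[0x100] == 42                 # memory updated at retirement
```

If an older instruction excepts before the stores retire, the buffer is simply discarded and memory was never touched, which is exactly why stores are held back until retirement.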

Branch Mispredictions and State Recovery

Branch mispredictions, while micro-architectural events not visible to software, share similarities with exceptions in that they disrupt the instruction flow and necessitate state recovery. When a branch is mispredicted, instructions fetched from the wrong path must be flushed, and the pipeline must restart fetching from the correct target address.

The state recovery mechanisms discussed for exceptions (the Reorder Buffer, History Buffer, and Future File) can be adapted for branch misprediction recovery. However, branches are far more frequent than exceptions, demanding much lower state recovery latency.

This need for rapid recovery motivated the development of techniques specifically optimized for fast frontend state restoration, such as Checkpointing. Checkpointing involves saving a snapshot of the frontend state (like the Future File or register map) when a branch is decoded. If the branch is mispredicted, this checkpoint allows for a quick restoration of the state to what it was before the mispredicted branch, minimizing the penalty.
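Checkpointing the register map can be sketched as follows: a snapshot of the mapping table is taken when a branch is decoded, and a misprediction restores it in one step rather than unwinding instruction by instruction (all names here are hypothetical):

```python
rat = {"r1": "p3", "r2": "p7"}   # architectural -> physical mapping
checkpoints = {}

def on_branch_decode(branch_id, rat):
    """Snapshot the current register map at the branch."""
    checkpoints[branch_id] = dict(rat)

def on_mispredict(branch_id, rat):
    """Restore the pre-branch mapping wholesale."""
    rat.clear()
    rat.update(checkpoints[branch_id])

on_branch_decode("b1", rat)
rat["r1"] = "p9"                 # wrong-path instruction renames r1
on_mispredict("b1", rat)
assert rat == {"r1": "p3", "r2": "p7"}  # map restored in one step
```

Because wrong-path instructions only ever changed the mapping (and wrote physical registers no correct-path instruction maps to), restoring the map is sufficient to recover the frontend state quickly.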

Summary

In conclusion, enabling instructions to complete execution out of their original program order is vital for high-performance pipelined and out-of-order processors. However, this introduces challenges in maintaining a precise architectural state when exceptions or interrupts occur.

The primary hardware solutions to this challenge are mechanisms like the Reorder Buffer, History Buffer, Future File, and Checkpointing. These structures manage the instruction flow and state updates to ensure that despite out-of-order execution and completion, instructions retire in program order, allowing for precise exceptions and providing a consistent, predictable state for software.

Continue here: 15 Out-of-Order Execution