Lecture from: 27.04.2023 | Video: YT

This lecture focused on advanced microarchitectural techniques for extracting instruction-level parallelism, building upon prior discussions of pipelining and Out-of-Order (OoO) execution. Key topics included the relationship between OoO and dataflow, the principles of Superscalar execution, and, predominantly, the crucial role of Branch Prediction in maintaining pipeline efficiency.

Review: Out-of-Order Execution as Restricted Dataflow

The concept of Out-of-Order execution was reviewed as a microarchitectural implementation of dataflow principles. An OoO processor dynamically identifies instructions whose operands are ready, independent of their program order, and schedules them for execution. This dynamic scheduling is achieved through mechanisms such as register renaming for linking producers to consumers, reservation stations for buffering instructions, a broadcast mechanism for tracking operand readiness (tag and value broadcast), and logic for selecting ready instructions for dispatch to functional units.
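As a concrete illustration of the broadcast and select mechanisms, here is a minimal Python sketch (the entry layout, tag names, and selection policy are assumptions for illustration, not the lecture's design):

```python
# Minimal sketch of tag-and-value broadcast ("wakeup") in reservation
# stations. Each source operand is either ready (holds a value) or waits
# on the tag of its producer. All names and formats are illustrative.

class Source:
    def __init__(self, tag=None, value=None):
        self.tag = tag        # producer tag if not yet ready
        self.value = value    # captured value once ready

    @property
    def ready(self):
        return self.tag is None

def broadcast(reservation_stations, tag, value):
    """Producer 'tag' finished: wake up every matching source."""
    for entry in reservation_stations:
        for src in entry:
            if src.tag == tag:
                src.tag, src.value = None, value

def select_ready(reservation_stations):
    """Pick entries whose sources are all ready (dispatch candidates)."""
    return [e for e in reservation_stations if all(s.ready for s in e)]

rs = [
    [Source(value=3), Source(tag="T1")],   # waits on producer T1
    [Source(tag="T1"), Source(tag="T2")],  # waits on T1 and T2
]
broadcast(rs, "T1", 42)          # T1 completes and broadcasts 42
print(len(select_ready(rs)))     # 1 -- only the first entry is now ready
```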

This dynamic process effectively builds and executes a dataflow graph representing a portion of the program. This segment is limited by the size of the instruction window (the set of decoded instructions in the reorder buffer and reservation stations). The state of the hardware structures, such as the Register Alias Table (RAT) and reservation stations, at any given moment reflects this dynamic dataflow graph being processed by the machine. An example illustrating the reverse-engineering of a dataflow graph from a snapshot of the RAT and reservation stations was discussed in the previous lecture.
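To make the renaming mechanism concrete, here is a minimal Python sketch (the three-operand instruction format and register names are illustrative, not from the lecture). The RAT maps each architectural register to the tag of its most recent in-flight producer, so consumers are linked to producers and the dataflow graph falls out as a list of edges:

```python
# Minimal sketch of register renaming with a Register Alias Table (RAT).
# Hypothetical instruction format: (dest, src1, src2) architectural regs.
# Each dynamic instruction gets a fresh tag; the RAT maps each register
# to the tag of its latest producer, linking producers to consumers.

def rename(instructions):
    rat = {}          # architectural register -> tag of latest producer
    edges = []        # (producer_tag, consumer_tag) dataflow edges
    for tag, (dest, src1, src2) in enumerate(instructions):
        for src in (src1, src2):
            if src in rat:                     # produced by an earlier
                edges.append((rat[src], tag))  # in-flight instruction
        rat[dest] = tag   # this instruction is the latest producer of dest
    return edges

# Example: r3 = r1+r2; r4 = r3+r1; r3 = r4+r4 -- the edges reveal the
# dataflow graph that the OoO engine executes, regardless of fetch order.
print(rename([("r3", "r1", "r2"), ("r4", "r3", "r1"), ("r3", "r4", "r4")]))
# -> [(0, 1), (1, 2), (1, 2)]
```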

This microarchitectural approach differs from pure dataflow at the Instruction Set Architecture (ISA) level. ISA-level dataflow could potentially expose more parallelism by letting programmers represent programs explicitly as dataflow graphs, but it introduces significant complexities in state management, debugging, and exception handling. These are handled more gracefully by preserving the sequential semantics of the von Neumann model at the ISA level and implementing dataflow concepts underneath in the microarchitecture, as OoO execution does.

Despite the performance benefits, the complexity and hardware cost associated with OoO execution, particularly scaling the instruction window and managing memory dependencies (addressed in Lecture 15b), remain significant challenges.

Superscalar Execution: Adding Width to the Pipeline

Superscalar Execution is another fundamental technique for increasing instruction-level parallelism, distinct from OoO execution. It involves fetching, decoding, executing, and potentially retiring multiple instructions concurrently in the same clock cycle.

An N-wide superscalar processor can fetch and attempt to issue up to N instructions simultaneously. This necessitates the duplication of hardware resources, such as instruction fetch/decode units, register file ports, functional units (e.g., multiple ALUs, multipliers), and memory ports. The diagram of an in-order superscalar processor illustrates how various pipeline stages are widened to handle multiple instructions in parallel.

A critical challenge in superscalar processors, especially in-order designs, is dependence checking between instructions processed concurrently within the same cycle. In addition to the usual checks between instructions in different pipeline stages, a superscalar design must also check for dependences among the instructions sitting side-by-side in the same stage (e.g., all instructions decoded in the same cycle), and the number of such cross-checks grows with the machine width.

In an in-order superscalar processor, if dependencies exist within a group of concurrently fetched instructions, the later instructions in that group (in program order) cannot execute in the same cycle; they must be stalled or re-issued in subsequent cycles, as sketched below. An example of how such intra-group dependencies reduce the actual Instructions Per Cycle (IPC) below the ideal maximum was discussed.
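A minimal sketch of this same-cycle check for a hypothetical 2-wide in-order machine (the instruction format and the conservative stall policy are assumptions for illustration):

```python
# Sketch of same-cycle dependence checking in a 2-wide in-order pipeline.
# Assumed instruction format: (dest, src1, src2). If the second instruction
# of a fetch group reads or overwrites the first one's destination, it
# cannot issue in the same cycle (illustrative stall policy, not a real ISA).

def can_dual_issue(first, second):
    d1, _, _ = first
    d2, s1, s2 = second
    raw = d1 in (s1, s2)   # read-after-write within the group
    waw = d1 == d2         # write-after-write within the group
    return not (raw or waw)

# Independent pair: both issue this cycle (IPC = 2 for this group).
print(can_dual_issue(("r3", "r1", "r2"), ("r6", "r4", "r5")))  # True
# Dependent pair: the second must wait a cycle (IPC drops below 2).
print(can_dual_issue(("r3", "r1", "r2"), ("r4", "r3", "r1")))  # False
```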

Out-of-Order execution and Superscalar execution are orthogonal concepts, meaning a processor can be:

  • Scalar In-Order (basic pipeline)
  • Scalar Out-of-Order
  • Superscalar In-Order
  • Superscalar Out-of-Order

Modern high-performance processors are predominantly Superscalar Out-of-Order, combining the benefits of both techniques: Out-of-Order execution for latency tolerance and dynamic parallelism extraction, and Superscalar execution for increasing instruction throughput by processing multiple instructions concurrently.

The advantages of superscalar execution include a direct increase in instruction throughput (IPC). However, it adds significant hardware complexity and cost, potentially increasing critical path delays (and thus clock cycle time) due to the added logic for dependence checking and resource arbitration.

Historical examples show the evolution from scalar in-order to superscalar in-order (e.g., early Intel Pentium, Alpha 21164) and then to OoO superscalar (e.g., Intel Pentium Pro, Alpha 21264, Intel Pentium 4, AMD Zen, Apple M1). The complexity grew significantly with each step, but so did the potential for performance.

Branch Prediction: A Necessity for Performance

For high-performance pipelines, particularly those that are deep and wide (superscalar), maintaining a high instruction fetch rate is critical. This is severely challenged by control dependencies, especially branches, whose outcomes determine the subsequent instruction flow but are often resolved late in the pipeline.

Conditional branches, returns, and indirect branches pose significant problems because the address of the next instruction is not definitively known at the fetch stage. Conditional branches also introduce uncertainty in direction (taken or not taken).

Stalling the pipeline while waiting for a branch outcome is prohibitive in deep and wide pipelines, leading to substantial waste of valuable execution slots.

Branch Prediction is the primary mechanism to mitigate this performance loss. It involves speculatively guessing the branch outcome and the address of the next instruction to fetch, allowing the pipeline to continue processing.

If the prediction is correct, the speculation is successful. If incorrect, the processor must flush the speculatively fetched and executed instructions from the wrong path, incurring a misprediction penalty. This penalty is proportional to the pipeline depth (cycles to resolve) and width (instructions per cycle). In modern processors, this can amount to many wasted cycles.
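A back-of-the-envelope calculation (all numbers assumed for illustration) shows why this matters:

```python
# Rough cost of branch mispredictions (illustrative numbers only).
# Wasted fetch slots per misprediction ~ resolution depth x machine width.
depth = 20      # cycles from fetch to branch resolution (assumed)
width = 4       # instructions fetched per cycle (assumed)
print(depth * width)   # 80 instruction slots flushed per misprediction

# With 1 branch every 5 instructions and 95% prediction accuracy, the
# average penalty per instruction is still significant:
branch_freq, miss_rate = 0.2, 0.05
print(branch_freq * miss_rate * depth)  # ~0.2 extra cycles per instruction
```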

Improving branch prediction involves increasing the probability of correct guesses and/or reducing the penalty of wrong guesses. Most research and design effort focuses on the former. Predicting the next fetch address accurately requires predicting three things at the fetch stage: whether the instruction is a branch, its direction, and its target address (if taken).

Enhanced Branch Predictors

Target Address Prediction

For direct branches, the target address can be cached in a Branch Target Buffer (BTB), indexed by the branch’s PC. A hit in the BTB provides the predicted target address and signals that the instruction is a likely branch.
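A minimal sketch of a direct-mapped BTB, assuming a table indexed by low-order PC bits with the full PC stored as the tag (the entry count and indexing scheme are illustrative):

```python
# Minimal sketch of a direct-mapped Branch Target Buffer (illustrative
# layout: a small table indexed by low-order PC bits, each entry storing
# a tag and a predicted target address).
BTB_ENTRIES = 1024

btb = {}  # index -> (tag, target); a dict stands in for an SRAM array

def btb_update(pc, target):
    index = (pc >> 2) % BTB_ENTRIES    # drop byte offset, take low bits
    btb[index] = (pc, target)          # full PC as tag, for simplicity

def btb_lookup(pc):
    index = (pc >> 2) % BTB_ENTRIES
    entry = btb.get(index)
    if entry and entry[0] == pc:       # hit: PC is a known branch
        return entry[1]                # predicted target address
    return None                        # miss: fetch falls through to pc + 4

btb_update(0x4000, 0x5200)             # learn a taken branch's target
print(hex(btb_lookup(0x4000) or 0))    # 0x5200
```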

Predicting targets for indirect branches (like function returns or jump tables) is more complex, often involving structures like a Return Address Stack (RAS) or history-based target predictors.
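For returns specifically, a RAS exploits the nesting of calls and returns. A small sketch, with the stack depth and overflow policy assumed for illustration:

```python
# Sketch of a Return Address Stack (RAS) for predicting return targets.
# A call pushes its fall-through address; the matching return pops it.
# The fixed depth and overflow policy (drop oldest) are assumptions.
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, pc, instr_size=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)               # overflow: discard oldest
        self.stack.append(pc + instr_size)  # address right after the call

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1000)                 # call at 0x1000
ras.on_call(0x2000)                 # nested call at 0x2000
print(hex(ras.predict_return()))    # 0x2004 (innermost return first)
print(hex(ras.predict_return()))    # 0x1004
```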

Branch Direction Prediction

Various schemes aim to predict the direction of conditional branches:

Static Predictors

Static predictors rely on compile-time information or code properties (e.g., “always taken,” or “backward taken, forward not taken” (BTFN), which exploits the fact that backward branches are usually loop back-edges). Profile-based prediction uses profile runs to encode a likely direction for each branch. Accuracy is limited by the program’s dynamic behavior and by how representative the profile inputs are.
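For instance, the BTFN heuristic needs only the sign of the branch displacement; a minimal sketch:

```python
# Sketch of the static "backward taken, forward not taken" (BTFN) rule:
# backward branches usually close loops, so predict them taken.
def btfn_predict(branch_pc, target_pc):
    return target_pc < branch_pc     # True = predict taken

print(btfn_predict(0x1040, 0x1000))  # loop back-edge -> taken
print(btfn_predict(0x1040, 0x1080))  # forward skip   -> not taken
```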

We’ll continue with other types of predictors next time…

Continue here: 17 Advanced Branch Prediction