Lecture from: 28.04.2023 | Video: YT
Continuing from the previous lecture’s introduction to Branch Prediction, this session delved into more advanced techniques employed in modern high-performance processors to predict branch outcomes and manage control flow effectively.
The Challenge of Control Dependencies and the Need for Prediction
As discussed in preceding lectures, control dependencies, primarily branches (conditional, indirect, and returns), pose a significant challenge for pipeline efficiency. Unlike sequential instructions where the next Program Counter (PC) is simply PC + instruction size, the next PC after a control-flow instruction is determined by the branch’s type and outcome, which is often not resolved until late in the pipeline. This delay disrupts the flow of instructions to the pipeline front-end, leading to stalls and wasted instruction slots.
Stalling the pipeline until a branch’s outcome is known is highly detrimental to performance, especially in deep and wide superscalar pipelines where the misprediction penalty (measured in wasted instruction slots) is the product of pipeline depth and width.
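The cost is easy to sketch numerically. A back-of-the-envelope calculation in Python, with hypothetical pipeline parameters and branch statistics (none of these numbers come from the lecture):

```python
# Hypothetical parameters, for illustration only.
depth, width = 20, 4            # e.g., a 20-stage, 4-wide superscalar pipeline
penalty_slots = depth * width   # instruction slots flushed per misprediction
print(penalty_slots)            # 80

# Wasted slots per useful instruction, assuming one branch every 5 instructions.
branch_freq = 1 / 5
for accuracy in (0.90, 0.99):
    wasted_per_insn = branch_freq * (1 - accuracy) * penalty_slots
    print(f"accuracy {accuracy:.2f}: {wasted_per_insn:.2f} wasted slots/instruction")
```

Even at 99% accuracy the wasted work is non-negligible under these assumptions, which is why both high accuracy and a low penalty matter.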
Branch prediction addresses this by speculatively guessing the branch’s outcome and the next PC. This allows the processor to continue fetching instructions down the predicted path. If the prediction is correct, the speculation is successful. If incorrect, the wrong-path instructions must be flushed, incurring the misprediction penalty.
Effective branch prediction aims to maximize prediction accuracy and minimize the misprediction penalty. Modern processors employ sophisticated hardware predictors to achieve high accuracy, as even a small percentage of mispredictions can severely impact performance in deep and wide pipelines.
Basic Dynamic Branch Prediction Schemes
Prediction at the fetch stage requires predicting whether an instruction is a branch, its direction (if conditional), and its target address (if taken). The Branch Target Buffer (BTB) assists with the first and third, indicating if the PC corresponds to a previously seen branch and providing its likely target address. The key challenge then becomes predicting the direction of a conditional branch accurately.
Basic dynamic direction prediction schemes rely on the recent history of the branch itself:
Last-Time Predictor
Last-Time Predictor (1-bit): Predicts that a branch will take the same direction it took the last time it was executed. A single bit associated with the branch (often stored in the BTB entry) records the last outcome. While simple, this predictor performs poorly for branches that change direction frequently; a loop branch, for example, is mispredicted twice per loop execution: once on the exit iteration, and once again when the loop is next entered.
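The behavior is easy to simulate. A minimal sketch (the table organization and the example branch address are illustrative assumptions, not a real hardware design):

```python
class LastTimePredictor:
    def __init__(self):
        self.table = {}   # branch PC -> last outcome (True = taken)

    def predict(self, pc):
        return self.table.get(pc, False)   # unseen branches: predict not-taken

    def update(self, pc, taken):
        self.table[pc] = taken

# A loop branch taken 3 times, then not taken on exit, executed twice:
p = LastTimePredictor()
outcomes = [True, True, True, False] * 2
misses = 0
for taken in outcomes:
    misses += p.predict(0x40) != taken
    p.update(0x40, taken)
print(misses, "of", len(outcomes))  # 4 of 8: two mispredictions per pass
```

Despite the branch being taken 75% of the time, the predictor is right only half the time here.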
Two-Bit Predictor
Two-Bit Counter (2BC) Predictor (Bimodal): Uses a 2-bit saturating counter for each branch. The second bit adds hysteresis: a single opposing outcome moves the counter from a strong state to a weak state without changing the prediction, so two consecutive opposing outcomes are needed to flip a strongly held prediction. This improves accuracy for branches that are mostly biased in one direction but occasionally go the other way.
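A minimal sketch of the 2-bit scheme (the organization is an illustrative assumption; counters are initialized to weakly not-taken):

```python
class TwoBitPredictor:
    def __init__(self):
        self.ctr = {}   # branch PC -> saturating counter in [0, 3]

    def predict(self, pc):
        return self.ctr.get(pc, 1) >= 2    # states 2 and 3 predict taken

    def update(self, pc, taken):
        c = self.ctr.get(pc, 1)
        self.ctr[pc] = min(3, c + 1) if taken else max(0, c - 1)

# The loop pattern that costs a last-time predictor two mispredictions per
# pass now costs only one (on loop exit) once the counter has warmed up.
p = TwoBitPredictor()
misses = 0
for taken in [True, True, True, False] * 2:
    misses += p.predict(0x40) != taken
    p.update(0x40, taken)
print(misses)  # 3: one warm-up miss, then one miss per loop exit
```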
While 2BC predictors offer improved accuracy (typically 80-90%), this may still not be sufficient for the aggressive performance targets of modern processors, given the high misprediction penalty.
Two-Level Branch Prediction: Exploiting Correlation
More advanced dynamic predictors exploit correlation between branch outcomes. Two primary types of correlation are leveraged.
Global Branch Correlation
The outcome of a branch can be correlated with the outcomes of other recently executed branches in the program’s execution path. A Global History Register (GHR) tracks the outcomes (taken/not taken) of the most recent branches, forming a global history pattern. This GHR value is then used to index into a Pattern History Table (PHT), which contains 2-bit counters. The counter in the indexed PHT entry provides the prediction, reflecting the direction taken by branches last time this specific global history pattern was observed.
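A GHR-indexed PHT can be sketched as follows (history length, table size, and initialization are illustrative assumptions):

```python
HIST_BITS = 4

class GlobalPredictor:
    def __init__(self):
        self.ghr = 0                        # global history register
        self.pht = [1] * (1 << HIST_BITS)   # 2-bit counters, weakly not-taken

    def predict(self):
        return self.pht[self.ghr] >= 2

    def update(self, taken):
        c = self.pht[self.ghr]
        self.pht[self.ghr] = min(3, c + 1) if taken else max(0, c - 1)
        # shift the newest outcome into the history register
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

# A branch repeating the pattern T, T, N is repeatedly mispredicted by a
# per-branch counter, but each 4-bit history maps to a unique next outcome.
p = GlobalPredictor()
pattern = [True, True, False] * 10
misses = 0
for t in pattern:
    misses += p.predict() != t
    p.update(t)
print(misses, "mispredictions over", len(pattern))  # 4 mispredictions over 30
```

After a short warm-up, every history pattern's counter has been trained and the predictor becomes perfect on this sequence.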
Local Branch Correlation
The outcome of a branch can be correlated with the past outcomes of that specific branch (beyond just the last instance, as in the 1-bit or 2BC predictors). A Local History Register is associated with each branch (or a set of branches) to track its own recent history. This local history is then used to index into a PHT.
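A per-branch history register indexing a PHT might look like this sketch (sizes and the example pattern are assumptions; a real design would keep a table of local history registers indexed by PC bits rather than a dictionary):

```python
HIST_BITS = 4

class LocalPredictor:
    def __init__(self):
        self.hist = {}                      # branch PC -> local history bits
        self.pht = [1] * (1 << HIST_BITS)   # 2-bit counters, weakly not-taken

    def predict(self, pc):
        return self.pht[self.hist.get(pc, 0)] >= 2

    def update(self, pc, taken):
        h = self.hist.get(pc, 0)
        self.pht[h] = min(3, self.pht[h] + 1) if taken else max(0, self.pht[h] - 1)
        self.hist[pc] = ((h << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

# A branch not taken every 4th instance (T, T, T, N) defeats a plain 2BC
# predictor on the exit, but its own 4-bit history reveals the pattern.
p = LocalPredictor()
misses = []
for t in [True, True, True, False] * 4:
    misses.append(p.predict(0x40) != t)
    p.update(0x40, t)
print(sum(misses))  # 6 warm-up misses, then perfect
```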
Two-level predictors (Global or Local) generally achieve higher accuracy than simple history schemes by capturing more complex patterns and correlations in branch behavior. The PHT often uses the 2-bit counters for added hysteresis.
A taxonomy exists classifying two-level predictors based on whether the Branch History Register (BHR) is Global (G), per set of branches (S), or per branch (P), and whether the Pattern History Table (PHT) is Global (g), per set of branches (s), or per branch (p). Common configurations like GAg (Global GHR, Global PHT) and PAg (Per-address GHR, Global PHT) have been explored. The Gshare predictor is a popular variant that hashes the GHR with the branch PC to index the PHT, reducing aliasing and improving PHT utilization.
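The Gshare index computation can be sketched as follows (index width and the PC shift are illustrative choices):

```python
BITS = 10

class Gshare:
    def __init__(self):
        self.ghr = 0
        self.pht = [1] * (1 << BITS)   # 2-bit counters, weakly not-taken

    def _index(self, pc):
        # XOR history with PC bits so different branches sharing the same
        # global history tend to map to different PHT entries.
        return ((pc >> 2) ^ self.ghr) & ((1 << BITS) - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << BITS) - 1)

p = Gshare()
p.ghr = 0b1010101010
# Two nearby branches with identical global history get distinct entries,
# whereas a pure GHR index (GAg) would alias them to the same counter.
print(p._index(0x400400) != p._index(0x400404))  # True
```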
Hybrid and Advanced Predictors
Recognizing that different branches may exhibit different predictability characteristics (some benefit more from local history, others from global, some are highly biased), modern processors employ Hybrid or Tournament Predictors. These combine multiple prediction schemes and use a meta-predictor (or selector) to choose the prediction from the scheme deemed most likely to be correct for a given branch instance.
The Alpha 21264 Tournament Predictor is a well-known example, combining a global predictor, a local predictor, and a meta-predictor that chooses between them based on global history.
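A much-simplified tournament scheme in the spirit of (but not identical to) the 21264 design: a global component, a per-branch component, and a meta-predictor of 2-bit counters indexed by global history. All sizes and details here are illustrative assumptions.

```python
class Tournament:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.ghr = 0
        self.global_pht = [1] * (1 << bits)   # indexed by global history
        self.local_ctr = {}                   # per-branch 2-bit counter
        self.meta = [2] * (1 << bits)         # >= 2 means "trust global"

    def predict(self, pc):
        g = self.global_pht[self.ghr] >= 2
        l = self.local_ctr.get(pc, 1) >= 2
        return g if self.meta[self.ghr] >= 2 else l

    def update(self, pc, taken):
        g = self.global_pht[self.ghr] >= 2
        l = self.local_ctr.get(pc, 1) >= 2
        # Meta-predictor moves toward whichever component was correct;
        # when both agree there is nothing to learn.
        if g != l:
            m = self.meta[self.ghr]
            self.meta[self.ghr] = min(3, m + 1) if g == taken else max(0, m - 1)
        c = self.global_pht[self.ghr]
        self.global_pht[self.ghr] = min(3, c + 1) if taken else max(0, c - 1)
        c = self.local_ctr.get(pc, 1)
        self.local_ctr[pc] = min(3, c + 1) if taken else max(0, c - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

p = Tournament()
for _ in range(20):
    p.update(0x10, True)   # a strongly biased, always-taken branch
print(p.predict(0x10))     # True
```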
Further advancements include:
TAGE predictor
A state-of-the-art predictor that uses multiple tagged pattern history tables indexed with different (geometrically increasing) history lengths; the prediction is taken from the table whose tagged entry matches with the longest history. This is effective at capturing correlations of varying lengths.
Perceptron Predictors
Apply machine learning principles to learn correlations between history bits and branch outcomes using a perceptron (a simple neural network).
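The idea can be sketched as a per-branch weight vector dotted against the global history (encoded as +1/-1), trained on mispredictions or low-confidence outputs. The history length and training threshold below are illustrative, not the tuned values from the literature:

```python
class PerceptronPredictor:
    def __init__(self, hist=8, theta=18):
        self.hist = hist
        self.theta = theta            # training threshold (illustrative value)
        self.h = [-1] * hist          # global history as +1 (taken) / -1
        self.w = {}                   # branch PC -> [bias, w_1 .. w_hist]

    def _output(self, pc):
        w = self.w.setdefault(pc, [0] * (self.hist + 1))
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.h))

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # train on a misprediction, or while the output magnitude is small
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.w[pc]
            w[0] += t
            for i in range(self.hist):
                w[i + 1] += t * self.h[i]
        self.h = self.h[1:] + [t]

# An alternating branch (T, N, T, N, ...) defeats a 2-bit counter, but its
# outcome correlates perfectly with the most recent history bit.
p = PerceptronPredictor()
misses = []
for i in range(200):
    taken = (i % 2 == 0)
    misses.append(p.predict(0xA0) != taken)
    p.update(0xA0, taken)
print(sum(misses[-50:]))
```

After the weights converge, the last 50 predictions in this run are all correct.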
Biased Branch Filtering
Identifies highly biased branches (e.g., always taken or always not taken) and predicts them using simpler, dedicated mechanisms (like a static prediction or last-time predictor), preventing them from interfering with and polluting the tables used by more complex predictors for less biased branches.
Branch Confidence Estimation
Predicts how likely a specific branch prediction is to be correct. This confidence score can be used to guide speculative execution, for example, by reducing fetching or switching to alternative handling mechanisms (like multi-path execution) when confidence is low.
Other Methods
Beyond prediction, other methods for handling control dependencies were briefly noted or referenced:
Delayed Branching
Changes the ISA semantics so that the branch takes effect only after a fixed number of delay slots, which the compiler must fill with useful instructions (or NOPs).
Predicated Execution
Converts control dependencies to data dependencies, executing instructions conditionally based on predicate bits, eliminating branches.
Multi-Path Execution
Speculatively fetches and executes instructions from multiple possible paths after a branch.
While these alternatives exist, Branch Prediction, particularly dynamic and hybrid schemes, is the dominant approach in modern high-performance general-purpose processors due to its effectiveness in hiding latency with acceptable hardware complexity and minimal ISA impact.
Related challenges in wide fetch engines include instruction alignment within cache lines and “fetch breaks” caused by branches within the fetch block. Compiler techniques (like basic block reordering and superblocks) and hardware techniques (like trace caches) are used to mitigate these issues.
Continue here: 18 VLIW and Systolic Array Architectures