Lecture from: 01.06.2023 | Video: YT
Advanced Prefetching
This lecture continues the discussion on prefetching, moving from basic stride prefetchers to more sophisticated and adaptive prefetching mechanisms.
Recall: Prefetching - The Four Key Questions
To recap, effective prefetching design hinges on addressing:
- What to prefetch (address prediction).
- When to prefetch (timeliness).
- Where to place prefetched data and the prefetcher itself.
- How the prefetcher operates (software, hardware, execution-based, etc.).
We previously discussed stride prefetchers, which are good for simple, regular access patterns but struggle with complex or irregular patterns.
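As a point of reference before moving on, here is a minimal sketch of such a stride prefetcher in C. The PC-indexed table, its size, the saturating confidence counter, and the prefetch hook are illustrative assumptions, not a specific real design:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* One entry of a PC-indexed stride table (sizes/thresholds illustrative). */
typedef struct {
    uint64_t last_addr;   /* last demand address seen for this PC */
    int64_t  stride;      /* last observed stride */
    int      confidence;  /* saturating counter */
    bool     valid;
} StrideEntry;

#define TABLE_SIZE     256
#define CONF_THRESHOLD 2

static StrideEntry table[TABLE_SIZE];

/* Stand-in for handing a prefetch request to the memory hierarchy. */
static void issue_prefetch(uint64_t addr)
{
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Train on every demand access; prefetch one stride ahead once stable. */
static void on_demand_access(uint64_t pc, uint64_t addr)
{
    StrideEntry *e = &table[pc % TABLE_SIZE];

    if (e->valid) {
        int64_t s = (int64_t)(addr - e->last_addr);
        if (s != 0 && s == e->stride) {
            if (e->confidence < 3) e->confidence++;     /* saturate */
        } else {
            e->stride = s;                              /* retrain */
            e->confidence = 0;
        }
        if (e->confidence >= CONF_THRESHOLD)
            issue_prefetch(addr + (uint64_t)e->stride);
    }
    e->last_addr = addr;
    e->valid = true;
}

int main(void)
{
    for (uint64_t a = 0x1000; a < 0x1400; a += 64)  /* regular +64 B pattern */
        on_demand_access(0x400123, a);
    return 0;
}
```

Complex or irregular patterns are exactly those where no stable `stride` emerges, which is what motivates the more adaptive mechanisms below.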
Self-Optimizing Memory Prefetchers: Pythia
A more advanced approach involves prefetchers that can learn and adapt their behavior. Pythia is an example of a customizable hardware prefetching framework that uses online reinforcement learning (RL).
Basics of Reinforcement Learning (RL)
RL is an algorithmic approach where an agent learns to take an action in a given situation (state) to maximize a numerical reward.
The agent interacts with an environment, observes the current state, takes an action, and receives a reward (or penalty). Over time, the agent learns a policy (e.g., by storing Q-values for state-action pairs) to select actions that yield the highest long-term cumulative reward.
Pythia: Prefetching as an RL Problem
- Agent: The prefetcher.
- Environment: The processor and memory subsystem (including the running application).
- State: A vector of features derived from the current memory request and system status. Examples include:
- Control-flow features: Program Counter (PC) of the demand access, branch PC, history of recent PCs.
- Data-flow features: Cache line address, physical page number, delta (stride) between current and previous cache line addresses, history of recent deltas.
- Action: Given a demand access to address A, the prefetcher selects a prefetch offset O. The prefetch generated is for address A+O. A zero offset means no prefetch.
- Reward: A numerical value that defines Pythia’s objective. It encapsulates:
- Prefetch usefulness: Was the prefetch accurate? Was it timely (not too early, not too late)? Did it cross a page boundary unnecessarily?
- System-level feedback: How did the prefetch impact overall system metrics like memory bandwidth usage, cache pollution, or energy consumption?
Pythia’s Operation
Pythia uses a Q-Value Store to record Q-values for state-action pairs and an Evaluation Queue (a FIFO) to track recently taken prefetch actions and assign rewards.
- On a demand request, a state vector is formed.
- The Q-Value Store is queried with this state to find the action (prefetch offset) with the maximum Q-value.
- A prefetch is generated (if the offset is non-zero).
- The prefetch action and state are enqueued in the Evaluation Queue.
- When the prefetch completes or its usefulness is determined (e.g., the prefetched line is used or evicted), a reward is calculated.
- This reward is used to update the Q-value for the corresponding state-action pair in the Q-Value Store, refining the prefetcher’s policy.
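This update loop is essentially tabular Q-learning over (state, offset) pairs. Below is a minimal sketch under simplifying assumptions: a small hashed state space, a short fixed list of candidate offsets, and illustrative learning-rate/discount constants. It is not Pythia's actual feature encoding, offset list, or parameter setting:

```c
#include <stdio.h>

#define NUM_STATES  1024   /* hashed feature vectors (illustrative) */
#define NUM_OFFSETS 5      /* candidate prefetch offsets; 0 = no prefetch */

static const int offsets[NUM_OFFSETS] = { 0, 1, 2, 4, 8 };
static double Q[NUM_STATES][NUM_OFFSETS];      /* Q-value store */
static const double ALPHA = 0.1, GAMMA = 0.9;  /* learning rate, discount */

/* Pick the offset index with the highest Q-value for the current state. */
static int select_action(int state)
{
    int best = 0;
    for (int a = 1; a < NUM_OFFSETS; a++)
        if (Q[state][a] > Q[state][best]) best = a;
    return best;
}

/* Called when the reward for (state, action) becomes known, e.g., the
 * prefetched line was used in time, evicted unused, or wasted bandwidth. */
static void update(int state, int action, double reward, int next_state)
{
    double best_next = Q[next_state][0];
    for (int a = 1; a < NUM_OFFSETS; a++)
        if (Q[next_state][a] > best_next) best_next = Q[next_state][a];
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action]);
}

int main(void)
{
    int s = 42;                     /* hashed state for a demand access */
    int a = select_action(s);       /* chosen prefetch offset index */
    printf("prefetch offset: %d\n", offsets[a]);
    update(s, a, +1.0, 43);         /* accurate & timely -> positive reward */
    return 0;
}
```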
Pythia has been shown to provide consistently high performance across various core counts and DRAM bandwidth configurations, often outperforming prior state-of-the-art prefetchers, especially by adapting to system conditions such as available bandwidth. It is open-source, enabling further research and development.
Real Systems: Hybrid Hardware Prefetchers
Modern processors often employ multiple, different prefetching mechanisms simultaneously to cover a wider range of memory access patterns. For instance, a system might have a simple stream prefetcher alongside a more complex stride or pattern-based prefetcher.
- Pros: Better overall prefetch coverage and potentially better timeliness due to specialization.
- Cons:
- Increased complexity in design and interaction.
- Higher potential bandwidth consumption.
- Prefetchers can interfere with each other (e.g., by contending for resources or issuing conflicting prefetches), potentially leading to cache pollution. This necessitates mechanisms to manage and throttle accesses from each prefetcher.
Prefetching in Multi-Core Systems
Prefetching in multi-core environments introduces additional challenges:
- Prefetching Shared Data: Prefetching data shared between cores can trigger coherence misses if not handled carefully or if the prefetched data is invalidated by another core before use.
- Efficiency is Critical: Bus bandwidth and cache space are even more precious resources when shared among multiple cores. Inefficient prefetching (low accuracy) has a higher negative impact.
- Inter-Core Interference: Prefetches from one core can:
- Cause cache conflicts at various levels for other cores.
- Contend for shared bus/interconnect bandwidth.
- Create contention in DRAM (bank, rank, channel, row buffer).
- This often necessitates throttling prefetchers to prevent them from overwhelming the memory system or unfairly impacting other cores. Research explores prefetch-aware resource management and DRAM control.
Execution-Based Prefetching
Execution-based prefetching is a powerful paradigm that pre-executes parts of the program solely to generate prefetch requests.
- Idea: Instead of inferring patterns from past accesses, execute a “slice” of the program’s future code path speculatively.
- Speculative Thread: The pre-executed program piece can be considered a “thread.” Its purpose is not to commit architectural state but to touch memory locations that the main program will likely access, thereby pulling them into the cache.
- Execution Contexts: The speculative thread can run on:
- A separate processor/core.
- A separate hardware thread context (e.g., SMT).
- The same thread context as the main program, typically during idle cycles when the main program is stalled (e.g., on a long-latency cache miss). This is the basis of Runahead Execution.
Runahead Execution
Runahead Execution is a specific type of execution-based prefetching designed to mitigate the performance impact of long-latency cache misses, particularly those that cause the instruction window in an out-of-order processor to fill up and stall.
Long-latency cache misses are a primary cause of such “full-window stalls.” While larger instruction windows can tolerate more latency, they are complex and costly to build.
Runahead Mechanism
- Entry: When the oldest instruction in the window is a long-latency cache miss (e.g., L2 miss):
- The processor checkpoints the architectural state (registers).
- It enters “runahead mode” (a speculative pre-execution mode).
- In Runahead Mode:
- The processor speculatively pre-executes instructions down the predicted program path, without stalling for most long-latency operations.
- The primary goal is to generate prefetches for both data and instructions.
- Instructions dependent on the original long-latency miss (whose values are unknown) are marked as “invalid” (INV) and quickly removed from the window to make space for new instructions. Their results are not trusted and do not update architectural state.
- Exit: When the original long-latency miss (that triggered runahead mode) returns from memory:
- The processor restores the checkpointed architectural state.
- The pipeline is flushed.
- Normal (non-speculative) execution resumes from the instruction that caused the original miss.
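To make the INV mechanism concrete, here is a toy software model of a few runahead steps. The register/instruction structures and function names are invented for this sketch and do not correspond to a real microarchitecture; the point is only that loads with known addresses generate prefetches while INV-dependent instructions are dropped:

```c
#include <stdio.h>
#include <stdbool.h>

/* Registers carry an INV flag; instructions with INV sources only
 * propagate INV instead of producing results or stalling. */
typedef struct { long value; bool inv; } Reg;

typedef enum { LOAD, ADD } Op;
typedef struct { Op op; int dst, src; long imm; } Insn;

static Reg regs[8];

/* In runahead mode this is where a prefetch would be sent to memory. */
static void prefetch(long addr) { printf("prefetch 0x%lx\n", addr); }

static void runahead_step(Insn i)
{
    if (regs[i.src].inv) {            /* depends on the stalled miss */
        regs[i.dst].inv = true;       /* mark result INV, free the slot */
        return;
    }
    switch (i.op) {
    case LOAD:                        /* address known -> useful prefetch */
        prefetch(regs[i.src].value + i.imm);
        regs[i.dst].inv = true;       /* the data itself is not yet here */
        break;
    case ADD:
        regs[i.dst].value = regs[i.src].value + i.imm;
        regs[i.dst].inv = false;
        break;
    }
}

int main(void)
{
    regs[1].value = 0x1000;           /* base pointer, known */
    regs[2].inv = true;               /* destination of the stalled L2 miss */

    Insn future[] = {
        { ADD,  3, 1, 64 },           /* r3 = r1 + 64      -> valid        */
        { LOAD, 4, 3, 0  },           /* load [r3]         -> prefetch     */
        { LOAD, 5, 2, 8  },           /* load [r2 + 8]     -> INV, no      */
    };                                /*   prefetch (pointer chasing)      */
    for (int k = 0; k < 3; k++) runahead_step(future[k]);
    return 0;
}
```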
Benefits of Runahead Execution
- Effectively acts as a prefetcher by executing future code paths during stalls.
- Generates very accurate prefetches for both data (regular and irregular patterns) and instructions because it follows the program’s actual (predicted) control flow.
- Hardware prefetchers and branch predictors can be trained using the “future” access information seen during runahead mode.
- Relatively simple to implement, as it reuses much of the existing out-of-order execution core logic.
- No separate hardware context is needed for the pre-execution.
Disadvantages/Limitations
- Extra executed instructions: Speculative execution in runahead mode consumes energy.
- Branch prediction accuracy: Performance is limited by how accurately runahead mode can follow the correct program path.
- Dependent cache misses: Cannot prefetch data if the address calculation depends on a prior unresolved long-latency miss (e.g., pointer chasing: `load r2, [r1]; load r3, [r2]`).
- MLP and Prefetch Distance: Effectiveness is tied to how much memory-level parallelism can be exposed and how far ahead prefetches can be issued, which is limited by the duration of the original miss.
Runahead execution has been implemented in various commercial processors (e.g., Sun ROCK, IBM POWER6, NVIDIA Denver) and has demonstrated significant performance improvements.
Runahead Enhancements
Subsequent research has focused on addressing the limitations of the baseline runahead mechanism, such as:
- Energy inefficiency: Techniques to make runahead processing more efficient by reducing useless speculative work.
- Ineffectiveness for pointer-intensive applications: Address-Value Delta (AVD) prediction to predict the values of pointer-loading instructions, thereby breaking miss dependencies.
- Irresolvable branch mispredictions: Methods to detect and handle wrong-path execution during runahead mode more effectively.
The development of Runahead Execution in the early 2000s was a shift from the then-dominant focus on simply building larger and more complex instruction windows. It emphasized using the existing core in a smarter, speculative way primarily for prefetching during unavoidable stalls, showcasing the importance of addressing the memory latency problem creatively.
Virtual Memory
Virtual memory is a fundamental concept in modern computer systems that provides an abstraction layer between the addresses used by a program (virtual addresses) and the actual physical addresses in main memory.
Memory from the Programmer’s View
From a programmer’s perspective, memory often appears as a large, contiguous address space. They perform load and store operations to this space without needing to worry about the physical constraints of the memory system.
Ideally, this memory would have zero access time, infinite capacity, zero cost, infinite bandwidth, and zero energy consumption. Reality is far from this ideal.
Abstraction: Virtual vs. Physical Memory
- Programmer’s View: Sees virtual memory, often assumed to be “infinite” or at least very large (e.g., the full 64-bit address space).
- Reality: Physical memory (DRAM) is much smaller than the virtual address space a program can address.
- The System’s Role: A cooperative effort between system software (Operating System - OS) and hardware (Memory Management Unit - MMU) maps virtual memory addresses to physical memory addresses. This system automatically and transparently manages the physical memory space.
Benefits:
- +: Programmer Convenience: The programmer doesn’t need to know or manage the physical size of memory; a small physical memory can appear vast to each program, making the programmer’s life easier.
- -: System Complexity: Requires more complex system software and hardware architecture.
This is a classic programmer/architect tradeoff. It necessitates indirection and mapping between virtual and physical address spaces.
Benefits of Automatic Management of Memory (via Virtual Memory)
- Programmer deals only with virtual addresses: Simplifies programming.
- Per-Process Private Address Space: Each process has its own independent virtual address space. The same virtual address in two different processes can map to different physical addresses (or the same if memory is shared).
- Enables:
- Relocation: Code and data can be loaded anywhere in physical memory because the mapping is flexible. The OS can move pages around without the program’s knowledge.
- Protection and Isolation: Code and data of different processes can be kept separate in physical memory. One process cannot easily (or at all, without permission) access another process’s memory.
- Sharing: Allows controlled sharing of code (e.g., libraries) and data between multiple processes by mapping different virtual pages from different processes to the same physical page.
A System with Only Physical Memory
Early systems (some supercomputers, early PCs, many older embedded systems) operated with only physical addresses. The CPU’s load/store instructions directly generated physical memory addresses.
This approach faces several difficulties:
- Programmers must manage limited physical memory space.
- Code/data relocation is hard.
- Supporting multiple concurrent processes is problematic (protection, isolation, memory allocation conflicts).
- Sharing data/code across processes is cumbersome.
- The ISA’s address space cannot exceed the physical memory size without complex programmer-managed overlays.
Virtual Memory: The Core Idea
To overcome these difficulties, virtual memory gives each program the illusion of a large, private address space while using a smaller amount of physical memory. The programmer doesn’t need to worry about managing physical memory, either within their own process or in relation to other processes. Hardware and software cooperatively and automatically manage the physical memory space to provide this illusion for each independent process.
Basic Mechanism of Virtual Memory
- Indirection and Mapping: At the heart of virtual memory.
- Virtual Address: Addresses generated by instructions in a program are virtual addresses. These are not the addresses directly used to access main memory. (x86 calls this “linear address”).
- Physical Address: An address translation mechanism maps the virtual address to a physical address, which is then used to access main memory. (x86 calls this “real address”).
- This translation is a cooperative hardware/software effort.
The lecture’s diagram shows two processes, each with its own large virtual address space (e.g., 256 TB). Portions of these virtual spaces (called “virtual pages”) are mapped to “physical pages” (or “frames”) in the much smaller physical memory (e.g., 8 GB). Unmapped or evicted pages reside on a backing store such as a disk.
Page-Based Virtual Memory System
- Address Translation: The hardware (MMU) converts virtual addresses into physical addresses. This usually involves an OS-managed lookup table called the Page Table.
- Virtual Pages and Physical Frames:
- The virtual address space is divided into fixed-size blocks called pages.
- The physical address space is divided into fixed-size blocks called frames (typically the same size as pages).
- Mapping: A virtual page can be mapped to:
- A physical frame, if the page currently resides in physical memory.
- A location on disk (or other backing store), if the page is not in physical memory.
- Page Fault: If an accessed virtual page is not in physical memory (its mapping indicates it’s on disk or invalid):
- An exception (a page fault) is triggered.
- The OS handles the page fault.
- Demand Paging: The OS brings the required page from disk into an available physical frame.
- The page table is updated to reflect the new mapping.
- The faulting instruction is typically restarted.
- Page Table: A data structure (usually in memory, managed by the OS) that stores the mappings of virtual pages to physical frames.
Physical Memory as a Cache
In essence, physical memory acts as a cache for pages stored on disk.
- It’s often a fully-associative cache in the sense that any virtual page can potentially be mapped to any physical frame.
- Many standard caching issues apply:
- Placement: How to find/place a page in physical memory (mapping).
- Replacement: If physical memory is full, which page (physical frame) to evict to make room for a new page from disk.
- Granularity: Page size (analogous to block size).
- Write Policy: Typically write-back for pages (dirty bit in page table entry indicates if a page in memory has been modified and needs to be written to disk upon eviction).
Analogies between Cache and Virtual Memory
| Cache Term | Virtual Memory Term |
|---|---|
| Block | Page |
| Block Size | Page Size |
| Block Offset | Page Offset |
| Miss | Page Fault |
| Index (for set) | Virtual Page Number (VPN) |
| Metadata (Tag Store) | Page Table |
| Data Store | Physical Memory |
Address Translation Details
A virtual address is typically split into two parts:
- Virtual Page Number (VPN): The higher-order bits, used to index the page table.
- Page Offset: The lower-order bits, indicating the byte offset within the page. This offset is the same for both virtual and physical addresses and is not translated.
The translation process converts the VPN into a Physical Page Number (PPN) or Physical Frame Number (PFN). The physical address is then formed by concatenating the PPN with the original Page Offset.
Example
- Virtual memory size: 2 GB = $2^{31}$ bytes
- Physical memory size: 128 MB = $2^{27}$ bytes
- Page size: 4 KB = $2^{12}$ bytes
- Virtual address: 31 bits
- Physical address: 27 bits
- Page offset: 12 bits
- VPN bits: $31 - 12 = 19$ bits (so $2^{19}$ virtual pages)
- PPN bits: $27 - 12 = 15$ bits (so $2^{15}$ physical frames)
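Applying these splits in code, a small sketch (the identity-style `lookup_ppn` stub is a hypothetical stand-in for the page-table lookup described in the next sections):

```c
#include <stdio.h>
#include <stdint.h>

/* Bit widths from the example above: 31-bit VA, 27-bit PA, 4 KB pages. */
#define OFFSET_BITS 12
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

/* Stub so the snippet runs on its own; a real system walks the page table. */
static uint32_t lookup_ppn(uint32_t vpn) { return vpn & 0x7FFF; }

static uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> OFFSET_BITS;       /* upper 19 bits             */
    uint32_t offset = va &  OFFSET_MASK;       /* lower 12 bits, untranslated */
    uint32_t ppn    = lookup_ppn(vpn);         /* 15-bit frame number       */
    return (ppn << OFFSET_BITS) | offset;      /* concatenate PPN | offset  */
}

int main(void)
{
    uint32_t va = 0x12345678u & 0x7FFFFFFFu;   /* some 31-bit virtual address */
    printf("VA 0x%08x -> PA 0x%07x\n", va, translate(va));
    return 0;
}
```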
The Page Table
- Contains an entry for each virtual page in a process’s address space.
- Each Page Table Entry (PTE) contains at least:
- Valid bit (V): Indicates if the mapping is valid and the page is in physical memory.
- Physical Page Number (PPN/PFN): The physical frame where the virtual page is located (if V=1).
- Other bits:
- Replacement policy information (e.g., reference bit for CLOCK algorithm).
- Dirty/Modified bit (for write-back).
- Permission/Access control bits (Read, Write, Execute, Supervisor/User).
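One way to picture such an entry is the following C bit-field, assuming a 32-bit PTE with a 20-bit PPN; real formats (e.g., x86-64) use different field positions and widths:

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative 32-bit page-table entry; the layout is an assumption. */
typedef struct {
    uint32_t valid      : 1;   /* page resident in physical memory */
    uint32_t dirty      : 1;   /* modified since load (write-back to disk) */
    uint32_t referenced : 1;   /* used by replacement, e.g., CLOCK */
    uint32_t read       : 1;   /* permission / access-control bits */
    uint32_t write      : 1;
    uint32_t execute    : 1;
    uint32_t user       : 1;   /* user-mode (vs. supervisor) access allowed */
    uint32_t ppn        : 20;  /* physical page (frame) number */
    uint32_t reserved   : 5;
} PTE;

int main(void)
{
    PTE pte = { .valid = 1, .write = 1, .ppn = 0x2A };
    printf("sizeof(PTE) = %zu bytes, PPN = 0x%x\n", sizeof(PTE), pte.ppn);
    return 0;
}
```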
Translation Process using a Page Table
- The CPU generates a virtual address.
- The VPN is extracted.
- The Page Table Base Register (PTBR) (a special CPU register, loaded by the OS for the current process) holds the physical base address of the current process’s page table.
- The address of the PTE is calculated: `PTE_address = PTBR + (VPN * PTE_size)`.
- The MMU (or a hardware page walker) reads the PTE from this memory location.
- If PTE.Valid is 1:
- The PPN is extracted from the PTE.
- The physical address is formed: `PA = (PTE.PPN << PageOffsetBits) | PageOffset`.
- Access control bits in the PTE are checked. If the access is permitted, the memory access proceeds. If not, a protection fault (exception) occurs.
- If PTE.Valid is 0:
- A page fault exception occurs. The OS handles it (demand paging).
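A self-contained sketch of this walk, with a small word-addressed array standing in for physical memory and a one-word PTE (a valid bit plus the PPN); permission checks and the OS fault handler are reduced to a print:

```c
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 12
#define PTE_VALID   (1u << 31)

static uint32_t dram[1 << 20];     /* simulated DRAM, word-addressed */
static uint32_t PTBR = 0x100;      /* page table base (word index), set by "OS" */

static uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> OFFSET_BITS;
    uint32_t offset = va & ((1u << OFFSET_BITS) - 1);

    /* PTE_address = PTBR + VPN * PTE_size (PTE_size = 1 word here). */
    uint32_t pte = dram[PTBR + vpn];
    if (!(pte & PTE_VALID)) {
        printf("page fault for VPN %u\n", vpn);  /* OS would do demand paging */
        return 0;
    }
    uint32_t ppn = pte & 0x7FFFu;                /* permission checks omitted */
    return (ppn << OFFSET_BITS) | offset;
}

int main(void)
{
    dram[PTBR + 5] = PTE_VALID | 0x2A;           /* map VPN 5 -> PPN 0x2A */
    printf("PA = 0x%x\n", translate((5u << OFFSET_BITS) | 0x123));
    printf("PA = 0x%x\n", translate(7u << OFFSET_BITS));  /* unmapped -> fault */
    return 0;
}
```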
Issue: Page Table Size
A single-level page table for a large virtual address space (e.g., 64-bit) can be enormous.
- Example: 64-bit VA, 4 KB pages ($2^{12}$-byte offset, i.e., 12 offset bits). VPN is $64 - 12 = 52$ bits.
- Number of PTEs = $2^{52}$.
- If each PTE is 4 bytes, total page table size = $2^{52} \times 4 = 2^{54}$ bytes (16 PB). This is per process and clearly too large to store entirely in physical memory.
Multi-Level (Hierarchical) Page Tables
To manage the large size of page tables:
- Idea: Organize the page table hierarchically. Only the top-level page table (and actively used parts of lower-level tables) need to be in physical memory.
- The VPN is split into multiple parts, each part indexing a different level of the page table.
- The first-level page table entry points to the base of a second-level page table. The second-level entry points to a third, and so on, until the final level points to the PPN.
- Benefit: Portions of the page table hierarchy corresponding to unused regions of the virtual address space do not need to be allocated or resident in memory.
- Drawback: Address translation now requires multiple memory accesses (one for each page table level) if the intermediate PTEs are not cached. For an N-level page table, N memory accesses are needed to find the final PTE before the actual data access.
Modern systems like x86-64 use multi-level paging (e.g., 4 or 5 levels).
The base of the top-level page table is typically pointed to by a control register (e.g., CR3 in x86).
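For concreteness, the following snippet splits a 48-bit x86-64 virtual address into its four 9-bit table indices plus the 12-bit page offset (the bit positions follow the standard x86-64 4-level layout; the walk itself is only described in the comment):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t va = 0x00007f3a12345678ULL;        /* 48-bit canonical address */

    unsigned offset = va & 0xFFF;               /* bits 11..0  */
    unsigned pt     = (va >> 12) & 0x1FF;       /* bits 20..12 */
    unsigned pd     = (va >> 21) & 0x1FF;       /* bits 29..21 */
    unsigned pdpt   = (va >> 30) & 0x1FF;       /* bits 38..30 */
    unsigned pml4   = (va >> 39) & 0x1FF;       /* bits 47..39 */

    /* A hardware walk reads one entry per level: CR3 -> PML4[pml4] ->
     * PDPT[pdpt] -> PD[pd] -> PT[pt] -> PPN, then appends the offset.
     * 5-level paging adds one more 9-bit index on top. */
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, offset);
    return 0;
}
```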
This concludes the material covered up to slide 46 of Lecture 25b. The lecture would continue discussing challenges in page table management (like the performance overhead of multiple accesses) and solutions like Translation Lookaside Buffers (TLBs), and further delve into memory protection mechanisms.
Continue here: 26 Virtual Memory, Multi-Table-VM, Translation-Lookaside-Buffer, Memory Protection