Lecture from: 26.05.2023 | Video: YT

This lecture continues the discussion on advanced caching concepts, focusing on miss latency/cost and Memory Level Parallelism (MLP), before transitioning to the topic of prefetching as a technique to mitigate memory latency.

Recall: Miss Latency/Cost & Memory Level Parallelism (MLP)

We previously discussed that not all cache misses are equal in their impact on performance. The cost of a miss is influenced by:

  1. Service Location: Where the data is fetched from (e.g., L2 cache, L3 cache, local DRAM, remote memory across a network). Each source has a different latency. DRAM accesses themselves can have varying latencies (row hit vs. row conflict).
  2. Processor Stall: How much the miss actually stalls the processor. This depends on:
    • Overlap with other operations: Modern out-of-order processors can often continue executing other independent instructions or service other memory requests while a miss is pending. This is known as Memory Level Parallelism (MLP).
    • Data immediacy: Is the requested data on the critical path of execution?
    • Eviction cost: Does servicing this miss evict a block that is more critical or more expensive to re-fetch later?

Memory Level Parallelism (MLP) refers to the ability of a system to generate and service multiple memory accesses concurrently.

  • An isolated miss (say, a single outstanding miss to block A) fully exposes its latency as stall time if the processor is waiting for that data.
  • Parallel misses (say, concurrent misses to blocks B and C) have their latencies overlapped: the effective stall time is determined by the longest of the parallel misses, not by their sum.

Therefore, eliminating an isolated miss often provides a greater performance benefit than eliminating one of several parallel misses. Traditional replacement policies like LRU or even Belady’s OPT typically aim to minimize the miss count but don’t inherently consider the varying costs or MLP characteristics of misses.

Example: Fewest Misses ≠ Best Performance

Consider a sequence of memory accesses to two kinds of blocks: P-blocks, whose misses can be serviced in parallel, and S-blocks, whose misses occur in isolation.

  • Belady’s OPT: Minimizes the total number of misses (e.g., 4 misses). However, if these misses include isolated, high-cost S-block misses, the total stall time might be high (e.g., 4 units of stall).
  • MLP-Aware Policy: Aims to keep the high-cost S-blocks in the cache, even if it means incurring more misses on the lower-cost P-blocks. This might result in more total misses (e.g., 6 misses) but a lower overall stall time (e.g., 2 units of stall) because the P-block misses are parallelized.

This illustrates that optimizing solely for miss rate can be suboptimal for execution time. Cache management policies should ideally be cost-aware and MLP-aware.
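
As a rough illustration (a sketch with assumed numbers, mirroring the 4-vs-2 stall figures above), the following code recomputes the stall time under the simplification that parallel misses fully overlap while isolated misses serialize:

```c
#include <stdio.h>

#define MISS_STALL 1  /* stall "units" per exposed miss latency (assumed) */

/* Isolated misses expose their latencies one after another. */
static int isolated_stall(int num_misses) {
    return num_misses * MISS_STALL;
}

/* A group of overlapping misses stalls only as long as its longest member. */
static int parallel_stall(int num_misses) {
    return num_misses > 0 ? MISS_STALL : 0;
}

int main(void) {
    /* Belady-like outcome: 4 misses, all to isolated S-blocks. */
    int opt_stall = isolated_stall(4);
    /* MLP-aware outcome: 6 misses to P-blocks, overlapping as two groups of 3. */
    int mlp_stall = parallel_stall(3) + parallel_stall(3);
    printf("OPT-like:  4 misses, %d stall units\n", opt_stall);  /* 4 */
    printf("MLP-aware: 6 misses, %d stall units\n", mlp_stall);  /* 2 */
    return 0;
}
```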

Multi-Core Issues in Caching

In multi-core systems, cache design becomes even more critical due to shared resources like memory bandwidth and last-level caches.

  • Shared vs. Private Caches:
    • Private caches (e.g., per-core L1, L2) offer fast access but can lead to data replication and coherence complexities.
    • Shared caches (e.g., a common L3) allow dynamic space allocation, avoid data replication within that level, and simplify coherence for blocks within it, but can suffer from inter-core interference and slower access times.

  • Resource Sharing Challenges: Sharing resources (like caches or memory bandwidth) can improve overall utilization but also leads to contention, potentially degrading individual thread performance and making performance less predictable (loss of performance isolation).
  • The goal is to design shared resources that are efficiently utilized, controllable, and partitionable to provide both high throughput and quality of service (QoS).

Cache Coherence

When multiple processors have private caches, they might cache copies of the same memory block. If one processor modifies its copy, other caches (and main memory) become stale, leading to inconsistencies. This is the cache coherence problem.

Mechanisms are needed to ensure all processors see a consistent view of memory. Two main approaches:

  1. Snoopy (Broadcast-Based) Coherence: Caches monitor (snoop) a shared bus for memory operations. When a write to a shared block occurs, other caches holding that block either invalidate or update their copies.

    • Pros: Lower miss-handling latency (no indirection through a directory).
    • Cons: Relies on broadcasting over a shared bus, which limits scalability.
  2. Directory-Based Coherence: A logically centralized directory keeps track of which caches have copies of which blocks. Processors consult the directory for permission to read/write. The directory orchestrates invalidations/updates.

    • Pros: More scalable than snoopy protocols as it doesn’t rely on a single broadcast bus.
    • Cons: Higher latency due to indirection through the directory. The directory itself can be large and complex.

Coherence Protocols: These implement either an update or an invalidate policy when a processor writes to a shared block.

  • Update: Broadcast the new data to all sharing caches. Good if data is read frequently by others after a write. Can consume significant bandwidth.
  • Invalidate: Send invalidation messages to other caches. The writer then has exclusive access. More common, as it’s often more bandwidth-efficient if writes are frequent or data isn’t immediately re-read by others.
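
To make the invalidate policy concrete, here is a minimal sketch of an MSI-style state machine for one cache block as seen by one cache. Real protocols (MESI, MOESI, directory variants) add states and transient-race handling, so treat this as an illustration rather than any machine's actual protocol:

```c
/* Hedged sketch: MSI-style invalidation protocol for a single cache block. */
typedef enum { INVALID, SHARED, MODIFIED } State;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } Event;

State next_state(State s, Event e) {
    switch (s) {
    case INVALID:
        if (e == LOCAL_READ)  return SHARED;    /* fetch block; others may share   */
        if (e == LOCAL_WRITE) return MODIFIED;  /* fetch exclusive, invalidate others */
        return INVALID;
    case SHARED:
        if (e == LOCAL_WRITE)  return MODIFIED; /* send invalidations to sharers   */
        if (e == REMOTE_WRITE) return INVALID;  /* another core wants exclusivity  */
        return SHARED;                          /* reads keep the block shared     */
    case MODIFIED:
        if (e == REMOTE_READ)  return SHARED;   /* supply data, downgrade          */
        if (e == REMOTE_WRITE) return INVALID;  /* supply data, give up the copy   */
        return MODIFIED;                        /* local hits stay modified        */
    }
    return INVALID; /* unreachable */
}
```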

Prefetching

Prefetching is a technique to mitigate memory latency by fetching data into the cache before it is explicitly requested by the program.

Why Prefetch?

  • Memory latency is a major performance bottleneck.
  • If data can be prefetched accurately (the correct data is fetched) and early enough (data arrives before it’s needed), the latency of accessing that data can be significantly reduced or eliminated.
  • Prefetching can eliminate all types of cache misses (compulsory, capacity, and conflict), and even coherence misses, if done effectively.

Prefetching and Correctness

A crucial aspect of prefetching is that it is a performance optimization that generally does not affect program correctness.

  • If a prefetcher mispredicts and fetches useless data, that data merely occupies cache space (potentially causing cache pollution); since the program never consumes it, the architectural state is unaffected.
  • There’s no need for state recovery mechanisms like those required for branch misprediction or value misprediction. This freedom allows for more aggressive and speculative prefetching strategies.
  • However, prefetching can have security implications (e.g., side channels) if not carefully designed.

Prefetching Basics

  • Prefetching is usually done at cache block granularity.
  • It can reduce both miss rate (if data is in cache when needed) and miss latency (if data is already on its way when requested).
  • Prefetching can be initiated by:
    • Hardware: Dedicated logic monitors access patterns and issues prefetches.
    • Software:
      • Compiler: Inserts prefetch instructions into the code.
      • Programmer: Manually inserts prefetch hints or instructions.
    • System (OS): Can prefetch program pages or files.

The Four Key Questions in Prefetching

Designing a prefetcher involves answering four fundamental questions:

  1. What to prefetch? (Address prediction: Which addresses will be needed?)
  2. When to prefetch? (Timing: How early or late should prefetches be initiated for optimal timeliness?)
  3. Where to prefetch?
    • Where to place the prefetched data (e.g., L1, L2, L3 cache, or a dedicated prefetch buffer)?
    • Where to locate the prefetcher logic itself in the memory hierarchy?
  4. How to prefetch? (Mechanism: Is it software-driven, hardware-driven, execution-based, cooperative, hybrid?)

Challenge in Prefetching: What (Address Prediction)

  • Prefetching useless data wastes resources: memory bandwidth, cache/buffer space, and energy.
  • Accuracy is crucial: Defined as (Used Prefetches) / (Sent Prefetches).
  • Address prediction can be based on:
    • Past access patterns: Hardware prefetchers often learn from recent history.
    • Compiler/programmer knowledge: Software prefetching leverages static program structure.
  • The prefetching algorithm determines what to prefetch. Some predictable access patterns include:
    • Sequential/Streaming: A, A+1, A+2, … (Stride = 1)
    • Strided: A, A+N, A+2N, … (Constant stride N)
    • Complex Regular (e.g., Multi-Stride): Repeating sequence of different strides (e.g., A, A+2, A+5, A+9, A+11, A+14, A+18, … where strides are +2, +3, +4, repeating).
    • Correlated/Irregular: Patterns based on data structure traversals (e.g., linked lists, trees) or indirect accesses.
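
For intuition, the following sketch (hypothetical loops; names and types are made up) shows code whose load-address streams produce the first, second, and last of these patterns:

```c
struct node { int val; struct node *next; };

int patterns(const int *a, int n, int s, const struct node *head) {
    int sum = 0;
    for (int i = 0; i < n; i++)        /* sequential/streaming: unit stride          */
        sum += a[i];
    for (int i = 0; i < n; i += s)     /* strided: constant stride of s elements     */
        sum += a[i];
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->val;                 /* irregular: data-dependent pointer chasing  */
    return sum;
}
```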

Challenge in Prefetching: When (Timeliness)

  • Prefetching too early:
    • The prefetched data might be evicted from the cache/buffer before it’s used.
    • It occupies resources for a longer, potentially unnecessary, period.
  • Prefetching too late:
    • Might not hide the full memory latency; the CPU still waits.
  • Timeliness is key. Prefetches should arrive “just in time.”
  • Achieving timeliness:
    • Hardware: Make the prefetcher more aggressive (e.g., prefetch further ahead in a stream).
    • Software: Insert prefetch instructions earlier in the code, far enough ahead of the actual use.
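
A common back-of-the-envelope way to pick the software prefetch distance (all numbers here are assumptions for illustration) is to divide the memory latency by the work per loop iteration:

```c
/* Hedged sketch: how many iterations ahead to prefetch (assumed numbers). */
int prefetch_distance(void) {
    int memory_latency = 200;   /* cycles until a prefetch completes        */
    int work_per_iter  = 10;    /* cycles of useful work per loop iteration */
    /* ceil(200 / 10) = 20 iterations ahead of the current access           */
    return (memory_latency + work_per_iter - 1) / work_per_iter;
}
```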

Challenge in Prefetching: Where (Placement)

  1. Which level of cache to prefetch into?
    • Prefetching directly into L1 is aggressive and can pollute the small L1 with speculative data.
    • Prefetching into L2 or L3 is more common for hardware prefetchers; these levels are larger and act as a buffer.
    • Prefetchers can also exist between cache levels (e.g., an L2 prefetcher fetching data from L3 into L2).
  2. Where to place prefetched data within a cache set?
    • Should prefetched blocks be treated the same as demand-fetched blocks (blocks fetched due to an actual CPU request)?
    • Demand-fetched blocks are known to be needed. Prefetched blocks are speculative.
    • A common LRU policy places new blocks in the MRU position. Placing prefetched blocks there might prematurely evict useful demand-fetched data.
    • Alternative: Place prefetched blocks in the LRU position or another less privileged position to reduce cache pollution (see the sketch after this list).
  3. Where to place the hardware prefetcher logic?
    • Placing it closer to the CPU (e.g., observing L1 accesses) allows it to see a more complete access pattern (hits + misses), potentially improving accuracy and coverage. However, it processes more requests, increasing its own complexity and bandwidth demand.
    • Placing it further (e.g., observing L2 misses only) means it sees a filtered stream, which might simplify pattern detection for some patterns but miss others.
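
As a concrete (and deliberately simplified, assumed) version of the insertion-position idea from point 2:

```c
/* Hedged sketch: recency position given to a newly filled block in an
 * 8-way, LRU-managed set (position 7 = MRU, 0 = LRU; layout is assumed). */
int insertion_position(int is_prefetch) {
    /* Demand-fetched blocks are known to be needed: insert at MRU.
       Prefetched blocks are speculative: insert at (or near) LRU so a useless
       prefetch is evicted quickly instead of displacing demand-fetched data. */
    return is_prefetch ? 0 : 7;
}
```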

Challenge in Prefetching: How (Mechanism)

  • Software Prefetching:
    • The ISA provides explicit prefetch instructions (e.g., x86 PREFETCHh).
    • The programmer or compiler analyzes the code and inserts these instructions to hint to the hardware about future memory needs.
    • Works well for regular, predictable access patterns (e.g., array traversals); less effective for complex, data-dependent patterns. A sketch of this approach follows this list.
  • Hardware Prefetching:
    • Specialized hardware automatically monitors memory accesses (addresses, PCs).
    • It learns patterns (strides, correlations) and generates prefetch addresses.
    • Examples: Stream prefetchers, stride prefetchers.
  • Execution-Based Prefetching (e.g., Runahead Execution):
    • A separate “helper thread” or the main thread in a speculative mode executes ahead of the normal program execution.
    • The primary purpose of this speculative execution is to trigger memory accesses that act as prefetches for the main, non-speculative thread.
    • Can be very accurate as it follows the program’s control flow.
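
As an example of the software route, the sketch below uses the GCC/Clang builtin __builtin_prefetch, which compiles down to an ISA prefetch hint (e.g., an x86 PREFETCH instruction). The prefetch distance of 16 elements is an assumption that would need tuning for a real machine:

```c
/* Hedged sketch: software prefetching in a simple reduction loop. */
#define PREFETCH_DIST 16   /* elements ahead; tune to hide actual memory latency */

long sum_array(const int *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            /* args: address, 0 = read, 3 = keep in all cache levels */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

In practice one prefetch per cache block (rather than per element) suffices, and for a loop this simple the hardware prefetcher or the compiler may already cover the access stream.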

Prefetcher Performance Metrics

Beyond accuracy, coverage, and timeliness:

  • Bandwidth Consumption: How much extra memory bandwidth does the prefetcher use?
    • Good prefetchers can utilize otherwise idle bus bandwidth.
  • Cache Pollution: Do prefetched blocks evict more useful demand-fetched blocks? This leads to extra demand misses.
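
In counter form (a hedged sketch; the counter names are assumptions), accuracy and coverage are typically computed as:

```c
/* Hedged sketch: standard prefetcher metrics from per-run counters. */
double accuracy(long useful_prefetches, long prefetches_sent) {
    /* Fraction of issued prefetches that were actually used by demand accesses. */
    return prefetches_sent ? (double)useful_prefetches / prefetches_sent : 0.0;
}

double coverage(long useful_prefetches, long misses_without_prefetching) {
    /* Fraction of the baseline misses that prefetching eliminated. */
    return misses_without_prefetching
         ? (double)useful_prefetches / misses_without_prefetching : 0.0;
}
```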

Effective prefetching can improve performance and even reduce the need for larger cache hardware to achieve a target performance level, as shown by studies like those on Sun ROCK’s “Scout” prefetching (a form of runahead execution).

Example Hardware Prefetcher: Stride Prefetcher

A common and effective type of hardware prefetcher.

  • Concept: Detects accesses with a constant stride (e.g., A, A+N, A+2N, …).
  • Idea: Record the stride between consecutive memory accesses. If the stride is stable (repeats consistently), use it to predict and prefetch the next M memory accesses (A+3N, A+4N, …, A+(2+M)N).
  • Two main types:
    1. Instruction-Based Stride Prefetching:
      • Tracks strides on a per-instruction (per-PC) basis. Each load/store instruction can have its own stride pattern.
      • A table stores (PC_tag, Last_Address_Referenced, Last_Stride, Confidence).
      • Timeliness can be an issue; the prefetcher often needs to run several strides ahead of the demand stream (see the sketch after this list).
    2. Memory-Region-Based Stride Prefetching:
      • Tracks strides based on memory regions, not individual instructions. Can detect strides arising from accesses by multiple different instructions to the same region.
      • A table stores (Address_tag_of_region, Stride, Control/Confidence).
      • Stream prefetching (fetching consecutive blocks, N=1) is a special case.
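
Here is a hedged sketch of the per-PC table logic from type 1 above (table size, confidence policy, and prefetch degree are assumptions, not a specific real design):

```c
#include <stdint.h>

#define TABLE_SIZE 256
#define CONF_MAX   3   /* saturating confidence counter                   */
#define DEGREE     4   /* prefetch this many strides ahead once confident */

typedef struct {
    uint64_t pc_tag;        /* PC of the load/store instruction */
    uint64_t last_addr;     /* last address it referenced       */
    int64_t  last_stride;   /* last observed stride             */
    int      confidence;    /* how stable that stride has been  */
} StrideEntry;

static StrideEntry stride_table[TABLE_SIZE];

/* Train on every load and, when confident, issue prefetches via a callback. */
void train_and_prefetch(uint64_t pc, uint64_t addr,
                        void (*issue_prefetch)(uint64_t)) {
    StrideEntry *e = &stride_table[pc % TABLE_SIZE];
    if (e->pc_tag != pc) {                       /* table miss: (re)allocate entry */
        *e = (StrideEntry){ .pc_tag = pc, .last_addr = addr };
        return;
    }
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->last_stride) {
        if (e->confidence < CONF_MAX) e->confidence++;
    } else {
        e->last_stride = stride;
        e->confidence  = 0;
    }
    if (e->confidence >= CONF_MAX)               /* stable stride: prefetch ahead */
        for (int k = 1; k <= DEGREE; k++)
            issue_prefetch(addr + (uint64_t)(k * e->last_stride));
    e->last_addr = addr;
}
```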

Stride prefetchers work well for simple regular patterns. More complex patterns (multiple interleaved strides, irregular patterns like linked-list traversals) require more sophisticated prefetchers. Modern prefetchers often employ techniques to detect and predict more complex sequences of strides (delta correlation).

Path Confidence Based Lookahead Prefetching:

  • An advanced technique that records a history of strides (a “path” or “signature”).
  • It then predicts the next stride based on this history.
  • Can use bootstrapped (lookahead) prediction: the predicted next stride is appended to the history, allowing further predictions until the accumulated path confidence drops below a threshold. This allows prefetching far ahead.
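
A hedged sketch of that lookahead loop (loosely modeled on signature-based path-confidence prefetchers; table size, hashing, and thresholds are assumptions rather than the published design):

```c
#include <stdint.h>

#define SIG_BITS        12
#define SIG_MASK        ((1u << SIG_BITS) - 1)
#define PT_ENTRIES      (1u << SIG_BITS)
#define CONF_THRESHOLD  0.25   /* stop looking ahead below this path confidence */
#define MAX_DEPTH       8      /* cap on lookahead depth                        */

typedef struct {
    int    delta;        /* predicted next stride (in cache blocks)        */
    double confidence;   /* how often this delta followed this signature   */
} PatternEntry;

static PatternEntry pattern_table[PT_ENTRIES];

/* Fold a new delta into the stride-history signature. */
static uint32_t update_signature(uint32_t sig, int delta) {
    return ((sig << 3) ^ (uint32_t)delta) & SIG_MASK;
}

/* Bootstrapped lookahead: keep predicting along the path until the
 * accumulated confidence drops below the threshold. */
void lookahead_prefetch(uint32_t sig, uint64_t block_addr,
                        void (*issue_prefetch)(uint64_t)) {
    double path_conf = 1.0;
    for (int depth = 0; depth < MAX_DEPTH; depth++) {
        PatternEntry e = pattern_table[sig];
        path_conf *= e.confidence;
        if (e.delta == 0 || path_conf < CONF_THRESHOLD)
            break;
        block_addr += (uint64_t)(int64_t)e.delta;   /* next predicted block    */
        issue_prefetch(block_addr);
        sig = update_signature(sig, e.delta);       /* extend the history path */
    }
}
```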

Continue here: 25 Prefetching, RL based PF, Execution-Based PF, Virtual Memory