Lecture from: 25.05.2023 | Video: YT

This lecture delves into the intricacies of cache design and management, building upon the foundational concepts of memory hierarchies. We will explore various policies, architectural choices, and software techniques that influence cache performance.

The Memory Hierarchy Context

Modern computer systems employ a hierarchy of memory levels, each with varying speed, capacity, and cost.

Typically, closer to the processor, caches like L1 are small (tens of KB) and fast (nanosecond latency), while further levels like L3 are larger (megabytes) but slower (tens of nanoseconds). Main memory (DRAM) offers gigabytes of storage but with latencies around 100ns.

This hierarchy can even extend beyond a single machine. In large-scale servers, remote memory architectures allow a compute node to access memory on other dedicated memory nodes over a network. This approach addresses the capacity limitations of a single node, enabling support for very large, data-intensive workloads by creating an even larger, albeit slower, tier in the memory hierarchy.

Cache Organization and Associativity

The way data is mapped from main memory to the cache is a fundamental design choice.

  • Direct-Mapped Cache: Each memory block can only be placed in one specific location in the cache. This is simple but can suffer from conflict misses if multiple frequently used blocks map to the same location.
  • Fully-Associative Cache: A memory block can be placed in any location in the cache. This offers the most flexibility and eliminates conflict misses but is complex and costly due to the need for many comparators.
  • Set-Associative Cache: A compromise where the cache is divided into sets, and a memory block maps to a specific set but can be placed in any of the ‘ways’ (locations) within that set. For an N-way set-associative cache, there are N possible locations within a set for a given block.
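
To make the set-associative mapping concrete, here is a minimal sketch of how an address is split into block offset, set index, and tag (the parameters, 32 KB, 4-way, 64-byte blocks, are made-up values, not from the lecture):

    #include <cstdint>
    #include <cstdio>

    // Hypothetical cache parameters: 32 KB capacity, 4-way, 64-byte blocks.
    const uint64_t CACHE_BYTES = 32 * 1024;
    const uint64_t BLOCK_BYTES = 64;
    const uint64_t WAYS        = 4;
    const uint64_t NUM_SETS    = CACHE_BYTES / (BLOCK_BYTES * WAYS); // 128 sets

    int main() {
        uint64_t addr   = 0x12345678;
        uint64_t offset = addr % BLOCK_BYTES;               // byte within the block
        uint64_t set    = (addr / BLOCK_BYTES) % NUM_SETS;  // which set the block maps to
        uint64_t tag    = addr / (BLOCK_BYTES * NUM_SETS);  // identifies the block within its set
        // On a lookup, all WAYS tags stored in set 'set' are compared against 'tag'.
        printf("offset=%llu set=%llu tag=0x%llx\n",
               (unsigned long long)offset, (unsigned long long)set,
               (unsigned long long)tag);
        return 0;
    }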

Trade-offs of Associativity

Increasing associativity generally improves the hit rate by reducing conflict misses. However, it also increases hardware complexity (more comparators and wider MUXes), cost, and potentially the cache access time (hit latency). The benefits in hit rate often show diminishing returns as associativity becomes very high.

Management in Set-Associative Caches

When a block can go into multiple ‘ways’ within a set, policies are needed to manage these ways. These policies revolve around three key decisions:

  1. Insertion: When a new block is brought into a set (cache fill), where is it placed (e.g., in the Most Recently Used - MRU - position)? Should it even be inserted if it’s predicted to have no reuse?
  2. Promotion: If a block already in the set is accessed (a hit), how is its priority updated (e.g., promoted to MRU)?
  3. Eviction/Replacement: If a set is full and a new block needs to be brought in (a miss), which existing block should be removed?
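
As a concrete illustration of all three decisions, the following minimal sketch manages a single 4-way set as an LRU stack: insertion places a new block in the MRU position, a hit promotes the accessed block to MRU, and eviction drops the block in the LRU position (a hypothetical C++ model, not hardware):

    #include <cstdio>

    const int WAYS = 4;
    long long set_ways[WAYS];  // set_ways[0] = MRU ... set_ways[WAYS-1] = LRU
    int used = 0;              // number of valid blocks currently in the set

    // Access one block (identified by its tag); returns true on a hit.
    bool access(long long tag) {
        for (int i = 0; i < used; i++) {
            if (set_ways[i] == tag) {
                // Promotion: move the hit block to the MRU position.
                for (int j = i; j > 0; j--) set_ways[j] = set_ways[j - 1];
                set_ways[0] = tag;
                return true;
            }
        }
        // Miss. Eviction: if the set is full, the LRU block (last slot) is
        // overwritten when everything shifts down by one position.
        if (used < WAYS) used++;
        for (int j = used - 1; j > 0; j--) set_ways[j] = set_ways[j - 1];
        // Insertion: place the new block in the MRU position.
        set_ways[0] = tag;
        return false;
    }

    int main() {
        long long trace[] = {1, 2, 3, 4, 1, 5, 2};
        for (long long t : trace)
            printf("block %lld: %s\n", t, access(t) ? "hit" : "miss");
        return 0;
    }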

Eviction/Replacement Policies

If a cache set is full, an existing block must be evicted. Common policies include:

  • Random: Evict a randomly chosen block. Simple and can be effective against thrashing.
  • FIFO (First-In, First-Out): Evict the block that has been in the set the longest.
  • LRU (Least Recently Used): Evict the block that hasn’t been accessed for the longest time. This policy aims to capitalize on temporal locality.

Implementing true LRU becomes complex for highly associative caches. For an N-way set, tracking the precise access order means distinguishing N! possible orderings, which requires at least ⌈log₂(N!)⌉ bits per set: a 2-way cache needs 1 bit per set, but a 4-way cache already needs at least ⌈log₂(4!)⌉ = 5 bits per set. This complexity leads to the use of LRU approximations in many modern processors, such as Not-MRU (evicting any block other than the MRU one) or tree-based pseudo-LRU.
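
The tree-based variant is worth a closer look: for a 4-way set it needs only 3 bits instead of 5. A minimal sketch for a single 4-way set (a hypothetical model, not any specific processor's implementation): each bit of a small binary tree points toward the half of the set that should be victimized next, and every access flips the bits on its path to point away from the accessed way.

    #include <cstdio>

    // 3 pseudo-LRU bits for one 4-way set.
    // Convention: bit = 0 means "the victim is on the left side" of that node,
    //             bit = 1 means "the victim is on the right side".
    int b0 = 0;  // root: left pair (ways 0/1) vs. right pair (ways 2/3)
    int b1 = 0;  // within the left pair:  way 0 vs. way 1
    int b2 = 0;  // within the right pair: way 2 vs. way 3

    // Called on every access (hit or fill) to way w: make the bits on w's path
    // point away from w.
    void touch(int w) {
        if (w <= 1) { b0 = 1; b1 = (w == 0) ? 1 : 0; }  // accessed the left pair
        else        { b0 = 0; b2 = (w == 2) ? 1 : 0; }  // accessed the right pair
    }

    // Follow the bits from the root to find the pseudo-LRU victim way.
    int victim() {
        if (b0 == 0) return (b1 == 0) ? 0 : 1;
        else         return (b2 == 0) ? 2 : 3;
    }

    int main() {
        touch(0); touch(2); touch(1);           // ways 0, 2, 1 accessed, in that order
        printf("victim = way %d\n", victim());  // prints 3: the only way not touched
        return 0;
    }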

In scenarios like set thrashing (where the working set mapping to a set exceeds its associativity, e.g., accessing 5 blocks cyclically in a 4-way set), LRU performs poorly (0% hit rate), while Random replacement can perform better.

The optimal replacement policy (Belady’s OPT or MIN) is to evict the block that will be referenced furthest in the future. While it guarantees the minimum miss rate, it’s not practically implementable (requires future knowledge) and isn’t necessarily optimal for overall execution time because it doesn’t consider variable miss costs or the ability of modern CPUs to overlap miss latencies (Memory-Level Parallelism - MLP).
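
Although OPT cannot be implemented in hardware, it is easy to express offline when the entire reference trace is known, which is how it is used as an upper bound in studies. A minimal sketch (a hypothetical fully-associative cache of capacity 2, not tied to any real configuration):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> trace = {1, 2, 3, 1, 2, 3};
        const size_t capacity = 2;
        std::vector<int> cache;
        int hits = 0;

        for (size_t t = 0; t < trace.size(); t++) {
            int block = trace[t];
            if (std::find(cache.begin(), cache.end(), block) != cache.end()) {
                hits++;      // hit: OPT needs no bookkeeping
                continue;
            }
            if (cache.size() == capacity) {
                // Evict the resident block whose next reference lies furthest
                // in the future (or that is never referenced again).
                size_t victim = 0, farthest = 0;
                for (size_t i = 0; i < cache.size(); i++) {
                    size_t next = trace.size();  // "never used again"
                    for (size_t u = t + 1; u < trace.size(); u++)
                        if (trace[u] == cache[i]) { next = u; break; }
                    if (next >= farthest) { farthest = next; victim = i; }
                }
                cache.erase(cache.begin() + victim);
            }
            cache.push_back(block);
        }
        printf("hits: %d of %zu accesses\n", hits, trace.size());
        return 0;
    }

On this trace, OPT scores 2 hits, while LRU with the same capacity would score none.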

Cache Write Policies

Handling writes to the cache involves several important design decisions. One crucial piece of metadata is the dirty bit (or modified bit) in the tag store, which indicates if a cache block has been changed since being loaded.

Write-Through vs. Write-Back

This determines when modified data is propagated to the next lower memory level.

  • Write-Through: Data is written to the current cache level and simultaneously to the next lower level.
    • Pros: Simpler design, main memory is always up-to-date (simplifies coherence).
    • Cons: High bandwidth usage, no write combining (multiple writes to the same block go to memory individually).
  • Write-Back: Data is written only to the current cache level, and the block is marked “dirty.” The modified block is written to the next level only when it’s evicted.
    • Pros: Allows write combining, reducing bandwidth and energy by consolidating multiple writes to a block into a single write-back upon eviction.
    • Cons: More complex (needs dirty bits), next level isn’t always up-to-date (complicates coherence).

Most high-performance caches today use write-back.

Write-Allocate vs. No-Write-Allocate

This policy applies on a write miss.

  • Write-Allocate (Fetch-on-Write): On a write miss, the block is first fetched into the cache, and then the write is performed.
    • Pros: Allows subsequent writes to the block to be combined (if write-back is used).
    • Cons: Fetches the entire block even if only a small part is written, potentially inefficient.
  • No-Write-Allocate (Write-Around): On a write miss, the block is not brought into the cache. The write goes directly to the next lower level.
    • Pros: Conserves cache space if the locality of written blocks is low.
    • Cons: Subsequent writes to the same block cannot be combined in the cache; each one goes to the next level.

The common combination is write-back with write-allocate.
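
A minimal sketch of how this combination uses the dirty bit (a hypothetical model of a single cache line, tracking tags only, no data): stores update only the cache and set the dirty bit, and a write-back to the next level happens only when a dirty block is evicted.

    #include <cstdio>

    // One cache line (hypothetical: a single way of a single set).
    struct Line {
        bool valid = false;
        bool dirty = false;
        long long tag = -1;
    };

    Line line;
    int writebacks = 0;

    // Write-back + write-allocate handling of a store to block 'tag'.
    void store(long long tag) {
        if (!(line.valid && line.tag == tag)) {
            // Write miss: write-allocate first brings the block into the cache,
            // writing back the dirty victim if there is one.
            if (line.valid && line.dirty) writebacks++;
            line.valid = true;
            line.dirty = false;
            line.tag   = tag;
        }
        line.dirty = true;  // the store itself stays in the cache (write-back)
    }

    int main() {
        store(7); store(7); store(7);  // three stores to the same block
        store(9);                      // a different block evicts it
        // Only one write-back occurred: the three stores were combined.
        printf("writebacks = %d\n", writebacks);
        return 0;
    }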

Subblocked (Sectored) Caches

To handle writes more efficiently, especially when write granularity (e.g., 4 bytes) is much smaller than block granularity (e.g., 64 bytes):

  • Idea: Divide a cache block into smaller subblocks or sectors. The block has one tag, but each subblock has its own valid and dirty bits.
  • Benefits:
    • Allows writing to a subblock without fetching the entire block (if it’s a full subblock write).
    • Reduces data transfer on misses if only specific subblocks are needed.
    • Finer granularity for cache management.
  • Drawbacks: More complex design due to additional valid/dirty bits.
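
A minimal sketch of the extra metadata a sectored cache keeps (hypothetical parameters: one 64-byte block split into four 16-byte subblocks): a single tag, but separate valid and dirty bits per subblock.

    #include <cstdio>

    const int SUBBLOCKS = 4;  // e.g., a 64-byte block split into 16-byte sectors

    // One sectored cache block: one tag, but per-subblock state.
    struct SectoredBlock {
        long long tag = -1;
        bool valid[SUBBLOCKS] = {};  // which subblocks actually hold data
        bool dirty[SUBBLOCKS] = {};  // which subblocks have been modified
    };

    SectoredBlock blk;

    // A full-subblock write: no need to fetch the rest of the block first.
    void write_subblock(long long tag, int sub) {
        blk.tag = tag;  // simplified: assumes this block frame was free
        blk.valid[sub] = true;
        blk.dirty[sub] = true;
    }

    int main() {
        write_subblock(42, 1);  // write only subblock 1 of block 42
        for (int s = 0; s < SUBBLOCKS; s++)
            printf("subblock %d: valid=%d dirty=%d\n", s, blk.valid[s], blk.dirty[s]);
        // On eviction, only the dirty subblocks would be written back.
        return 0;
    }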

Cache Hierarchy Design Choices

Instruction vs. Data Caches

  • First-Level (L1) Caches: Almost always split into separate Instruction (I-cache) and Data (D-cache).
    • Reason: Primarily due to pipeline design. Instructions are fetched in early pipeline stages, and data is accessed in later stages. Separate caches allow these accesses to occur in parallel without structural hazards.
  • Outer-Level (L2, L3) Caches: Usually unified, storing both instructions and data.
    • Benefit: Better overall cache utilization through dynamic sharing of space.

Multi-level Cache Management

Design and management policies vary significantly by cache level:

  • L1 Caches: Latency is critical. Tags and data are typically accessed in parallel. Policies are optimized for speed.
  • L2/L3 Caches: Larger and more highly associative. Access latency is less critical than for L1. To save power, the tag and data stores are often accessed serially (check the tag first; access the data only on a hit). Access energy is a major concern for large outer-level caches.

Accessing Cache Levels

  • Serial Access: L2 is accessed only if L1 misses. L3 is accessed only if L2 misses, and so on.
    • Pros: Saves power and bandwidth by avoiding unnecessary accesses to lower levels.
    • Cons: Increases latency on a miss cascade.
    • The filtering effect is significant: L2 sees only L1 misses, L3 sees only L2 misses. The access patterns and locality characteristics are very different at each level, necessitating different management policies.
  • Parallel Access: L2 is accessed speculatively at the same time as L1.
    • Pros: Faster if L1 misses.
    • Cons: Wasteful of energy and bandwidth if L1 hits. For this reason, most systems use serial access between levels, as in the sketch below.
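
A minimal sketch of serial access across levels (the hit latencies are made-up values and capacities are unlimited for simplicity), showing both the latency cascade on misses and the filtering effect: a level is consulted, and counts an access, only when the level above it missed.

    #include <cstdio>
    #include <set>

    // Hypothetical per-level hit latencies in cycles (not from the lecture).
    const int LAT[4] = {4, 12, 40, 100};  // L1, L2, L3, DRAM
    const char* NAME[4] = {"L1", "L2", "L3", "DRAM"};

    std::set<long long> level[3];          // which blocks each cache level holds
    int accesses[4] = {0, 0, 0, 0};        // how many requests each level sees

    // Serial lookup: go to the next level only on a miss.
    int load(long long block) {
        int latency = 0;
        for (int l = 0; l < 3; l++) {
            accesses[l]++;
            latency += LAT[l];
            if (level[l].count(block)) return latency;  // hit at this level
            level[l].insert(block);  // fill the block here (simplified: no evictions)
        }
        accesses[3]++;
        return latency + LAT[3];  // DRAM always has the data
    }

    int main() {
        long long trace[] = {1, 2, 1, 3, 2, 1};
        for (long long b : trace) printf("block %lld: %d cycles\n", b, load(b));
        for (int l = 0; l < 4; l++) printf("%s accesses: %d\n", NAME[l], accesses[l]);
        return 0;
    }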

Hierarchy Inclusivity

  • Inclusive: Data in an inner cache (e.g., L1) is also present in outer caches (L2, L3). Simplifies coherence but duplicates data.
  • Exclusive: Data resides in at most one cache level. Maximizes unique capacity but makes data movement and coherence more complex.
  • Non-Inclusive: No strict guarantee. Offers design flexibility.

Cache Performance and Influencing Factors

Key parameters influencing cache performance (hit/miss rates and latencies) include cache size, block size, associativity, and various management policies.

Cache Size

Larger caches can hold more of an application’s working set (actively used data), improving temporal locality and hit rates. However, very large caches are slower and costlier. The benefit of increasing cache size varies greatly among applications, with some showing low utility, some high utility, and others saturating once their working set fits.

Block Size

The optimal block size balances spatial locality exploitation with other factors.

  • Small blocks: Do not exploit spatial locality well, and the tag overhead is high relative to the data stored.
  • Large blocks: Exploit spatial locality well, but reduce the number of distinct blocks a fixed-size cache can hold (hurting temporal locality) and increase the miss penalty; bandwidth is wasted if only a small part of the large block is used.
    • Critical-Word First: Fetches the requested (critical) word of a large block first to reduce effective miss latency (see the sketch after this list).
    • Subblocking: Allows transferring only needed parts of a large block.
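
A minimal sketch of critical-word-first ordering (hypothetical parameters: a 64-byte block transferred in 8-byte bus beats): the beat containing the requested word is sent first, and the rest of the block follows in wrap-around order.

    #include <cstdio>

    const int BLOCK_BYTES = 64;
    const int BEAT_BYTES  = 8;   // bytes transferred per bus beat
    const int BEATS       = BLOCK_BYTES / BEAT_BYTES;

    int main() {
        int requested_offset = 40;  // byte offset of the load within the block
        int first_beat = requested_offset / BEAT_BYTES;
        printf("transfer order (byte ranges within the block):\n");
        for (int i = 0; i < BEATS; i++) {
            // Wrap-around order, starting with the beat holding the critical word.
            int beat = (first_beat + i) % BEATS;
            printf("  beat %d: bytes %d-%d\n",
                   beat, beat * BEAT_BYTES, beat * BEAT_BYTES + BEAT_BYTES - 1);
        }
        // The processor can resume as soon as the first beat (bytes 40-47) arrives.
        return 0;
    }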

Associativity

Higher associativity reduces conflict misses. It’s feasible to have non-power-of-2 associativity (e.g., 3-way), as it primarily affects the number of comparators and MUXing within a set, not address decoding for set selection.

Classification of Cache Misses (The 3 Cs)

  1. Compulsory (Cold): First-time access to a block.
  2. Capacity: Cache is too small for the working set, even if fully associative.
  3. Conflict: Misses due to too many blocks mapping to the same set in direct-mapped or set-associative caches, even if overall capacity is sufficient.

Software Approaches for Higher Hit Rate

Software can significantly impact cache performance by adapting code and data structures to the cache’s characteristics.

Restructuring Data Access Patterns

  • Loop Interchange: Modify loop nesting to access data in the order it’s stored in memory (e.g., row-major vs. column-major access).
    • Example: For a column-major array x[i][j], iterating j (columns) in the outer loop and i (rows) in the inner loop (for j ... for i ... x[i][j]) ensures contiguous memory accesses in the inner loop, improving spatial locality (a C sketch, for C's row-major layout, follows this list).
  • Blocking (Tiling): Divide large data structures (like matrices in matrix multiplication) into smaller blocks or tiles that fit into the cache. Computations are then performed tile-by-tile. This enhances temporal locality for the tiled data.
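
A minimal C-style sketch of both techniques. Caveat: C arrays are row-major, so here the cache-friendly order is i (rows) outer and j (columns) inner, the mirror image of the column-major example above; the principle of walking memory contiguously in the innermost loop is the same. The tile size B is a made-up value and would normally be chosen so that the tiles currently being worked on fit in the cache.

    #include <cstdio>

    const int N = 512;
    const int B = 32;  // hypothetical tile size
    static double x[N][N], a[N][N], b[N][N], c[N][N];

    // Loop interchange: with row-major storage, the innermost loop should walk j,
    // so consecutive iterations touch consecutive memory locations (spatial locality).
    double sum_good() {
        double s = 0.0;
        for (int i = 0; i < N; i++)      // rows in the outer loop
            for (int j = 0; j < N; j++)  // columns in the inner loop: contiguous accesses
                s += x[i][j];
        return s;
    }

    // Blocking (tiling): work on BxB tiles so that the tiles of a, b, and c
    // currently in use stay resident in the cache (temporal locality).
    void matmul_tiled() {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    // Multiply one tile of a by one tile of b, accumulating into c.
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double s = c[i][j];
                            for (int k = kk; k < kk + B; k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }

    int main() {
        x[1][2] = 3.0;
        printf("sum = %f\n", sum_good());
        matmul_tiled();
        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }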

Restructuring Data Layout

  • Hot/Cold Data Separation: Identify frequently accessed (hot) and infrequently accessed (cold) fields within a data structure.
    • Example: In a linked list node struct {Node* next; int key; char name[256];}, next and key are hot during traversal, while name is cold until a matching key is found.
    • Restructuring: Place hot fields in one structure and cold fields in a separate structure, linked by a pointer.
      // Original: hot and cold fields share each cache line
      struct Node_orig {
          Node_orig* next; // Hot
          int key;         // Hot
          char name[256];  // Cold
      };

      // Restructured: cold fields live behind a pointer
      struct Node_cold {
          char name[256];
      };
      struct Node_hot {
          Node_hot* next;
          int key;
          Node_cold* cold_data_ptr; // Pointer to cold part
      };
    • This ensures that cache lines are filled predominantly with hot data during common operations (like list traversal), improving cache utilization and hit rates.

Continue here: 24 Multicore Caching and Cache Coherence, Prefetching