Lecture from: 19.05.2023 | Video: YT
This lecture introduces the fundamental concepts of memory hierarchy and caches, explaining why they are essential in modern computer systems and how they operate at a basic level.
The Need for Memory Hierarchy
The ideal memory system would be infinitely large, infinitely fast, have zero cost, infinite bandwidth, and consume zero energy. However, real-world memory technologies present fundamental trade-offs:
- Bigger is Slower: Larger memory structures inherently take longer to access. This is due to longer internal signal paths (word-lines, bit-lines) and the increased complexity of locating data.
- Faster is More Expensive: Technologies like SRAM are much faster than DRAM but are also significantly more costly per bit and have lower density (requiring more chip area). DRAM, in turn, is faster and more expensive than SSDs or traditional hard disks.
- Higher Bandwidth is More Expensive: Achieving higher data transfer rates requires more parallelism (banks, ports, channels) or faster, more power-hungry interface technology.
- Energy: Faster memory technologies are generally more energy-efficient per access, but managing the overall energy consumption of the entire memory system is complex. DRAM, for instance, requires refresh operations which consume power even when idle.
Given these opposing requirements, a single level of memory cannot satisfy all needs. The solution is a memory hierarchy:
- Idea: Employ multiple levels of storage. Levels closer to the processor are smaller, faster, and more expensive per bit. Levels further away are progressively larger, slower, and cheaper per bit.
- Goal: Keep frequently accessed data in the faster, closer levels to make the overall memory system appear nearly as fast as the fastest level and as large as the largest level.
This works due to the principle of locality of reference.
Locality of Reference
Programs tend to exhibit predictable access patterns:
- Temporal Locality: If a memory location is accessed, it is likely to be accessed again soon. This is common with loops and frequently used variables.
- Spatial Locality: If a memory location is accessed, nearby memory locations are likely to be accessed soon. This is evident in sequential instruction execution and operations on contiguous data structures like arrays.
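As a small illustration (not from the lecture; the function and variable names are made up), the C loop below exhibits both kinds of locality at once: the accumulator is reused every iteration (temporal), and the array is walked sequentially (spatial).

```c
#include <stddef.h>

// Sums an array of n doubles.
// Temporal locality: `sum` and `i` are accessed on every iteration.
// Spatial locality: a[0], a[1], a[2], ... are adjacent in memory, so one fetched
// cache block typically serves several consecutive iterations.
double sum_array(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```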
Caching Basics
Caches are smaller, faster memory levels that store recently or frequently accessed data to exploit locality.
- Core Idea (Temporal Locality): Store recently accessed data in an automatically-managed fast memory (the cache) with the anticipation that the same memory location will be accessed again soon.
- Core Idea (Spatial Locality): When a memory location is accessed, fetch not just that item but a contiguous block (also called a cache line) of data surrounding it into the cache. The anticipation is that nearby memory locations will be accessed soon.
- Memory is logically divided into fixed-size blocks.
- The cache stores a subset of these blocks.
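As a rough sketch of what a cache keeps per block (the field names and the 64-byte block size are illustrative assumptions, not from the lecture):

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 64   // assumed block (cache line) size in bytes

// One cache line: metadata plus a copy of one memory block.
typedef struct {
    bool     valid;             // does this line currently hold a real block?
    uint64_t tag;               // identifies which memory block is stored here
    uint8_t  data[BLOCK_SIZE];  // the cached copy of the block itself
} cache_line_t;
```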
Cache Operation: When the processor requests data from a memory address:
- The cache is checked.
- Hit: If the requested data is found in the cache, it is supplied to the processor quickly from the cache.
- Miss: If the requested data is not in the cache, it must be fetched from a lower (slower) level of the memory hierarchy (e.g., main memory). The fetched block is then placed into the cache, potentially evicting an existing block if the cache is full.
Key Cache Design Decisions:
- Placement: Where can a given memory block be placed in the cache?
- Replacement: If the cache is full, which block should be evicted to make space for a new block?
- Granularity of Management: What is the size of a block? Are there subblocks?
- Write Policy: How are writes to the cache handled and propagated to lower memory levels?
- Instruction/Data Separation: Are instructions and data stored in the same cache or separate caches?
Logical Cache Organization
The fundamental question is how to map memory blocks (from the large main memory address space) to the limited number of locations in the cache.
There are three main strategies for block placement:
- Direct-Mapped: A memory block can be placed in exactly one predetermined location in the cache. This location is typically determined by some bits of the block’s address (the index bits).
- Fully-Associative: A memory block can be placed in any location in the cache.
- Set-Associative: A compromise. The cache is divided into multiple sets. A memory block maps to a specific set (determined by index bits), but within that set, it can be placed in any of the N ways (locations). This is an N-way set-associative cache.
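A hedged sketch of the three mapping rules, assuming power-of-two sizes and using block_number = byte_address / block_size (the helper names are made up for illustration):

```c
#include <stdint.h>

// Direct-mapped: exactly one candidate line per memory block.
uint64_t direct_mapped_line(uint64_t block_number, uint64_t num_lines) {
    return block_number % num_lines;
}

// N-way set-associative: one candidate set, N candidate ways inside it.
uint64_t set_assoc_set(uint64_t block_number, uint64_t num_sets) {
    return block_number % num_sets;
}

// Fully-associative: a single set containing all lines, so any line is a
// candidate and no index computation is needed.
```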
Cache Addressing and Basic Operation
To access the cache, the memory address is typically divided into three parts:
- Tag: The higher-order bits of the address used to uniquely identify the memory block.
- Index: Bits used to select the specific set (or line in a direct-mapped cache) where the block might reside.
- Byte Offset (or Block Offset): Lower-order bits used to identify the specific byte (or word) within the fetched block.
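A minimal sketch of the field extraction, assuming a byte-addressed memory with power-of-two block size and set count (the concrete parameters below are illustrative, not the lecture's):

```c
#include <stdint.h>

#define BLOCK_SIZE  64u   // bytes per block (assumed)
#define NUM_SETS    64u   // number of sets (assumed)
#define OFFSET_BITS 6u    // log2(BLOCK_SIZE)
#define INDEX_BITS  6u    // log2(NUM_SETS)

// Split a byte address into its offset, index, and tag fields.
static inline uint64_t byte_offset(uint64_t addr) { return addr & (BLOCK_SIZE - 1); }
static inline uint64_t set_index(uint64_t addr)   { return (addr >> OFFSET_BITS) & (NUM_SETS - 1); }
static inline uint64_t tag_of(uint64_t addr)      { return addr >> (OFFSET_BITS + INDEX_BITS); }
```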
Cache Access Steps (General):
- Use the index bits of the address to select a set in the cache.
- For each way in the selected set (or the single line in a direct-mapped cache):
  - Check the valid bit associated with the cache line. If it is not set, the line doesn’t hold useful data.
  - If valid, compare the tag bits of the incoming address with the tag stored in the cache line’s tag store.
- Hit: If a valid line is found with a matching tag, a cache hit occurs. The data from that line is then selected (using the byte offset if necessary) and sent to the processor.
- Miss: If no valid line with a matching tag is found in the set, a cache miss occurs.
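Putting the steps together, a hedged sketch of the tag check for one set of an N-way cache (the struct and names are illustrative; the data store is omitted for brevity):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 4   // assumed associativity

typedef struct {
    bool     valid;
    uint64_t tag;
} tag_entry_t;        // tag-store entry for one way

// Returns true on a hit and writes the matching way to *way_out.
bool lookup_set(const tag_entry_t set[NUM_WAYS], uint64_t tag, int *way_out) {
    for (int w = 0; w < NUM_WAYS; w++) {
        // (a) the line must be valid, and (b) its stored tag must match.
        if (set[w].valid && set[w].tag == tag) {
            *way_out = w;
            return true;   // hit: data from this way is sent to the processor
        }
    }
    return false;          // miss: the block must be fetched from the next level
}
```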
Cache Types
Direct-Mapped Cache Example
Let’s consider a toy example:
- Main Memory: 256 bytes with 8-byte blocks, so 256 / 8 = 32 blocks in memory.
- Cache: 64 bytes with 8-byte blocks, so 64 / 8 = 8 cache lines (blocks).
- Since it’s direct-mapped, each memory block maps to one specific cache line.
- Number of sets (lines) = 8.
- Block size = 8 bytes, so the byte offset needs log2(8) = 3 bits.
- Index bits needed = log2(8 sets) = 3 bits.
- Total address bits (for 256-byte memory) = log2(256) = 8 bits.
- Tag bits = Total - Index - Offset = 8 - 3 - 3 = 2 bits.
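As a worked example with an address chosen purely for illustration: take the 8-bit address 0b10110100 (0xB4). Its low 3 bits (100) are the byte offset (byte 4 within the block), the next 3 bits (110) are the index (cache line 6), and the top 2 bits (10) form the tag.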
Problem with Direct-Mapped Caches: Conflict Misses.
If two (or more) actively used memory blocks map to the same cache index, they will repeatedly evict each other, even if the rest of the cache is empty. For example, if blocks A and B both map to the same index, the access sequence A, B, A, B, … misses every time: A evicts B, then B evicts A, and so on. Such patterns can drive the hit rate down to 0%.
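Concretely, in this toy cache any two blocks whose addresses differ by a multiple of the cache size (64 bytes) share an index: addresses 0x00 and 0x40, for example, both have index 0, so alternating accesses to them evict each other on every reference while lines 1 through 7 sit unused (addresses chosen for illustration).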
Set-Associative Caches
To reduce conflict misses, set-associativity allows a memory block to map to one of N possible locations (ways) within a designated set.
- Example: 2-Way Set-Associative Cache
- Continuing the 64-byte cache, 8-byte block example:
- If it’s 2-way set-associative, the number of sets = 8 lines / 2 ways = 4 sets.
- Index bits needed = log2(4 sets) = 2 bits.
- Tag bits = Total - Index - Offset = 8 - 2 - 3 = 3 bits.
- Now, two memory blocks that map to the same set (e.g., set 0) can co-exist in the cache, one in Way 0 and one in Way 1 of that set.
- Operation: The index selects a set. The tags of all ways in that set are compared simultaneously with the incoming address’s tag. If any way has a valid matching tag, it’s a hit.
- Benefit: Accommodates conflicts better.
- Cost: More complex hardware (multiple comparators, MUX to select data from the hit way), potentially slower access, larger tag store (tags are wider because the index is smaller).
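Revisiting the illustrative conflict from the direct-mapped example: addresses 0x00 and 0x40 both map to set 0 in this 2-way organization, but one block can sit in Way 0 and the other in Way 1, so the alternating A, B, A, B, … pattern now hits after the two initial cold misses.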
Higher Associativity and Full Associativity
- N-Way Set-Associative: As N increases (e.g., 4-way, 8-way), the number of locations a block can map to within its set increases, further reducing conflict misses. However, complexity and cost also increase.
- Fully-Associative Cache: The ultimate in flexibility. N equals the total number of blocks in the cache (i.e., there is only one set). Any block can go into any line.
- No index bits are used to select a set. The entire block address (minus offset) is the tag.
- All tags in the cache are compared in parallel.
- Eliminates conflict misses entirely. Misses are only compulsory or capacity.
- Most expensive and potentially slowest due to the large number of comparators.
Associativity Tradeoffs (Summary)
- Higher Associativity:
  - (+) Generally higher hit rate (fewer conflict misses).
  - (-) Slower access time (higher hit latency).
  - (-) More expensive hardware.
  - There are diminishing returns; the improvement in hit rate lessens as associativity gets very high.
This concludes the review of material from Lecture 22, setting the stage for more advanced cache design and management topics.
Continue here: 23 Cache Design and Management