Lecture from: 11.05.2023 | Video: YT
This lecture focused on Graphics Processing Units (GPUs) as prime examples of modern Single Instruction, Multiple Data (SIMD) architectures, exploring their execution model, architecture, and evolution, particularly in the context of accelerating data-parallel workloads like those in machine learning.
I suggest reading the Microarch Summary to make these concepts (which felt overwhelming to me) a bit clearer…
GPUs as SIMD Engines
Graphics Processing Units (GPUs) have evolved from specialized graphics renderers to powerful parallel processors used for a wide range of data-parallel computations. Underneath, GPUs utilize a Single Instruction, Multiple Data (SIMD) architecture, though the programming model exposed to developers is typically based on threads, not explicit SIMD instructions.
In a GPU, a set of threads executing the same instruction on different data elements is dynamically grouped by the hardware into a warp (NVIDIA terminology, typically 32 threads) or wavefront (AMD terminology). All threads within a warp execute the same instruction concurrently, but each thread operates on its own piece of data.
GPUs employ a programming model called SPMD (Single Program, Multiple Data), where multiple threads run the same program code but operate on different data. This model is well-suited for data-parallel workloads where the same computation is applied to many data elements. The GPU hardware then takes this SPMD code and executes it efficiently on its underlying SIMD/SIMT architecture.
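To make the SPMD idea concrete, here is a minimal CUDA sketch (the kernel name and launch configuration are just illustrative): every thread runs exactly the same kernel code, but each one derives its own index and therefore works on a different element.

```cpp
// Minimal SPMD example: one program, many threads, each on its own data.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    // Each thread derives a unique global index from its block/thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // threads past the end simply do nothing
        c[i] = a[i] + b[i];      // same instruction, different data element
    }
}

// Host side: launch enough threads so every element gets exactly one, e.g.
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```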
The GPU instruction pipeline operates like a SIMD pipeline, processing the elements of a warp concurrently. However, unlike traditional SIMD processors, where instructions explicitly specify operations on vector registers, GPU threads execute scalar instructions. The hardware dynamically groups scalar threads that are at the same Program Counter (PC) into warps and issues them to the SIMD execution units; NVIDIA calls this execution model SIMT (Single Instruction, Multiple Thread).
Warp Execution and Architecture
A GPU core contains a SIMD pipeline. In NVIDIA terminology, the individual SIMD lanes are called Streaming Processors (SPs), also known as CUDA cores, and many SPs are grouped together into a Streaming Multiprocessor (SM).
Threads are organized into blocks by the programmer; the hardware assigns each block to an SM and divides it into warps, which the SM then schedules onto its SIMD pipeline.
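As a rough sketch of how a block maps onto warps (assuming NVIDIA's warp size of 32), each thread can compute which warp and lane of its block it lands in; the kernel below is purely illustrative.

```cpp
#include <cstdio>

// Sketch: where a thread lands within its block's warps (warpSize is 32 on NVIDIA).
__global__ void where_am_i() {
    int tid  = threadIdx.x;          // linear ID within the block
    int warp = tid / warpSize;       // which warp of the block this thread is in
    int lane = tid % warpSize;       // which SIMD lane within that warp
    if (lane == 0) {                 // one thread per warp reports
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp, tid);
    }
}

// Example launch: where_am_i<<<2, 128>>>();  // 2 blocks x 128 threads = 4 warps per block
```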
The execution of a warp on an SM utilizes the SIMD functional units in parallel. The structure of a GPU SIMD execution unit resembles that of a vector processor’s functional unit, with lanes processing different elements concurrently. However, the registers feeding these lanes are thread-private (or per-thread ID) registers rather than a single large vector register.
GPUs leverage warp-level fine-grained multithreading to hide latency. An SM can interleave the execution of multiple warps on its simple, in-order scalar pipelines. If a warp stalls (e.g., waiting for data from memory), the SM can quickly switch to another ready warp, keeping the functional units busy and hiding the latency. This requires storing the context (PC, registers) for many warps.
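The following is a toy, plain-C++ sketch of this idea (my own simplification, nothing like real scheduler hardware): each cycle the scheduler issues from some warp that is not stalled, so memory latency in one warp is hidden by work from the others.

```cpp
#include <vector>
#include <cstdio>

struct Warp {
    int pc = 0;            // per-warp program counter
    int stall_cycles = 0;  // e.g., remaining wait on a memory access
    bool done = false;
};

void run(std::vector<Warp>& warps, int cycles) {
    for (int c = 0; c < cycles; ++c) {
        for (auto& w : warps)                        // outstanding memory latencies tick down
            if (w.stall_cycles > 0) --w.stall_cycles;

        bool issued = false;
        for (auto& w : warps) {                      // pick any warp that is ready to issue
            if (w.done || w.stall_cycles > 0) continue;
            ++w.pc;                                  // issue one instruction for all lanes in lockstep
            if (w.pc % 4 == 0) w.stall_cycles = 10;  // pretend every 4th instruction is a long load
            if (w.pc >= 32) w.done = true;
            issued = true;
            break;                                   // one issue slot per cycle in this toy model
        }
        if (!issued) std::printf("cycle %d: no warp ready, pipeline idles\n", c);
    }
}

// Usage sketch: std::vector<Warp> warps(8); run(warps, 400);
```

With enough resident warps, the "pipeline idles" case almost never triggers, which is exactly the latency-hiding effect described above.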
Control Flow in GPUs: Branch Divergence
While SIMD is well-suited for regular parallelism, real-world code often contains control flow (branches). In SPMD code, different threads within the same warp might execute conditional branches based on their unique data values, leading to different threads taking different control flow paths. This is called branch divergence.
In a SIMD pipeline, where all lanes are designed to execute the same instruction, branch divergence reduces SIMD utilization. When a warp encounters a divergent branch, the paths are serialized: the threads taking one path execute while the threads on the other path are masked out (do not execute). Once the first path finishes, the GPU executes the other path for the previously masked-out threads.
This is similar to predicated execution or masked execution in traditional architectures, but handled dynamically in hardware in GPUs. The GPU uses a mask associated with the warp to track which threads are active for the current instruction based on their path.
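A tiny hypothetical kernel makes this concrete: within each warp, even and odd lanes disagree on the branch, so the hardware serializes the two paths and masks off the inactive lanes on each one.

```cpp
// Divergence within a warp: lanes disagree on the branch, so the two paths
// are executed one after the other, with the inactive lanes masked off.
// Assumes out has gridDim.x * blockDim.x elements.
__global__ void divergent(int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i % 2) == 0) {
        out[i] = i * 2;      // even lanes active here, odd lanes masked
    } else {
        out[i] = i * 3;      // odd lanes active here, even lanes masked
    }
    // On each path, only about half of the SIMD lanes do useful work.
}
```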
GPUs employ techniques like Dynamic Warp Formation/Merging to mitigate the performance loss from divergence. After a divergent branch, threads are potentially re-grouped based on their actual execution path to form new warps with higher SIMD utilization (fewer masked-out threads).
However, the flexibility of dynamically regrouping threads is constrained by the hardware, particularly the physical mapping of thread registers to SIMD lanes. Each thread's registers live in a specific lane, so threads that occupy the same lane cannot be merged into the same warp, which limits which threads can be combined to execute a particular instruction.
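As a side note (not from the lecture), divergence only hurts when lanes within the same warp disagree. A hypothetical variant of the earlier kernel that branches on the warp index instead of the thread index computes something different, but keeps every warp internally convergent, since all 32 lanes see the same condition.

```cpp
// The branch depends only on the warp index, so all lanes of a warp take the
// same path together and no masking is needed.
// Assumes out has gridDim.x * blockDim.x elements.
__global__ void convergent(int* out) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // identical for all 32 lanes of a warp
    if ((warp % 2) == 0) {
        out[i] = i * 2;      // entire warp takes this path
    } else {
        out[i] = i * 3;      // entire warp takes this path
    }
}
```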
Modern GPU Architectures and Evolution
GPUs have seen remarkable growth in computational power over time. Starting with earlier designs like the NVIDIA GeForce GTX 285, which featured streaming multiprocessors containing SIMD functional units, GPUs have scaled significantly in terms of core count and performance.
Modern GPUs like the NVIDIA V100 and H100 contain thousands of streaming processors (cores) and deliver far higher peak floating-point performance, increasingly through specialized deep-learning units called Tensor Cores.
Tensor Cores are specialized functional units within SMs designed to accelerate matrix multiplication, a core operation in neural networks. They are examples of specialized hardware within a generally SIMD architecture, reflecting the trend towards domain-specific acceleration.
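As a sketch of how Tensor Cores are exposed to the programmer, CUDA's warp-level WMMA API lets one warp cooperatively multiply small matrix tiles. The fragment shapes and launch below are only a minimal illustration (not a tuned GEMM), assuming an sm_70-or-newer GPU.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C = A * B (FP16 inputs, FP32 accumulator).
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);           // the whole warp loads the A tile
    wmma::load_matrix_sync(b_frag, B, 16);           // the whole warp loads the B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launch with exactly one warp, e.g.: wmma_tile<<<1, 32>>>(dA, dB, dC);
```

Note that all 32 threads of the warp cooperate on one mma_sync call, which mirrors the warp-based SIMD grouping described above.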
Modern GPUs also feature larger register files to support many concurrent warps and increasingly sophisticated memory hierarchies to provide the necessary data bandwidth for data-parallel workloads.
Conclusion
This lecture highlighted GPUs as sophisticated SIMD processors that exploit data parallelism through warp-based execution. The SIMT programming model allows developers to write code using threads while the hardware dynamically groups them into warps for SIMD execution on specialized functional units. This approach, combined with warp-level fine-grained multithreading, effectively hides latency and leverages available parallelism, particularly for data-parallel workloads common in graphics and machine learning. While branch divergence remains a challenge, techniques like dynamic warp merging help mitigate its impact on SIMD utilization. The evolution of GPUs demonstrates a continuous increase in parallelism and the addition of specialized units like Tensor Cores to meet the demands of emerging applications.