
This overview explains key computer architecture concepts, focusing on their definitions, differences, applications, interrelations, and the design decisions behind them, with particular attention to what is needed to understand architectures like Graphics Processing Units (GPUs).

1. Pipelining

  • What: Pipelining is a technique used in processor design to increase instruction throughput (the number of instructions completed per unit of time). It works by breaking down the processing of an instruction into a series of smaller, sequential stages (e.g., Fetch, Decode, Execute, Memory Access, Write-back). Multiple instructions can be in different stages of execution simultaneously, akin to an assembly line.
  • Design Decision: The primary goal of pipelining is to improve instruction throughput, not necessarily to reduce the latency (total time) of a single instruction. By overlapping instructions, more of the processor’s hardware components are kept busy.
  • Difference from Non-Pipelined: In a non-pipelined processor, one instruction must complete all its stages before the next instruction can begin. Pipelining allows the first stage of a new instruction to begin as soon as the previous instruction moves from the first stage to the second.
  • Challenges: Pipelining introduces “hazards” – situations that prevent the next instruction in the pipeline from executing during its designated clock cycle. These include:
    • Data Hazards: An instruction depends on the result of a previous instruction still in the pipeline. (Solutions: forwarding, stalling).
    • Structural Hazards: Two instructions require the same hardware resource at the same time. (Solutions: duplicating resources, stalling).
    • Control Hazards (Branch Hazards): The outcome (and target) of a branch is not known until later in the pipeline, so the processor may fetch instructions from the wrong path. (Solutions: branch prediction, branch delay slots, flushing the pipeline).
  • Where used: Virtually all modern processors, including Central Processing Units and Graphics Processing Units, employ pipelining extensively.
  • Builds on: The concept of an instruction cycle that can be divided into distinct, sequential stages.
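
To make the throughput-versus-latency trade-off concrete, here is a minimal sketch (illustrative numbers, not taken from any specific processor) comparing the cycle counts of a non-pipelined processor and an ideal 5-stage pipeline with no hazards:

```cuda
#include <cstdio>

int main() {
    const int stages = 5;                   // assumed stages: Fetch, Decode, Execute, Memory, Write-back
    const long long instructions = 1000000;

    // Non-pipelined: each instruction passes through all stages before the next one starts.
    long long nonPipelined = instructions * stages;

    // Ideal pipeline: after 'stages' cycles to fill, one instruction completes every cycle.
    long long pipelined = stages + (instructions - 1);

    printf("non-pipelined: %lld cycles\n", nonPipelined);
    printf("pipelined    : %lld cycles (speedup ~%.2fx)\n",
           pipelined, (double)nonPipelined / pipelined);
    // Each individual instruction still takes 5 cycles of latency; only throughput improves,
    // and real pipelines give back some of this speedup to data, structural, and control hazards.
    return 0;
}
```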

2. Out-of-Order Execution

  • What: Out-of-Order Execution is a sophisticated processor execution model where instructions are dynamically reordered by the hardware for execution based on data dependencies and the availability of execution units, rather than their original program order. This allows the processor to look ahead and find independent instructions to execute while earlier, dependent instructions are stalled (e.g., waiting for data from memory).
  • Design Decision: The goal is to maximize the utilization of the processor’s execution units and hide memory latency, thereby increasing Instruction-Level Parallelism (the parallelism among individual instructions) and single-thread performance. It makes the hardware more complex but can significantly speed up programs with irregular dependencies.
  • Difference from In-Order Execution: In-order processors execute instructions strictly sequentially as they appear in the program. Out-of-Order Execution processors can execute instructions “early” if their inputs are ready and an execution unit is free, though they typically commit results in program order to maintain correctness.
  • Difference from Superscalar/Very Long Instruction Word: Out-of-Order Execution describes when individual instructions are executed relative to each other (dynamic reordering). Superscalar and Very Long Instruction Word architectures concern how many instructions are issued or executed per cycle. Out-of-Order Execution is often combined with superscalar designs.
  • Where used: Predominantly in high-performance Central Processing Units (e.g., Intel Core series, AMD Ryzen).
  • Builds on: Complex hardware concepts like reservation stations (based on Tomasulo’s algorithm), register renaming (to resolve false Write-After-Write and Write-After-Read dependencies), and reorder buffers (to ensure instructions are committed in the correct program order).
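
As a schematic illustration (not from the text, and the actual overlap depends on the core, the compiler, and the data), the loop below mixes a potentially long-latency gather load with independent work; an out-of-order core can execute the independent addition while the load is still in flight, whereas an in-order core stalls at the first use of the loaded value:

```cuda
#include <cstdio>

int main() {
    const int n = 1 << 16;
    static float data[1 << 16], other[1 << 16];
    static int idx[1 << 16];
    for (int i = 0; i < n; ++i) { data[i] = 1.0f; other[i] = 2.0f; idx[i] = (i * 7919) % n; }

    float acc = 0.0f, sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        float v = data[idx[i]];  // gather load: may take many cycles if it misses the cache
        acc += v;                // depends on the load; an in-order core stalls here
        sum += other[i];         // independent; an out-of-order core can issue and execute this
                                 // while the load is still in flight, an in-order core cannot
    }
    printf("%f %f\n", acc, sum);
    return 0;
}
```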

3. Superscalar Architecture

  • What: A superscalar architecture allows a processor to issue and execute more than one instruction during a single clock cycle. This is achieved by providing multiple parallel execution units (e.g., multiple Arithmetic Logic Units, Floating-Point Units, load/store units).
  • Design Decision: The goal is to increase Instruction-Level Parallelism by executing multiple instructions truly in parallel within a single core. This requires complex instruction fetch and dispatch logic to identify independent instructions.
  • Difference from Scalar: Scalar processors execute at most one instruction per cycle.
  • Difference from Very Long Instruction Word (VLIW): Superscalar processors dynamically identify and dispatch multiple instructions using hardware dependency checking at runtime. VLIW architectures rely on the compiler to statically bundle multiple independent operations into a single, very long instruction.
  • Difference from Single Instruction, Multiple Data (SIMD): Superscalar architectures issue and execute several different instructions from one instruction stream concurrently. SIMD architectures execute a single instruction on multiple data elements simultaneously.
  • Where used: Most modern Central Processing Units (desktop, server, mobile).
  • Builds on: The presence of multiple functional units, advanced instruction fetch and decode logic, and is often combined with pipelining and Out-of-Order Execution to effectively utilize the multiple units.
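
One practical consequence, shown as a hedged sketch rather than a guaranteed outcome: giving a superscalar core several independent accumulators exposes more Instruction-Level Parallelism than one serial accumulator, because independent additions can be issued in the same cycle:

```cuda
#include <cstdio>

// Sum with four independent accumulators so a superscalar core can issue several
// independent floating-point adds per cycle instead of waiting on one serial chain.
float sum4(const float* x, int n) {
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        a0 += x[i];     // these four adds are independent of each other,
        a1 += x[i + 1]; // so the hardware may execute them in parallel
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; ++i) a0 += x[i]; // leftover elements
    return (a0 + a1) + (a2 + a3);
}

int main() {
    static float x[1024];
    for (int i = 0; i < 1024; ++i) x[i] = 0.5f;
    printf("%f\n", sum4(x, 1024));
    return 0;
}
```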

4. Very Long Instruction Word (VLIW) Architecture

  • What: A Very Long Instruction Word architecture is one where the compiler groups multiple independent, primitive operations (opcodes) into a single, very long instruction. The hardware is simpler as it doesn’t need complex scheduling logic; it directly executes these pre-packaged operations in parallel on multiple dedicated functional units.
  • Design Decision: The core design decision is to shift the complexity of finding and scheduling parallel instructions from the hardware (as in superscalar) to the compiler. This leads to simpler, potentially lower-power hardware, but heavily relies on the compiler’s ability to find sufficient Instruction-Level Parallelism.
  • Difference from Superscalar: VLIW relies on static (compile-time) scheduling of parallelism. Superscalar uses dynamic (run-time) hardware scheduling.
  • Difference from Single Instruction, Multiple Data (SIMD): VLIW architectures execute multiple, potentially different, operations that are part of one long instruction word in parallel. SIMD architectures execute a single type of operation on multiple data elements.
  • Where used: Some Digital Signal Processors, embedded systems, and historically some high-performance computing. Intel's Itanium processor family, built on the EPIC (Explicitly Parallel Instruction Computing) philosophy, incorporated VLIW principles. Graphics Processing Unit Streaming Multiprocessors share some conceptual similarities in how multiple operations can be dispatched to execution units, but the overall programming model (Single Instruction, Multiple Threads) is different.
  • Builds on: Advanced compiler technology capable of detecting and scheduling Instruction-Level Parallelism. Requires multiple functional units that are explicitly addressed by different parts of the very long instruction.

5. Systolic Arrays

  • What: A systolic array is a specialized parallel architecture consisting of a network of simple, regularly connected Processing Elements. Data flows rhythmically across the array through local interconnections, and computation happens at each Processing Element as data passes through. They are designed for specific, high-throughput computations with regular data flow.
  • Design Decision: The design prioritizes high throughput for specific tasks (like matrix multiplication or convolutions) by maximizing parallel computation and minimizing long-distance data movement. Data is “pumped” through the array, and results emerge after passing through several Processing Elements. This makes them very efficient for their intended applications.
  • Difference from general-purpose SIMD/Graphics Processing Units: Systolic arrays are typically hardwired or configured for very specific algorithms with fixed dataflow paths. General-purpose Single Instruction, Multiple Data engines (like those in Graphics Processing Units) are programmable for a wider range of data-parallel tasks. Data movement is highly structured and local in systolic arrays.
  • Where used: Hardware accelerators for Artificial Intelligence/machine learning (e.g., Google’s Tensor Processing Units, specialized Digital Signal Processors), signal processing, and image processing. The Tensor Cores in modern Graphics Processing Units are often described as systolic-array-like structures for matrix multiplication.
  • Builds on: Principles of pipelining and parallel processing with a strong emphasis on localized data communication to reduce bottlenecks and power consumption.
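
The dataflow can be sketched in a short simulation (an illustration of one common output-stationary design with an assumed 4x4 array, not a description of any particular product): each Processing Element holds one element of the result, values of A flow across rows, values of B flow down columns with a one-cycle skew, and every PE performs one multiply-accumulate per cycle as operands pass through:

```cuda
#include <cstdio>

constexpr int N = 4; // assumed N x N grid of Processing Elements (PEs)

int main() {
    int A[N][N], B[N][N], C[N][N] = {};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = i + j + 1; B[i][j] = i - j; }

    // PE(i,j) accumulates C[i][j]. With rows of A entering from the left and columns of B
    // entering from the top, each skewed by one cycle, the operand pair (A[i][k], B[k][j])
    // reaches PE(i,j) at cycle i + j + k.
    for (int cycle = 0; cycle <= 3 * (N - 1); ++cycle)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                int k = cycle - i - j;            // which operands arrive at PE(i,j) this cycle
                if (k >= 0 && k < N)
                    C[i][j] += A[i][k] * B[k][j]; // one multiply-accumulate per PE per cycle
            }

    // After 3*(N-1) + 1 cycles the whole N x N product is resident in the array.
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) printf("%5d", C[i][j]);
        printf("\n");
    }
    return 0;
}
```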

6. Flynn’s Taxonomy of Parallel Architectures

Flynn’s taxonomy classifies computer architectures based on the number of concurrent instruction streams and data streams.

a) Single Instruction, Single Data (SISD)

  • What: A single processor executes a single instruction stream operating on a single data stream. This is the traditional uniprocessor model.
  • Design Decision: Simplicity; focused on serial execution.
  • Where used: Older microcontrollers, very simple Central Processing Units, or as the conceptual basis for a single core in more complex systems.

b) Single Instruction, Multiple Data (SIMD)

  • What: A single instruction operates concurrently on multiple data elements. This is achieved by having multiple processing elements, each with its own data, all executing the same instruction fetched and decoded by a central control unit.
  • Design Decision: To efficiently process large arrays or vectors of data by applying the same operation to all elements simultaneously, thus achieving high data parallelism with relatively simple control logic.
  • Key Idea: Exploits data-level parallelism.
  • Where used:
    • Vector processors (e.g., historical Cray supercomputers).
    • Multimedia and vector extensions in Central Processing Units (e.g., MMX, Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX), NEON).
    • Graphics Processing Units are fundamentally Single Instruction, Multiple Data engines. They execute a single instruction across many threads (grouped into a warp/wavefront) on different data.
  • Builds on: The concept of data parallelism. Requires hardware with multiple data processing lanes and a mechanism to broadcast instructions.

c) Multiple Instruction, Single Data (MISD)

  • What: Multiple instructions operate concurrently on a single data stream. This architecture is rare in practice.
  • Design Decision: Largely a theoretical category; its practical value is niche and often debated. It is sometimes used to describe fault-tolerant systems in which multiple processors execute different operations on the same data for cross-checking, or certain pipelined architectures in which data passes through stages that each apply a different instruction.
  • Where used: Very niche applications; some interpretations might include certain types of pipelined processing or fault-tolerant systems designed for extreme reliability.

d) Multiple Instruction, Multiple Data (MIMD)

  • What: Multiple autonomous processors, each capable of executing different instruction streams independently on different data streams.
  • Design Decision: To provide the most general form of parallelism, allowing different parts of a task, or different tasks altogether, to run concurrently. This is highly flexible but requires more complex coordination and communication mechanisms.
  • Where used: Most modern parallel systems:
    • Multi-core Central Processing Units (each core is an independent processor capable of running its own instruction stream).
    • Clusters of computers and distributed computing systems.
    • A Graphics Processing Unit system (including the Central Processing Unit and the Graphics Processing Unit device) can be seen as MIMD. While each Streaming Multiprocessor on the GPU executes in a Single Instruction, Multiple Data fashion, multiple Streaming Multiprocessors can execute different kernels or different parts of a kernel concurrently, and the CPU runs its own separate programs.
  • Builds on: Having multiple independent processing units, each with its own control logic and data paths.

7. Graphics Processing Unit (GPU) Architecture Concepts

Graphics Processing Units leverage Single Instruction, Multiple Data principles extensively for massive data parallelism, primarily through an execution model called Single Instruction, Multiple Threads (SIMT).

a) GPU Thread

  • What: The most basic, programmer-visible unit of parallel execution in Graphics Processing Unit programming models (like NVIDIA’s Compute Unified Device Architecture (CUDA) or the Open Computing Language (OpenCL)). It’s a single instance of a “kernel” (a function written by the programmer to run on the GPU), executing a sequence of scalar instructions. Each thread has its own logical program counter, private registers, and operates on a distinct piece of data.
  • Design Decision: To provide a fine-grained, highly scalable unit of parallelism that programmers can reason about, allowing the expression of massive data parallelism.
  • Difference from CPU Thread: GPU threads are extremely lightweight compared to Central Processing Unit threads (which are typically managed by the Operating System). A Graphics Processing Unit can keep tens of thousands of such threads resident concurrently, and programs routinely launch millions of them; they are created, scheduled, and switched by hardware for efficiency.
  • Where used: Programmers define computations in terms of these threads.
  • Builds on: The Single Program, Multiple Data (SPMD) programming model.
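
A minimal CUDA sketch (a generic vector-add kernel, assumed here for illustration) of what “one thread per data element” looks like in practice: every thread computes its own global index and touches exactly one element:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread handles exactly one element of the output.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // this thread's global index
    if (i < n) c[i] = a[i] + b[i];                 // one element per thread
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                                // programmer-chosen block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // enough blocks to cover n elements
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]); // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```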

b) Warp (NVIDIA) / Wavefront (AMD)

  • What: A hardware-managed group of Graphics Processing Unit threads (typically 32 threads for NVIDIA, or 32/64 for AMD). All threads within a warp execute the same instruction at the same time (in lockstep) on their respective private data. This is the fundamental unit of Single Instruction, Multiple Data execution on a Graphics Processing Unit.
  • Design Decision: To efficiently map groups of threads to the underlying Single Instruction, Multiple Data hardware execution units, amortizing instruction fetch and control overheads across multiple threads. (More details in Section 8).
  • Difference from individual GPU Thread: A warp is a hardware grouping of threads for SIMD execution. Threads in a warp share an instruction stream at any given moment.
  • Where used: The Graphics Processing Unit’s Streaming Multiprocessors schedule and execute work in units of warps.
  • Builds on: GPU threads. This is how GPUs implement their SIMD/SIMT execution model.
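
A short sketch (using CUDA’s warp-shuffle intrinsics, which assume compute capability 3.0 or newer and CUDA 9+) that makes the warp visible: each thread derives its warp and lane IDs, and the 32 lanes of a warp exchange register values directly with __shfl_down_sync to sum their contributions without touching memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warpSum() {
    int lane = threadIdx.x % warpSize;   // position within the warp (0..31 on NVIDIA)
    int warp = threadIdx.x / warpSize;   // which warp of the block this thread belongs to

    int val = lane + 1;                  // each lane contributes its own value
    // Tree reduction within the warp: lanes exchange register values directly.
    // This works because all 32 lanes of the warp execute each step together.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if (lane == 0)                       // lane 0 now holds 1 + 2 + ... + 32 = 528
        printf("block %d, warp %d: sum = %d\n", blockIdx.x, warp, val);
}

int main() {
    warpSum<<<1, 64>>>();                // 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}
```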

c) Block (Thread Block / Cooperative Thread Array)

  • What: A programmer-defined grouping of Graphics Processing Unit threads (e.g., 64, 128, 256, up to a hardware limit like 1024 threads per block). Threads within the same block can:
    • Cooperate by sharing data through a fast on-chip shared memory.
    • Synchronize their execution using barriers.
  • A block is executed entirely on a single Streaming Multiprocessor. The hardware divides a block into one or more warps for execution.
  • Design Decision: To provide a mechanism for threads to cooperate and share data efficiently, and to allow programmers to organize parallelism hierarchically, matching problem structure to hardware capabilities.
  • Difference from Warp: A block is a logical, programmer-defined grouping for cooperation and resource sharing, typically comprising multiple warps. A warp is a hardware-imposed grouping for SIMD execution.
  • Where used: Programmers structure their parallel tasks into blocks to manage computations that require inter-thread communication or synchronization.
  • Builds on: GPU threads.
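
A sketch of the two block-level capabilities listed above, shared memory and barrier synchronization, in the form of a simple block-wide reduction (the block size is an assumption and must be a power of two for this version):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256   // assumed block size (power of two)

__global__ void blockSum(const int* in, int* out) {
    __shared__ int s[BLOCK];                 // fast on-chip memory shared by the block
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * BLOCK + tid];   // each thread loads one element
    __syncthreads();                         // barrier: all loads visible to the whole block

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride) s[tid] += s[tid + stride]; // pairwise partial sums
        __syncthreads();                             // wait before the next round reads them
    }
    if (tid == 0) out[blockIdx.x] = s[0];    // one partial sum per block
}

int main() {
    const int blocks = 4, n = blocks * BLOCK;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, blocks * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;

    blockSum<<<blocks, BLOCK>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[0] = %d (expect %d)\n", out[0], BLOCK);
    cudaFree(in); cudaFree(out);
    return 0;
}
```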

d) Single Instruction, Multiple Threads (SIMT) Execution Model

  • What: The execution model predominantly used by modern Graphics Processing Units. Programmers write code for individual scalar threads (following the Single Program, Multiple Data model). The hardware groups these threads into warps. Each warp executes a single common instruction across all its active threads, but each thread operates on its own private data and has its own register state and logical program counter.
  • Design Decision: To combine the programming ease of a threaded model with the efficiency of Single Instruction, Multiple Data hardware. It allows for more flexible control flow than traditional vector SIMD.
  • Key Feature: Branch Divergence Handling: If threads within the same warp encounter a conditional branch based on their unique data values and thus need to take different control flow paths (this is “branch divergence”), the hardware manages this. Typically, it serializes the execution: threads taking one path execute while the threads taking the other paths are temporarily deactivated (masked). Once the first path is complete, the masks are flipped and the remaining paths execute in turn, until every path taken by any thread in the warp has run.
  • Difference from traditional SIMD: Traditional SIMD often involves explicit vector instructions operating on dedicated vector registers. SIMT uses scalar instructions for threads, and the SIMD behavior is managed by the hardware grouping threads into warps. SIMT provides more flexibility in handling control flow divergence.
  • Where used: The core execution model of modern Graphics Processing Units (NVIDIA, AMD, Intel).
  • Builds on: Warps/wavefronts, GPU threads, and hardware mechanisms for managing divergent execution paths (e.g., using activity masks).
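
The kernel below (an illustrative sketch, not a benchmark) shows branch divergence inside a warp: odd and even lanes take different paths, so the hardware runs the two paths one after the other with inactive lanes masked. It also shows a warp-aligned condition that keeps every warp on a single path and therefore does not diverge:

```cuda
#include <cuda_runtime.h>

__global__ void divergent(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent: within any warp, even and odd lanes want different paths.
    // The SM serializes the two paths, masking off the lanes not on the current path.
    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;   // executed first, odd lanes masked off
    else
        data[i] = data[i] + 1.0f;   // executed second, even lanes masked off

    // Warp-aligned: all lanes of a given warp agree on this condition
    // (threadIdx.x / warpSize is the warp index within the block), so no serialization occurs.
    if ((threadIdx.x / warpSize) % 2 == 0)
        data[i] -= 0.5f;
}

int main() {
    const int n = 256;
    float* d;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) d[i] = 1.0f;
    divergent<<<1, n>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```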

e) Single Program, Multiple Data (SPMD) Programming Model

  • What: A programming model where a single program (or “kernel” in GPU terminology) is written by the developer, and this same program is executed by many autonomous processing elements (which are GPU threads in this context). Each thread executes the same program code but operates on different data. Each thread has its own logical instruction pointer and can, therefore, follow different execution paths if the code contains data-dependent branches.
  • Design Decision: To provide a scalable and intuitive way for programmers to express data parallelism without having to manage low-level vectorization explicitly.
  • Difference from SIMD (execution model): SPMD is a programming paradigm. SIMD (or SIMT for GPUs) is an execution model describing how the hardware runs the code. SPMD programs map very well to SIMD/SIMT hardware.
  • Difference from MIMD (programming model): In pure Multiple Instruction, Multiple Data, different processors can run entirely different programs. In SPMD, all processing elements run instances of the same program code.
  • Where used: The dominant programming model for Graphics Processing Units (e.g., using CUDA or OpenCL frameworks), and also prevalent in high-performance computing using frameworks like Message Passing Interface (MPI).
  • Builds on: The concept of data parallelism.

f) Shader Core / Streaming Processor (SP) / CUDA Core

  • What: The most basic execution unit within a Graphics Processing Unit’s Streaming Multiprocessor, capable of performing arithmetic and logical operations for a single thread in a warp. It’s essentially a scalar processor. A Streaming Multiprocessor contains many such Streaming Processors (e.g., 32, 64, 128) that operate in parallel, forming the SIMD lanes.
  • Design Decision: To provide simple, replicated computational units that can execute in parallel under the control of a single instruction (for a warp), maximizing computational density.
  • Difference from Central Processing Unit Core: A CPU core is much more complex, typically capable of Out-of-Order Execution, sophisticated branch prediction, and running an entire Operating System. A Streaming Processor is simpler, designed primarily for the arithmetic operations of one thread as part of a warp’s SIMD execution.
  • Where used: These are the fundamental Arithmetic Logic Units/Floating-Point Units inside a Streaming Multiprocessor.
  • Builds on: Basic Arithmetic Logic Unit and Floating-Point Unit design. Many Streaming Processors working in lockstep form the SIMD execution lanes of a Streaming Multiprocessor.

g) (Warp-Level) Fine-Grained Multithreading

  • What: A hardware technique used by Graphics Processing Unit Streaming Multiprocessors to hide latency and maximize the utilization of execution units. An SM can manage the context (Program Counter, registers) for many concurrent warps. If an actively executing warp stalls (e.g., waiting for data from memory, which can take hundreds of cycles), the SM’s warp scheduler can very quickly (often in one or a few cycles) switch to another resident warp that is ready to execute.
  • Design Decision: To keep the numerous execution units within a Streaming Multiprocessor busy despite long memory latencies inherent in accessing off-chip memory. This prioritizes throughput over single-warp latency.
  • Difference from Coarse-Grained Multithreading: Coarse-grained multithreading switches threads only on long stalls, and software context switches by the Operating System are slower still; both carry far higher overhead per switch. Fine-grained warp switching is near-instantaneous because every resident warp’s context stays in hardware, so it can happen every cycle.
  • Difference from CPU Simultaneous Multithreading (SMT): CPU SMT (like Intel’s Hyper-Threading) allows multiple threads (instruction streams) to share the resources of a single, often superscalar and Out-of-Order, CPU core in the same cycle. GPU warp-level multithreading rapidly switches between entire warps, effectively interleaving their execution on the SIMD functional units to hide stalls. The goal of keeping hardware busy is similar, but the granularity and mechanism differ.
  • Where used: A core feature of modern Graphics Processing Unit Streaming Multiprocessors.
  • Builds on: Hardware support for storing and quickly switching the contexts for many warps on a Streaming Multiprocessor, and efficient warp scheduling logic.
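
Whether an SM has enough resident warps to hide latency can be estimated with the CUDA occupancy API. The sketch below (using the real cudaOccupancyMaxActiveBlocksPerMultiprocessor call, with an assumed kernel and block size) asks how many blocks of a given kernel one SM can keep resident, which bounds how many warps the scheduler can interleave:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // the memory accesses here are what the SM wants to hide
}

int main() {
    const int blockSize = 256;           // assumed block size
    int blocksPerSM = 0;

    // How many blocks of this kernel fit on one SM, given its register and shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int warpsPerSM = blocksPerSM * blockSize / prop.warpSize;

    printf("resident blocks per SM: %d -> resident warps per SM: %d (limit: %d threads per SM)\n",
           blocksPerSM, warpsPerSM, prop.maxThreadsPerMultiProcessor);
    // More resident warps give the scheduler more ready candidates to switch to when one warp stalls.
    return 0;
}
```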

h) Interrelations and Hierarchy (GPU Context)

  1. Programmers write kernel code using the Single Program, Multiple Data (SPMD) model, defining computations for individual GPU Threads.
  2. These threads are organized by the programmer into Blocks (Cooperative Thread Arrays) for managing locality (data sharing via shared memory) and cooperation (synchronization).
  3. The Graphics Processing Unit hardware groups threads from a block into Warps (or Wavefronts).
  4. Warps are the fundamental units of scheduling and Single Instruction, Multiple Data (SIMD) execution within a Streaming Multiprocessor, following the Single Instruction, Multiple Threads (SIMT) model.
  5. A Streaming Multiprocessor (SM) executes warps using its many Streaming Processors (SPs), which act as the parallel SIMD lanes.
  6. The Streaming Multiprocessor employs Fine-grained Multithreading at the warp level to switch between different warps rapidly, hiding memory and pipeline latencies and keeping the Streaming Processors busy.
  7. The overall GPU architecture is a massively parallel SIMD engine at the SM/warp level. Multiple SMs on a GPU, potentially executing different kernels or parts of a large kernel, along with the Central Processing Unit, contribute to making the entire system an MIMD platform.
  8. Specialized units like Tensor Cores (often using Systolic Array principles) can be part of an SM to further accelerate specific, highly regular SIMD-friendly operations like matrix multiplication, common in Artificial Intelligence workloads.
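
Putting the hierarchy into numbers, a small host-side sketch (with an assumed launch configuration) counts how many threads, warps, and blocks a single kernel launch creates and how they map onto the hardware:

```cuda
#include <cstdio>

int main() {
    // Assumed launch configuration, chosen only for illustration.
    const int threadsPerBlock = 256;   // programmer-chosen block size (a Cooperative Thread Array)
    const int numBlocks       = 1024;  // grid size
    const int warpSize        = 32;    // NVIDIA warp width

    const int warpsPerBlock = threadsPerBlock / warpSize;        // hardware splits each block into warps
    const long long totalThreads = (long long)numBlocks * threadsPerBlock;
    const long long totalWarps   = (long long)numBlocks * warpsPerBlock;

    printf("threads: %lld, blocks: %d, warps per block: %d, warps total: %lld\n",
           totalThreads, numBlocks, warpsPerBlock, totalWarps);
    // Each block runs entirely on one SM; that SM interleaves its resident warps
    // (fine-grained multithreading) across its Streaming Processors (the SIMD lanes).
    return 0;
}
```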

8. The Rationale for Warps/Wavefronts in Graphics Processing Units

The grouping of threads into warps (or wavefronts) for execution on a Streaming Multiprocessor is a cornerstone of Graphics Processing Unit design. This approach, where threads in a warp are at the same Program Counter at instruction issue, might seem restrictive compared to letting each thread operate completely independently. However, the reasons are rooted in maximizing efficiency for data-parallel workloads and managing hardware complexity.

a) Core Design Principle: Efficiently Exploiting Data Parallelism

Graphics Processing Units are engineered for massive data parallelism – executing the same operations on vast numbers of data elements. The warp/wavefront concept is the hardware’s mechanism to manage and execute these parallel threads in the most efficient manner on its underlying Single Instruction, Multiple Data (SIMD) hardware.

b) Key Benefits of the Warp/Wavefront Model

  • Maximizing SIMD Hardware Utilization:

    • A Streaming Multiprocessor contains numerous simple execution units (Streaming Processors). For example, if an SM has 32 Streaming Processors dedicated to processing a warp, issuing a single instruction for a 32-thread warp allows all 32 Streaming Processors to perform that operation concurrently, each on its respective thread’s data. This yields 32 results with the control overhead equivalent to managing just one instruction stream at that moment, achieving high computational throughput.
  • Reducing Instruction Processing Overhead:

    • Fetch & Decode Amortization: Instead of each of the, say, 32 threads in a warp independently fetching, decoding, and preparing its own instruction for execution, the Streaming Multiprocessor performs these steps once for the entire warp. This drastically reduces the silicon area, complexity, and power that would otherwise be consumed by individual instruction processing logic for every single thread.
    • Operand Management: Addressing and collecting operands for an instruction can also be managed more efficiently for a group of threads that are all executing that same instruction.
  • Simplifying Control Logic:

    • Managing tens of thousands of threads (a typical concurrent load on a modern Graphics Processing Unit) as fully independent instruction streams (as in a pure Multiple Instruction, Multiple Data model) would require enormously complex and power-hungry control logic on the chip.
    • By grouping threads into warps, the Streaming Multiprocessor’s scheduler and control units manage execution at the warp level. This significantly reduces the number of entities to track and schedule, making the hardware design more tractable, scalable, and power-efficient.
  • Enhancing Energy Efficiency:

    • The reduction in instruction fetch/decode operations per thread and the simplified control logic directly translate to lower power consumption per useful computation performed. For a device executing trillions of operations per second, this energy efficiency is paramount.

c) Comparison with Fully Independent Threads (MIMD Model)

Allowing each thread to “do its own thing in its own time” independently is characteristic of the Multiple Instruction, Multiple Data (MIMD) model, which is how Central Processing Unit cores generally operate.

  • Central Processing Unit Core Design: CPU cores are optimized for single-thread performance and task parallelism. They feature sophisticated (and large) control logic for features like Out-of-Order Execution and complex branch prediction for their individual instruction streams.
  • Scalability Challenge for GPUs: Building a Graphics Processing Unit with thousands of such complex “mini-MIMD cores” (one for each thread) would be impractical in terms of silicon area, manufacturing cost, and power consumption for the data-parallel tasks GPUs excel at. The chip would be dominated by control logic rather than by the actual computation units.

d) Flexibility through Single Instruction, Multiple Threads (SIMT) and Branch Divergence

While threads in a warp execute the same instruction from the instruction stream, the Single Instruction, Multiple Threads (SIMT) model provides crucial flexibility:

  • Branch Divergence: If threads within a warp encounter a conditional branch and need to take different paths based on their individual data, the hardware accommodates this. It serializes the execution of the divergent paths: threads taking one path execute while others are temporarily deactivated (masked). Then, threads that took other paths execute.
  • Design Trade-off: This serialization means that during periods of divergence, some of the Streaming Multiprocessor’s execution units are temporarily idle, which reduces peak SIMD efficiency. However, this is a fundamental design trade-off. It allows Graphics Processing Units to be highly programmable for a wide variety of algorithms, offering more flexibility than stricter traditional vector-SIMD architectures where all elements in a vector must follow the exact same control flow path.

e) Summary of Design Choices and Trade-offs

The warp/wavefront model is a strategic design choice for Graphics Processing Units that heavily favors throughput and efficiency for data-parallel operations. It achieves this by:

  • Amortizing the costs of instruction processing (fetch, decode, issue) over many threads within a warp.
  • Simplifying the hardware required for managing massive concurrency.
  • Directly and efficiently mapping thread execution to the underlying Single Instruction, Multiple Data (SIMD) computational resources.

The primary “cost” or trade-off is a reduction in absolute efficiency when threads within a warp diverge significantly in their control flow. Nevertheless, this is generally a highly effective trade-off, enabling the massive parallelism and programmability that make Graphics Processing Units indispensable for workloads like graphics rendering, scientific simulations, and machine learning.

9. TLDR

This section provides a very brief overview of the key concepts discussed:

  • Pipelining: Speeds up processors by overlapping instruction execution stages, like an assembly line, improving instruction throughput.
  • Out-of-Order Execution: Makes Central Processing Units (CPUs) “smarter” by reordering instructions on the fly to avoid stalls and keep execution units busy, boosting single-thread performance.
  • Superscalar Architecture: Enables CPUs to execute multiple different instructions simultaneously in a single core by having multiple execution units.
  • Very Long Instruction Word (VLIW): Simplifies hardware by having the compiler pre-pack multiple independent operations into single “very long” instructions for parallel execution.
  • Systolic Arrays: Specialized hardware with a grid of simple processors optimized for specific, regular data-flow tasks like matrix multiplication, offering high throughput by minimizing data movement.
  • Flynn’s Taxonomy (SISD, SIMD, MISD, MIMD): A classification of computer architectures:
    • SISD: Traditional single processor, single data stream.
    • SIMD (Single Instruction, Multiple Data): One instruction acts on many data elements at once. Key for Graphics Processing Units (GPUs) and vector processing.
    • MISD: Multiple instructions on one data stream (rare).
    • MIMD (Multiple Instruction, Multiple Data): Multiple processors run different instructions on different data. Typical of multi-core CPUs.
  • GPU Architecture Concepts:
    • GPU Thread: Lightweight unit of execution.
    • Warp/Wavefront: A group of ~32 threads executing the same instruction in lockstep (SIMD behavior).
    • Block/CTA: A group of threads that can cooperate using shared memory and synchronize.
    • SIMT (Single Instruction, Multiple Threads): GPU execution model where threads in a warp run the same instruction but can diverge, offering flexibility on SIMD hardware.
    • SPMD (Single Program, Multiple Data): Programming model where many threads run the same program on different data.
    • Streaming Processor (SP)/Shader Core: The basic arithmetic unit in a GPU.
    • Fine-Grained Multithreading: GPUs rapidly switch between warps to hide memory latency and keep hardware busy.
  • Rationale for Warps in GPUs: Grouping threads into warps is a core GPU design decision to efficiently execute data-parallel tasks on SIMD hardware by:
    • Amortizing instruction fetch/decode costs over many threads.
    • Simplifying control logic for managing thousands of threads.
    • Maximizing the use of parallel execution units.
  This design prioritizes throughput and energy efficiency for parallel workloads, accepting a performance cost when threads within a warp diverge.