Lecture from: 05.05.2023 | Video: YT
This lecture introduced Single Instruction, Multiple Data (SIMD) architectures, a key paradigm for exploiting data parallelism, and discussed its historical context, core principles, advantages, disadvantages, and modern instantiations in general-purpose processors and specialized accelerators.
Introducing SIMD
While out-of-order (OoO) and superscalar architectures focus on executing independent instructions (potentially different operations) in parallel, SIMD targets a different form of concurrency: performing the same operation on multiple pieces of data simultaneously.
This type of parallelism, known as Data Parallelism, is prevalent in applications like multimedia processing, scientific computing, and machine learning, where the same computation needs to be applied across large datasets. SIMD architectures are designed specifically to exploit this regularity.
Flynn’s Taxonomy: Classifying Parallel Computers
To categorize different computer architectures based on how they handle instructions and data streams, Flynn’s Taxonomy (introduced by Michael Flynn in 1966) provides a useful framework. The taxonomy classifies architectures based on the number of instruction streams and data streams processed concurrently:
- SISD (Single Instruction, Single Data): Traditional scalar processors, executing one instruction on one data element at a time.
- SIMD (Single Instruction, Multiple Data): Executes a single instruction on multiple data elements simultaneously. This is the focus of this lecture.
- MISD (Multiple Instruction, Single Data): Executes multiple instructions on a single data stream. Systolic arrays are often considered the closest practical examples.
- MIMD (Multiple Instruction, Multiple Data): Executes multiple independent instruction streams on multiple data streams. This includes multiprocessors and multithreaded processors.
SIMD Processing Paradigm: Exploiting Data Parallelism
The SIMD processing paradigm is fundamentally about exploiting regular (data) parallelism.
The core idea is that a single instruction dictates the same operation to be performed on multiple data elements. For example, a single vector add instruction might specify the addition of corresponding elements from two vectors, resulting in a new vector where each element is the sum of the corresponding input elements. This is distinct from dataflow where different operations might fire in parallel based on data availability, or thread parallelism where completely different instruction streams execute concurrently.
The efficiency benefit of SIMD arises from amortizing the overhead of instruction fetch, decode, and control over multiple data operations. Instead of fetching and decoding an ‘add’ instruction one million times to add two one-million-element vectors in a scalar loop, a single vector add instruction is fetched and decoded once, controlling one million additions performed on vector elements.
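To make the amortization concrete, here is a minimal C sketch (not from the lecture) contrasting a scalar loop with a SIMD version using x86 AVX intrinsics; it assumes an AVX-capable CPU and single-precision data.

```c
#include <immintrin.h>  // x86 AVX intrinsics (assumes an AVX-capable CPU)

// Scalar version: the add instruction is fetched and decoded once per element.
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// SIMD version: one vector instruction adds 8 floats at a time,
// amortizing fetch/decode/control over 8 data operations.
void add_simd(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)  // scalar cleanup for leftover elements
        c[i] = a[i] + b[i];
}
```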
Vector or array processors, as SIMD machines, leverage a “time-space duality” to process multiple data elements:
- Array Processors: Use multiple Processing Elements (PEs) or execution units in space to perform the same operation on multiple data elements at the same time. Data elements are processed concurrently by different PEs.
- Vector Processors: Use potentially fewer PEs (often pipelined) to perform the same operation on multiple data elements in consecutive time steps. Data elements flow through the PE over time.
Vector Processors: Architecture and Operation
Vector processors are a historical and foundational example of SIMD architectures. They operate on one-dimensional arrays of numbers (vectors).
Key architectural components include:
- Vector Registers: Store multiple data elements as a vector (e.g., N M-bit values).
- Vector Control Registers: Include registers like Vector Length (VLEN) to specify the number of elements to process in a vector operation, Vector Stride (VSTR) to indicate the distance between consecutive data elements in memory, and Vector Mask Register (VMASK) for handling conditional operations on vector elements.
- Vector Functional Units: Often deeply pipelined to process elements of a vector operation efficiently.
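The following toy C model (a hypothetical sketch, not a real ISA) illustrates how VLEN, VSTR, and VMASK together steer a single vector add: VLEN bounds the element count, VSTR sets the spacing between elements in memory, and VMASK disables individual lanes for conditional execution.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_VLEN 64  // hypothetical maximum vector register length

// Toy model of one masked, strided vector add: process vlen elements
// spaced vstr apart, but only where the corresponding mask bit is set.
void vadd_masked(double *dst, const double *src_a, const double *src_b,
                 size_t vlen,            // VLEN: number of elements to process
                 ptrdiff_t vstr,         // VSTR: distance between consecutive elements
                 uint64_t vmask) {       // VMASK: one enable bit per element
    for (size_t i = 0; i < vlen && i < MAX_VLEN; i++) {
        if (vmask & (1ULL << i))
            dst[i * vstr] = src_a[i * vstr] + src_b[i * vstr];
    }
}
```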
Memory Banks
Loading and storing vectors efficiently from/to memory is crucial. To sustain a high throughput (ideally one element per clock cycle) when memory access latency is longer than one cycle, memory is organized into banks. Data elements of a vector are interleaved across these banks. By accessing different banks in consecutive cycles, the processor can hide the memory latency and receive elements at a rate of one per cycle, provided the stride and number of banks are appropriately chosen.
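A small illustrative C program (assumed parameters, not lecture data) shows the interleaving idea: with unit stride, element i maps to bank i mod NUM_BANKS, and as long as the number of banks is at least the bank access latency, a new element can be started every cycle.

```c
#include <stdio.h>

#define NUM_BANKS 8
#define BANK_LATENCY 6   // assumed latency of one bank access, in cycles

int main(void) {
    for (int i = 0; i < 16; i++) {
        int bank = i % NUM_BANKS;     // bank holding element i (unit stride)
        int issue_cycle = i;          // one new access issued per cycle
        int ready_cycle = issue_cycle + BANK_LATENCY;
        printf("element %2d -> bank %d, issued cycle %2d, ready cycle %2d\n",
               i, bank, issue_cycle, ready_cycle);
    }
    // Because NUM_BANKS (8) >= BANK_LATENCY (6), no bank is re-accessed
    // before its previous access finishes, so steady-state throughput
    // is one element per cycle.
    return 0;
}
```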
Vector instructions allow for deep functional unit pipelines without the control hazards (branches) or data hazards (dependencies between elements of the same vector) that plague scalar pipelines. This simplifies control and allows for high functional unit utilization.
However, vector processors are specialized. They are highly efficient only if the parallelism in the application can be expressed as vector operations (regular data parallelism). They are inefficient for workloads with irregular parallelism, such as linked list traversals, where data dependencies prevent parallelization.
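A linked-list sum in C makes the problem visible: each iteration's load address comes from the previous iteration, so there is no fixed stride for a vector load and the elements cannot be processed by one vector instruction.

```c
#include <stddef.h>

struct node { int val; struct node *next; };

// The loads form a serial chain: the address of the next element is only
// known after the current node has been loaded, so the work cannot be
// expressed as a single strided vector operation.
int sum_list(const struct node *p) {
    int sum = 0;
    while (p) {
        sum += p->val;   // uses the node loaded in this iteration
        p = p->next;     // next address depends on that same load
    }
    return sum;
}
```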
Furthermore, memory bandwidth can become a bottleneck if data is not laid out and accessed with appropriate strides that leverage the memory banking structure. Achieving optimal data mapping to memory banks can be challenging, sometimes requiring programmer effort or sophisticated compiler/runtime support.
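As a sketch of why stride matters (assuming the simple interleaving above, element index mod NUM_BANKS), the snippet below prints which banks a few strides touch: a stride that is a multiple of the bank count sends every access to the same bank, serializing them.

```c
#include <stdio.h>

#define NUM_BANKS 8

int main(void) {
    int strides[] = {1, 3, 8};  // e.g., stride 8 = column access of an 8-wide row-major matrix
    for (int s = 0; s < 3; s++) {
        printf("stride %d touches banks:", strides[s]);
        for (int i = 0; i < 8; i++)
            printf(" %d", (i * strides[s]) % NUM_BANKS);
        printf("\n");  // stride 8 hits only bank 0 -> a conflict on every access
    }
    return 0;
}
```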
Despite these limitations, vector processors were highly successful in scientific computing.
Continue here: 20 GPU Architectures