Lecture from: 04.05.2023 | Video: YT
We introduced two additional paradigms for achieving instruction-level concurrency: Very Long Instruction Word (VLIW) architectures and Systolic Arrays. These approaches represent alternative design points compared to the Out-of-Order (OoO) superscalar processors discussed previously, differing significantly in their philosophy regarding hardware-software co-design and specialization.
VLIW Architectures: A Software-Centric Approach
Very Long Instruction Word (VLIW) architectures propose a different philosophy for achieving superscalar-like performance by shifting the responsibility for finding and packaging parallel instructions from hardware to software, specifically the compiler.
In traditional superscalar processors, the hardware dynamically fetches multiple instructions, checks for dependencies between them, and schedules independent instructions for execution on available functional units.
The core idea of VLIW is that the compiler performs the complex analysis required to identify independent instructions and packs them together into a single, larger instruction called a VLIW instruction word or “bundle.” This bundle is then fetched as a unit, and all of its operations are issued to the functional units concurrently.
A VLIW instruction word consists of multiple operation slots, each tied to a specific type of functional unit (e.g., an integer ALU slot, a floating-point unit slot, a load/store unit slot). The compiler places the independent instructions it finds into the appropriate slots within the VLIW word. A key characteristic of traditional VLIW is that the hardware does not perform dependency checking between instructions within the same VLIW bundle; the compiler guarantees their independence.
Traditional VLIW architectures operate in lock-step: all operations within a fetched VLIW bundle begin execution concurrently, and the processor waits for the longest-latency operation in the bundle to complete before proceeding to the next VLIW word. If any operation in the bundle stalls, the entire bundle (and thus all concurrent operations) stalls.
This design philosophy leads to a trade-off: Simple Hardware, Complex Compiler. The hardware control logic is simplified because it doesn’t need to perform dynamic scheduling, renaming, or complex dependence checking within a bundle. The complexity is shifted to the compiler, which must perform sophisticated static analysis and scheduling to effectively fill the VLIW slots with independent instructions and minimize stalls arising from dependencies or variable-latency operations.
In an ideal scenario, a W-wide VLIW processor could achieve an IPC of W by fetching and executing W useful instructions in each cycle, packed into VLIW words. However, achieving this ideal IPC is challenging. The compiler must find enough independent instructions to fill all slots in every VLIW word. If it cannot, it must insert NOPs (No Operations), which waste functional unit slots and reduce the effective IPC.
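To make the packing step concrete, here is a minimal Python sketch (not from the lecture) of a greedy, in-order bundle packer for a hypothetical 3-wide machine: it places mutually independent instructions into one bundle and pads unfilled slots with NOPs, so the printed effective IPC falls below the width of 3 whenever dependences force a new bundle. The `Instr` representation and `pack_bundles` routine are purely illustrative; real VLIW compilers perform far more sophisticated, latency-aware scheduling across basic blocks.

```python
from dataclasses import dataclass

W = 3  # assumed bundle width: number of operation slots per VLIW word

@dataclass
class Instr:
    name: str
    dst: str = None       # register written by this instruction (None for NOPs)
    srcs: tuple = ()      # registers read by this instruction

def pack_bundles(instrs, width=W):
    """Greedy in-order packing: an instruction joins the current bundle only if it
    does not read or write a register already written within that same bundle."""
    bundles, current, written = [], [], set()
    for ins in instrs:
        independent = not (set(ins.srcs) & written or ins.dst in written)
        if independent and len(current) < width:
            current.append(ins)
        else:
            # Close the current bundle, padding unused slots with NOPs.
            bundles.append(current + [Instr("NOP")] * (width - len(current)))
            current, written = [ins], set()
        if ins.dst:
            written.add(ins.dst)
    if current:
        bundles.append(current + [Instr("NOP")] * (width - len(current)))
    return bundles

prog = [
    Instr("load", "r1", ("a",)),
    Instr("load", "r2", ("b",)),
    Instr("add",  "r3", ("r1", "r2")),   # depends on both loads -> starts a new bundle
    Instr("mul",  "r4", ("r3", "r3")),   # depends on the add    -> starts a new bundle
]
bundles = pack_bundles(prog)
useful = sum(1 for b in bundles for op in b if op.name != "NOP")
print(f"{len(bundles)} bundles, effective IPC = {useful / len(bundles):.2f}")  # 3 bundles, IPC 1.33
```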
A major challenge for the VLIW compiler is dealing with variable-latency operations, particularly memory accesses (loads and stores). The time it takes to complete a load instruction is highly unpredictable due to cache hits/misses and memory access times, which are not known at compile time. If a VLIW word contains a load, and the compiler schedules instructions based on an assumed latency, but the actual latency is longer, the entire VLIW bundle (and potentially subsequent bundles in a pipelined VLIW) will stall in lock-step. This lack of tolerance for variable latency significantly limits performance.
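A toy calculation, using made-up latencies, shows why this hurts: because the whole bundle advances only when its slowest slot finishes, a single cache-missing load inflates the cost of every operation issued alongside it.

```python
# Toy lock-step cost model with assumed latencies (illustrative numbers, not from the lecture):
# each bundle takes as long as its slowest slot.
ASSUMED_LATENCY = {"alu": 1, "mul": 3, "load_hit": 2, "load_miss": 20, "nop": 1}

def total_cycles(bundles):
    return sum(max(ASSUMED_LATENCY[op] for op in bundle) for bundle in bundles)

compile_time_schedule = [["load_hit", "alu", "mul"], ["alu", "alu", "nop"]]   # compiler's assumption
actual_execution      = [["load_miss", "alu", "mul"], ["alu", "alu", "nop"]]  # the load misses at runtime

print(total_cycles(compile_time_schedule))  # 3 + 1 = 4 cycles expected by the compiler
print(total_cycles(actual_execution))       # 20 + 1 = 21 cycles observed at runtime
```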
Compilers use techniques like profiling to estimate instruction latencies and branch behaviors to make better scheduling decisions. However, profiling is input-set dependent, and real-world program execution may deviate from the profile, leading to suboptimal schedules and increased stalls at runtime.
Despite these challenges, VLIW has seen commercial success, particularly in embedded systems, Digital Signal Processors (DSPs), and some early GPUs, where workloads are often more predictable, and power/energy constraints are stringent. These markets can leverage the simpler hardware design and potential for energy efficiency per operation, even if not all slots are always filled.
The VLIW philosophy has also heavily influenced compiler design. Techniques developed for VLIW compilation, such as Trace Scheduling and Superblock scheduling, aim to identify and optimize frequently executed paths in the control flow graph by creating larger, straight-line code blocks (traces or superblocks) that are amenable to aggressive static scheduling and parallelization.
However, forming these blocks and optimizing them requires complex compiler analysis and can increase code size due to necessary code duplication.
In summary, VLIW offers the potential for simplified hardware and potentially higher clock frequencies by shifting complexity to the compiler. However, its limitations in handling variable latency operations, the difficulty for compilers to find sufficient parallelism in irregular code, and the challenges of managing code size and recompilation have limited its widespread adoption in the general-purpose CPU market compared to Out-of-Order Superscalar execution.
Systolic Arrays: Specialized Parallel Accelerators
Systolic Arrays represent a paradigm of highly specialized parallel architectures designed for specific computation domains, contrasting sharply with the general-purpose nature of CPUs.
The primary goal of systolic arrays is to design accelerators with a simple, regular structure, high concurrency, and balanced computation and I/O (memory) bandwidth.
The core idea is to replace a single processing element (PE) with a regular array of PEs, carefully orchestrating the flow of data between these PEs. Data flows from memory into the array, passes through multiple PEs in a rhythmic fashion, and the collectively transformed data is eventually written back to memory.
Unlike CPUs executing sequential instructions from a program counter, systolic arrays are often data-driven or stream-driven. Data elements (inputs, weights, intermediate results) flow through the array, and PEs perform fixed operations on the data they receive. The timing and choreography of data flow are crucial for correctness and efficiency, often requiring careful orchestration of input data streams.
Systolic arrays are particularly well-suited for computations that involve repetitive operations on large streams of data with predictable data access patterns and local data reuse. Examples include:
- Convolution: Used extensively in filtering, image processing, signal processing, and Convolutional Neural Networks (CNNs), a key component of modern machine learning.
- Matrix Multiplication: Another fundamental computation suitable for systolic arrays, requiring multiplying elements from two matrices and accumulating results (a sketch follows this list).
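The following minimal Python simulation (illustrative only, not a description of any particular product) shows the flavor of an output-stationary systolic matrix multiply: each PE keeps one accumulator of C, elements of A flow rightward and elements of B flow downward through neighbor-to-neighbor registers, with the input streams skewed in time so that matching operands meet in the right PE at the right cycle. Other mappings (e.g., weight-stationary dataflows) follow the same rhythmic principle.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle model of an output-stationary systolic array computing C = A @ B.
    PE(i, j) holds accumulator C[i, j]; 'a' values flow rightward, 'b' values flow downward."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    a_out = np.zeros((n, m))   # value each PE passes to its right neighbor next cycle
    b_out = np.zeros((n, m))   # value each PE passes to its lower neighbor next cycle
    for t in range(n + m + k - 2):                 # cycles until the last PE finishes
        new_a, new_b = np.zeros((n, m)), np.zeros((n, m))
        for i in range(n):
            for j in range(m):
                # Boundary PEs read the skewed input streams; inner PEs read their neighbors.
                a = (A[i, t - i] if 0 <= t - i < k else 0.0) if j == 0 else a_out[i, j - 1]
                b = (B[t - j, j] if 0 <= t - j < k else 0.0) if i == 0 else b_out[i - 1, j]
                C[i, j] += a * b                   # local multiply-accumulate (output stays put)
                new_a[i, j], new_b[i, j] = a, b    # latch operands for the neighbors
        a_out, b_out = new_a, new_b
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)   # matches the reference result
```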
Systolic arrays offer significant advantages for the specific computation domains they target:
- High Efficiency and Performance: Specialized PEs and orchestrated data flow maximize computation per data element brought from memory, leading to high throughput and efficiency.
- Simple, Regular Design: Arrays of identical, simple PEs make design and implementation easier compared to complex general-purpose processors.
- Balanced I/O: Careful data orchestration balances computation with memory bandwidth, mitigating the memory bottleneck.
However, their major downside is specialization. They are not generally applicable to arbitrary computations: the computation must “fit” the structure and functions of the PEs in the array. Programming them for different tasks often requires different array structures, or significant effort to map the computation efficiently onto a fixed array that may be suboptimal for the task.
Early systolic arrays were strictly hardwired for a single function. Later designs introduced more programmability, allowing PEs to store multiple “weights” or parameters that could be selected on the fly, enabling variations of the core function (e.g., adaptive filtering). Further extensions allow PEs to have local memory and execute small “kernels” or sequences of operations, leading to more generalized stream processing or staged execution models, where complex computations are broken into stages flowing through specialized PEs.
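As a sketch of this kind of programmability, the following toy Python model (hypothetical names, not a specific historical design) simulates a 1D chain of FIR-filter PEs, each holding a small bank of weights that can be switched between runs, in the spirit of the adaptive-filtering designs described above. For simplicity the input sample is broadcast to all PEs (a fully systolic variant would also pass it through registers), while partial sums pulse through the chain one PE per cycle.

```python
def systolic_fir(samples, weight_banks, select=0):
    """One cycle per input sample: PE k adds its active weight times x[t] to the partial
    sum registered by PE k-1 in the previous cycle; the last PE's register is the output."""
    weights = [bank[select] for bank in weight_banks]   # each PE picks its active weight
    regs = [0.0] * len(weights)                         # per-PE partial-sum register
    outputs = []
    for x in samples:
        # New register values depend only on last cycle's registers (systolic rhythm).
        regs = [(regs[k - 1] if k else 0.0) + weights[k] * x for k in range(len(weights))]
        outputs.append(regs[-1])   # y[t] = sum_j weights[-1-j] * x[t-j], with a zero-padded past
    return outputs

# The same array computes two different filters simply by flipping the weight selector.
banks = [(1.0, 0.25), (2.0, 0.5), (3.0, 0.25)]          # (filter-0 weight, filter-1 weight) per PE
samples = [1.0, 2.0, 3.0, 4.0]
print(systolic_fir(samples, banks, select=0))           # [3.0, 8.0, 14.0, 20.0]
print(systolic_fir(samples, banks, select=1))           # [0.25, 1.0, 2.0, 3.0]
```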
Despite the historical focus on specialized hardwired designs, modern examples, particularly in machine learning accelerators, showcase the continued relevance and evolution of the systolic array concept. Google’s Tensor Processing Unit (TPU) is a prime example, utilizing large 2D systolic arrays of multiply-accumulate units optimized for matrix multiplication, the core operation in many neural network layers.
Other large-scale specialized accelerators, such as the Cerebras Wafer Scale Engine (WSE), which integrates a massive number of cores on a single large chip for deep learning, also embody the principle of many specialized processing units tailored for AI workloads.
Continue here: 18.5 Decoupled Access-Execute