Lecture from: 06.05.2023 | Video: YT
This lecture introduced Decoupled Access-Execute (DAE) as another paradigm for extracting instruction-level concurrency, distinct from pipelining, Out-of-Order (OoO) execution, VLIW, and systolic arrays. DAE proposes a unique hardware-software co-design approach focusing on separating memory access and execution.
Decoupled Access-Execute (DAE) Concept
Decoupled Access-Execute (DAE) architectures arose from the perception in the early 1980s that dynamically scheduled processors (like those based on Tomasulo’s algorithm) were overly complex to implement at the time, before the advent of widely successful Out-of-Order designs like the Intel Pentium Pro.
The core idea of DAE is to decouple the instruction stream into two separate streams: an Access stream and an Execute stream.
- The Access stream (executed by an Access Processor) primarily handles memory operations (loads and stores), including address calculations and data fetching.
- The Execute stream (executed by an Execute Processor) handles computational operations (arithmetic, logic, etc.) and potentially control flow.
These two streams communicate and synchronize through ISA-visible queues. Load instructions in the Access stream, upon fetching data from memory, deposit it into a data queue visible to the Execute stream. Conversely, Execute stream instructions that produce values needed for address calculations in the Access stream deposit those values into a queue visible to the Access stream. Control flow synchronization (like branches) is handled via a separate branch queue.
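This queue-based interplay can be sketched as a toy Python simulation (not an actual DAE ISA; the thread structure, queue names, and latency model are all invented for illustration). The Access "processor" streams loaded operand pairs into a data queue and later drains results for stores, while the Execute "processor" consumes operands and computes, so the two run asynchronously:

```python
import threading
import queue
import time

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

data_q = queue.Queue()      # A -> E: loaded operands
result_q = queue.Queue()    # E -> A: computed values to be stored

def access_stream():
    # Loads: run ahead, fetching operand pairs into the data queue.
    for x, y in zip(a, b):
        time.sleep(0.01)            # model memory latency
        data_q.put((x, y))
    # Stores: drain results produced by the Execute stream.
    return [result_q.get() for _ in a]

def execute_stream():
    # Compute on whatever the Access stream has already fetched;
    # get() blocks only when A has not yet run ahead.
    for _ in a:
        x, y = data_q.get()
        result_q.put(x + y)

t = threading.Thread(target=execute_stream)
t.start()
c = access_stream()
t.join()
print(c)  # [11, 22, 33, 44]
```

Because both queues are FIFO with a single producer and consumer each, the stores come out in program order even though the two "processors" slip relative to one another in time.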
This separation allows the Access and Execute processors to run largely asynchronously with respect to each other. If the Execute processor is stalled waiting for data from memory, the Access processor can continue running ahead, prefetching data into the data queue. If the Access processor is stalled (e.g., waiting for a value from the Execute stream needed to compute an address), the Execute processor can run ahead, performing computations using data already available in its input queues.

This dynamic asynchronous execution provides latency tolerance without requiring the complex dynamic scheduling hardware (like reservation stations and wake-up/select logic) found in OoO processors. Communication through queues, which can be scaled in size, replaces the complex broadcast networks and dependence checking logic.
The partitioning of a single instruction stream into A and E streams is typically performed by the compiler. The compiler analyzes the program, identifies operations belonging to each stream, and generates two distinct instruction sequences that explicitly communicate via queue operations inserted by the compiler.
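The compiler's job can be illustrated with a hypothetical toy partitioner (the opcode names and the rewriting rules below are invented, not from any real DAE compiler). It classifies each instruction of a single stream by whether it touches memory, rewriting loads to deposit into the A-to-E data queue and stores to consume from the E-to-A queue:

```python
def partition(stream):
    """Split one instruction stream into Access and Execute streams.

    Loads become queue-producing ops, stores become queue-consuming
    ops, and everything else goes to the Execute stream (which reads
    its operands from, and writes results to, the queues)."""
    access, execute = [], []
    for op, *args in stream:
        if op == "load":
            access.append(("load_to_queue", *args))    # result -> data queue
        elif op == "store":
            access.append(("store_from_queue", *args)) # operand <- E->A queue
        else:
            execute.append((op, *args))
    return access, execute

# One iteration of c[i] = a[i] + b[i] as a toy single stream:
prog = [
    ("load", "r1", "a[i]"),
    ("load", "r2", "b[i]"),
    ("add", "r3", "r1", "r2"),
    ("store", "r3", "c[i]"),
]
A, E = partition(prog)
```

A real compiler must additionally handle values flowing both ways (e.g., computed addresses) and keep the queue orders consistent between the two streams; this sketch only shows the basic classification step.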
Advantages and Disadvantages of DAE
The Decoupled Access-Execute paradigm offers several advantages:
- Latency Tolerance: The asynchronous execution of the Access and Execute streams allows for tolerance of memory latency (if the Access stream can run ahead) and computation latency (if the Execute stream can run ahead). This is a key benefit over traditional in-order pipelines.
- Queue-based Communication: Communication via ISA-visible queues is simpler to implement and potentially more scalable than the complex tag-matching and broadcast mechanisms in OoO processors. Queues reduce the need for a large centralized physical register file to hold all speculative values.
- Potential for Specialization: The Access and Execute processors can be specialized and optimized for their respective tasks (e.g., the Access processor could have specialized address calculation units, while the Execute processor could focus on arithmetic pipelines).
However, DAE also has significant disadvantages:
- Compiler Support: DAE relies heavily on the compiler to effectively partition the program into two streams and manage queue communication and synchronization. The quality of the compiler’s partitioning directly impacts performance.
- Synchronization Complexity: Branch instructions require synchronization between the A and E streams to ensure that both streams follow the correct control flow path. This adds complexity to both the compiler and the hardware.
- Instruction Stream Management: Managing two separate instruction streams introduces overhead compared to a single sequential stream. Early DAE implementations generated two explicit streams, but later work explored dynamically steering a single stream into A and E pipelines.
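The branch-queue synchronization mentioned above can be sketched in Python under one plausible division of labor (assumed for illustration: the Execute stream evaluates each loop-branch condition and publishes the outcome, and the Access stream pops outcomes so both streams follow the same path; real designs may divide this differently):

```python
import queue

branch_q = queue.Queue()   # E -> A: branch outcomes

def execute_stream(n):
    # Evaluate the loop-branch condition and publish each outcome.
    i = 0
    while True:
        taken = i < n
        branch_q.put(taken)
        if not taken:
            break
        # ... compute loop body ...
        i += 1

def access_stream(fetched):
    # Follow the outcomes decided by the Execute stream.
    while branch_q.get():
        fetched.append("fetch next element")

fetched = []
execute_stream(3)          # run sequentially here for simplicity
access_stream(fetched)
```

For a trip count of 3, the Execute stream publishes taken, taken, taken, not-taken, and the Access stream performs exactly three fetches before stopping.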
Examples
- Astronautics ZS-1: The Astronautics ZS-1 processor, designed by James E. Smith, is an example of a DAE machine that dynamically steers instructions from a single stream into separate Access (A) and Execute (X) pipelines. These pipelines operate in order internally but are decoupled from each other and communicate via queues.
Related Concepts
Loop Unrolling
Loop unrolling is a compiler optimization technique that replicates the body of a loop multiple times within a single iteration. This reduces loop-control overhead (fewer branches and induction-variable updates) and enlarges basic blocks, giving compilers (in VLIW or DAE) or hardware (in OoO) more opportunity to find independent instructions and parallelize execution.
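A minimal hand-unrolled example (the function names are illustrative, and for brevity the unrolled version assumes the trip count is a multiple of 4; a real compiler would add a cleanup loop for the remainder):

```python
def saxpy(alpha, x, y):
    # Original loop: one operation and one branch per iteration.
    for i in range(len(x)):
        y[i] += alpha * x[i]

def saxpy_unrolled4(alpha, x, y):
    # Unrolled by 4: four independent multiply-adds per iteration
    # with no loop branch between them, enlarging the basic block.
    for i in range(0, len(x), 4):
        y[i]     += alpha * x[i]
        y[i + 1] += alpha * x[i + 1]
        y[i + 2] += alpha * x[i + 2]
        y[i + 3] += alpha * x[i + 3]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y1 = [0.0] * 8
y2 = [0.0] * 8
saxpy(2.0, x, y1)
saxpy_unrolled4(2.0, x, y2)
```

Both versions compute the same result; the unrolled body is what exposes the four independent operations that a VLIW compiler, a DAE partitioner, or an OoO scheduler can then overlap.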
Decoupling in Modern OoO
While not implemented as separate, ISA-visible streams, the principle of decoupling memory access and execution units persists even in modern OoO processors. Load/Store Units (responsible for memory access) and Execution Units (responsible for computation) operate somewhat independently and communicate via internal buffers (Load Queue, Store Queue). The OoO scheduler can issue memory operations separately from computational operations, leveraging some degree of decoupling.
These separate queues and execution clusters allow for specialized pipelines for memory and computation and enable some degree of asynchronous operation between them, even within the overall OoO framework.
Conclusion
Decoupled Access-Execute architectures propose separating a program’s instruction stream into Access and Execute streams to overlap memory access and computation, tolerating latencies without the full complexity of Out-of-Order dynamic scheduling. While the pure DAE model with ISA-visible queues has not become dominant in general-purpose CPUs, its core principle of decoupling and asynchronous operation between different parts of the microarchitecture, particularly memory and execution units, is present to some extent even in modern OoO processors. DAE highlights a different point in the hardware-software design space, emphasizing compiler responsibility for partitioning and orchestration, and queue-based communication as an alternative to complex broadcast networks.
Continue here: 19 SIMD Architectures (Vector and Array Processors)