Lecture from: 30.03.2023 | Video: YT
Today, we delve deeper into the realm of Microarchitecture Design. Building upon our introduction to microarchitecture and the exploration of the simple single-cycle design last lecture, we will now examine the limitations of that approach and embark on designing more sophisticated multi-cycle microarchitectures. Our goal is to understand how processors execute instructions efficiently by breaking down the process into multiple steps, orchestrated by control logic.
Agenda for Today & Next Few Lectures
Our progression through the levels of abstraction continues. Today, we complete our discussion on multi-cycle microarchitectures. Next lecture, we will introduce pipelining, a crucial technique for improving processor performance, and then delve into the complexities and issues associated with it. Following the Easter break, we will explore out-of-order execution, a more advanced microarchitectural technique.
- Instruction Set Architectures (ISA): LC-3 and MIPS (Covered)
- Assembly programming: LC-3 and MIPS (Covered in Lecture 9b, relevant for labs)
- Microarchitecture (principles & single-cycle uarch) (Started Last Lecture)
- Multi-cycle microarchitecture (Today)
- Pipelining (Next Lecture)
- Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery (Following Next Lecture)
- Out-of-Order Execution (After Easter Break)
Implementing the ISA: Microarchitecture Basics (Recap)
Let’s quickly recall the fundamental concepts we established in the last lecture.
The Microarchitecture is the hardware implementation of the Instruction Set Architecture (ISA), designed to meet specific performance, power, cost, and other goals. It encompasses all the hardware details that are not visible to the software programmer.
Recall: A Very Basic Instruction Processing Engine (Single-Cycle)
We started by designing the simplest possible microarchitecture: the Single-Cycle Machine. In this design, each instruction is executed in a single clock cycle. The entire transformation of the Architectural State (AS) from before the instruction to after (AS’) happens within this one cycle, using only combinational logic. There are no intermediate, programmer-invisible state updates during the instruction’s execution.
Recall: The Instruction Processing “Cycle” vs. Machine Clock Cycle
It’s important to distinguish between the conceptual steps of processing an instruction (Fetch, Decode, Execute, etc., sometimes referred to as the Instruction Processing “Cycle”) and the physical Machine Clock Cycle.
In a Single-Cycle Machine, all phases of the instruction processing cycle (Fetch, Decode, etc.) must complete within a single machine clock cycle. In contrast, a Multi-Cycle Machine allows these phases to take multiple machine clock cycles. Each phase, in turn, might even take multiple clock cycles depending on the design.
Recall: Datapath and Control Logic
Any instruction processing engine, whether single-cycle or multi-cycle, fundamentally consists of two main parts:
- Datapath: The hardware components that perform data manipulation and transfer. This includes functional units (like the ALU), storage units (registers, memory), and interconnections (wires, multiplexers, decoders, tri-state buffers) that route the data.
- Control Logic: The hardware responsible for generating the control signals that direct the datapath elements. These signals tell the datapath what operations to perform and how to route the data according to the instruction being executed.
Single-Cycle Microarchitecture: From the Ground Up (Review)
Last lecture, we built a single-cycle MIPS processor microarchitecture. Let’s revisit its structure and the principles we used.
Single-Cycle MIPS Datapath Review
The single-cycle MIPS datapath we designed included components like the Program Counter (PC), Instruction Memory, Register File, ALU, Data Memory, sign extension units, and various multiplexers to route data. For instance, instructions are fetched from memory, registers are read, the ALU performs computations or address calculations, memory is accessed for loads/stores, and results are written back to the register file.
In this single-cycle design, all these operations for a given instruction happen in one continuous flow through the combinational logic within a single clock cycle. The control logic generates all the necessary signals based on the instruction to enable this entire flow in one go.
Evaluating the Single-Cycle Microarchitecture
Now, let’s critically evaluate the single-cycle design we built. Is it a good design? When is it good, and when is it bad? How can we design something better?
Performance Analysis Basics (Recap)
To evaluate microarchitectures quantitatively, we use performance metrics. The execution time of a program is fundamentally determined by:
Execution Time = (Number of Instructions) × (Average Cycles Per Instruction) × (Clock Cycle Time)
- Number of Instructions (N): The count of dynamic instructions executed by the program. This is largely determined by the ISA and the compiler.
- Cycles Per Instruction (CPI): The average number of machine clock cycles required to execute one instruction. This is heavily influenced by the microarchitecture.
- Clock Cycle Time (T): The duration of one machine clock cycle, typically limited by the longest combinational path in the hardware (the critical path). This is a microarchitectural design parameter, limited by the technology and design.
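These three factors multiply directly. As a quick sanity check, here is the equation as a small Python helper; the numbers in the example are illustrative, not measurements from any real machine.

```python
def execution_time(n_instructions, avg_cpi, cycle_time_ps):
    """Execution Time = N x CPI x T (result in picoseconds)."""
    return n_instructions * avg_cpi * cycle_time_ps

# Illustrative: 1 million dynamic instructions, CPI = 1, 600 ps clock cycle
t_ps = execution_time(1_000_000, 1, 600)
print(t_ps)  # 600000000 ps, i.e. 0.6 ms
```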
Analyzing the Single-Cycle Microarchitecture’s Performance
Let’s apply this framework to the single-cycle design.
- CPI: In a single-cycle microarchitecture, by definition, every instruction takes exactly 1 clock cycle to complete. So, CPI is strictly 1.
- Clock Cycle Time (T): Since all instruction processing must complete within one cycle, the clock cycle time is determined by the delay of the slowest instruction through the combinational logic of the datapath. This is the critical path.
Identifying the Slowest Instruction (Critical Path Analysis)
To find the clock cycle time, we need to determine the critical path. The critical path is the longest delay path through the combinational logic, from a state element output to a state element input, within one clock cycle. For our single-cycle MIPS datapath, the critical path for each instruction type involves traversing different components (Instruction Memory, Register File, ALU, Data Memory, etc.).
Let’s use the component delays we assumed last lecture:
- Memory (read/write): 200 ps
- ALU/Adders: 100 ps
- Register File (read/write): 50 ps
- Other logic/wires: 0 ps (idealized)
Tracing the critical path for different instruction types, using the idealized delays above:

- R-type: Inst. Memory + Reg. Read + ALU + Reg. Write = 200 + 50 + 100 + 50 = 400 ps
- LW: Inst. Memory + Reg. Read + ALU + Data Memory + Reg. Write = 200 + 50 + 100 + 200 + 50 = 600 ps
- SW: Inst. Memory + Reg. Read + ALU + Data Memory = 200 + 50 + 100 + 200 = 550 ps
- beq: Inst. Memory + Reg. Read + ALU (compare) = 200 + 50 + 100 = 350 ps
- Jump: Inst. Memory = 200 ps

The longest path is 600 ps, for the Load Word (LW) instruction. Therefore, the clock cycle time (T) for this single-cycle microarchitecture must be at least 600 ps. This means all instructions, even a simple Jump (200 ps), must take a full 600 ps clock cycle to complete.
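The per-instruction path delays can be tallied mechanically from the component delays. The path lists below are a sketch that follows the lecture's idealization (muxes, wires, and control logic taken as 0 ps).

```python
# Idealized component delays (ps), as assumed in the lecture
DELAYS = {"mem": 200, "alu": 100, "regfile": 50}

# Components traversed by each instruction type in the single-cycle datapath
PATHS = {
    "r-type": ["mem", "regfile", "alu", "regfile"],         # fetch, reg read, ALU, reg write
    "lw":     ["mem", "regfile", "alu", "mem", "regfile"],  # adds a data-memory read
    "sw":     ["mem", "regfile", "alu", "mem"],             # no register write back
    "beq":    ["mem", "regfile", "alu"],                    # compare in the ALU
    "j":      ["mem"],                                      # instruction fetch only
}

delays = {instr: sum(DELAYS[c] for c in path) for instr, path in PATHS.items()}
cycle_time = max(delays.values())  # the slowest instruction sets the clock
print(cycle_time)  # 600 (the LW path)
```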
Impact of Control Logic on Critical Path
While our simple model assumes control logic has zero delay, in reality, the time it takes to generate control signals can also contribute to the critical path, especially if the control logic is complex or involves accessing slower components like a control store memory (as in some historical microprogrammed designs).
The Real World: Memory is Slow!
Our assumption of 200 ps for memory access is highly optimistic for realistic main memory. If memory access sometimes takes significantly longer (e.g., 150 nanoseconds), this fundamentally dictates the clock cycle time.
If memory access takes 150 nanoseconds (150,000 picoseconds) and a Load instruction is on the critical path, the clock cycle would need to be at least 150 ns + other delays. This means a simple register-to-register addition or a jump instruction would also take over 150 nanoseconds to complete, which is extremely inefficient compared to how fast an ALU operation or PC update could actually be performed.
Furthermore, if an instruction requires multiple memory accesses (like an indirect load or complex string operations), a single-cycle design would need multiple memory ports and parallel data paths, adding significant hardware complexity and still being limited by the slowest access time.
Single-Cycle uArch: Summary of Downsides
The single-cycle microarchitecture, while simple conceptually, has significant practical drawbacks:
- Contrived and Inefficient: All instructions are forced to run at the speed of the slowest instruction. There is significant waste of time within each clock cycle for faster instructions.
- High Hardware Cost: Resources needed by an instruction in different phases of its execution (like the ALU or memory) must be duplicated if the instruction needs them multiple times within the same clock cycle. This adds unnecessary hardware.
- Not Suitable for Complex ISAs: Implementing complex instructions with multiple operations and memory accesses in a single clock cycle would require immense and impractical combinational logic.
- Difficult to Optimize: You cannot effectively optimize the performance of frequently occurring, simple instructions because their speed is bottlenecked by the slowest instruction. Optimization effort must focus on the rare, slow cases, which is often inefficient.
Single-Cycle Design Violates Core Microarchitecture Design Principles
These downsides show that the single-cycle design fundamentally violates several key microarchitecture design principles:
- Critical Path Design: It fails to minimize the clock cycle time by setting it according to the overall longest instruction delay, rather than the delay of smaller, balanced stages.
- Bread and Butter (Common Case) Design: It ties performance to the worst-case, infrequent instruction instead of the common, frequent ones. Optimizing a common instruction yields no speedup, because the clock cycle is still dictated by the slowest instruction.
- Balanced Design: It does not balance the workload across hardware resources over time. Resources needed sequentially by an instruction must be duplicated for concurrent use within the single cycle, leading to unbalanced and potentially inefficient hardware utilization.
Can We Do Better? Multi-Cycle Microarchitectures
Given the severe limitations of the single-cycle approach, especially concerning clock speed and hardware efficiency for realistic ISAs and memory latencies, we need a better solution. The answer is Multi-Cycle Microarchitectures.
Multi-Cycle Microarchitectures: Goal and Idea
- Goal: Let each instruction take approximately only as much time as it really needs to complete. Avoid wasting time within clock cycles and avoid forcing all instructions to run at the speed of the slowest.
- Idea: Break instruction processing down into multiple smaller clock cycles or “states”. By doing so, we can determine the clock cycle time independently of the total time it takes to process any single instruction. Each instruction can then take as many clock cycles as it needs, executing a sequence of microarchitectural states tailored to that instruction.
The “Process Instruction” Step in Multi-Cycle (Recap)
As discussed last lecture, the transformation of the Architectural State (AS to AS’) for an instruction in a multi-cycle machine happens over a sequence of machine clock cycles, transitioning through programmer-invisible microarchitectural states (MS).
Multi-Cycle Realization: The Finite State Machine
The core idea for realizing a multi-cycle microarchitecture is to implement the instruction processing flow as a Finite State Machine (FSM). The FSM sequences through states, each corresponding to one machine clock cycle (or potentially multiple for variable-latency operations).
A Basic Multi-Cycle Microarchitecture: FSM Control
In a multi-cycle design controlled by an FSM:
- The conceptual instruction processing cycle (Fetch, Decode, Execute, etc.) is divided into discrete “states” or stages within the FSM. A single phase like “Fetch” can take multiple states (clock cycles).
- The microarchitecture sequences from state to state to process an instruction. The transitions in the FSM represent the sequence of operations for each instruction type.
- The behavior of the machine in any given state (clock cycle) is entirely determined by the control signals asserted in that state. These signals direct the datapath.
- The logic also determines the control signals needed for the next state (next clock cycle), enabling overlap of control and datapath operations across cycles.
Recall: LC-3 FSM as a Multi-Cycle Example
The LC-3 processor microarchitecture we introduced is a classic example of a multi-cycle design controlled by an FSM. We saw how fetching an instruction involved sequencing through multiple states (State 18/19 → 33 → 35) before decoding could even begin (State 32). Each state specifies a limited set of operations performed in that clock cycle.
Benefits of Multi-Cycle Design
Multi-cycle designs successfully address the core problems of single-cycle microarchitectures:
- Improved Critical Path Design: By breaking down complex operations into shorter states, we can reduce the longest combinational path, leading to a shorter clock cycle time and higher clock frequency.
- Optimized Common Case: The FSM allows frequently executed, simple instructions to take fewer cycles than complex ones. We can spend design effort optimizing the number of states for common instructions to improve overall performance.
- Balanced Hardware Utilization: Hardware resources (like ALUs or memory ports) can be reused across different cycles within an instruction’s execution sequence. This reduces the need for duplicating hardware, leading to a more balanced and cost-effective design.
Downsides of Multi-Cycle Design
Multi-cycle designs are not without their trade-offs:
- Hardware Overhead: Additional registers (microarchitectural state) are required to store intermediate results between cycles.
- Sequencing Overhead: There is a penalty associated with latching data into these registers at the end of each cycle (register setup/hold time), which is incurred multiple times per instruction.
Multi-Cycle LC-3 Data Path (Intermediate Registers)
The LC-3 datapath clearly shows these additional intermediate registers (MAR, MDR, IR) that are necessary to hold values between the clock cycles of an instruction’s execution. These are the overhead compared to a conceptual single-cycle implementation.
Performance Analysis: Multi-Cycle vs. Single-Cycle (Reiteration)
Execution Time = (Number of Instructions) × (Average Cycles Per Instruction) × (Clock Cycle Time)
- Single-Cycle: CPI = 1 (fixed). Clock Cycle Time = Long (determined by the slowest instruction).
- Multi-Cycle: CPI = Varies per instruction (average > 1). Clock Cycle Time = Short (determined by the slowest stage in the FSM).
The hope is that the significant reduction in clock cycle time in a multi-cycle design will outweigh the increase in average CPI, resulting in lower overall execution time for a program.
A Multi-Cycle Microarchitecture: Constructing the MIPS Design
Let’s now construct a multi-cycle MIPS microarchitecture, similar to the one in our readings, to see how these principles are applied.
Why Multi-Cycle? Key Optimizations
Reviewing our motivations:
- Reduce hardware cost: Use a single memory instead of two, and a single ALU/adder instead of three.
- Improve clock frequency: Set the clock cycle time by the slowest stage delay, not by the slowest instruction’s total delay.
- Improve performance: Simple instructions can take fewer cycles than complex ones.
Constructing the Multi-Cycle Datapath
We will build the multi-cycle MIPS datapath step-by-step, adding components as needed to support different instruction types, and ensuring resources can be reused across cycles.
We start by supporting the crucial Load Word (LW) instruction, which was the critical path bottleneck in our single-cycle analysis. Recall the steps for LW: Fetch Instruction, Read Registers, Calculate Address, Read Data Memory, Write to Register File. Each of these steps will roughly correspond to a state (or possibly more) in our FSM.
- Instruction Fetch (State 0): Fetch the instruction from memory using the PC. We add an Instruction Register (IR) to latch the fetched instruction at the end of this cycle. We also increment the PC, using the ALU, for the next instruction fetch.
- Decode and Register Fetch (State 1): Decode the instruction in the IR. Read the source registers (rs, rt) from the Register File, specified by fields in the IR. We add registers to latch these register values (let’s call them Reg1 and Reg2, or similar) at the end of this cycle. Sign-extend the immediate field from the IR.
- Execute / Address Calculation (State 2 for LW/SW): The ALU performs the base register + sign-extended offset calculation for LW/SW address computation, using the base register value latched in State 1 and the sign-extended immediate, also available from State 1. The result (the memory address) is latched in an intermediate register (the ALUOut register). This demonstrates reusing the ALU for address calculation.
- Memory Access (State 3 for LW, State 4 for SW): For LW, read the Data Memory using the address latched in State 2. The data read is latched in another intermediate register (let’s call it the Memory Data Register, or MDR). This demonstrates reusing the single memory for both instruction fetch and data access.
- Write Back (State 4 for LW): Write the data from the MDR (latched in State 3) into the destination register, specified by the rt field in the IR.
By breaking down LW into these cycles, we’ve used a single memory and a single ALU for its execution, adding intermediate registers to hold values between cycles. We’ve also shown how PC increment can be done using the same ALU in a different cycle.
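The five-state LW sequence above can be sketched cycle by cycle. This is a toy model, not the real datapath: the instruction fields (base register, destination, offset) are passed in directly instead of being decoded from the IR, and the memory and register contents in the example are hypothetical.

```python
def execute_lw(pc, regs, mem, base_reg, dest_reg, offset):
    """Walk one LW through the five multi-cycle states, latching the
    intermediate registers (IR, Reg1, ALUOut, MDR) between cycles."""
    ir = mem[pc]             # State 0: Fetch - latch instruction into IR (symbolic here)...
    pc = pc + 4              # ...and increment the PC using the ALU
    reg1 = regs[base_reg]    # State 1: Decode/Register Fetch - latch the base register
    alu_out = reg1 + offset  # State 2: Address Calculation - reuse the ALU, latch ALUOut
    mdr = mem[alu_out]       # State 3: Memory Access - reuse the single memory, latch MDR
    regs[dest_reg] = mdr     # State 4: Write Back - write MDR into the destination
    return pc, regs

# Hypothetical machine state: $s0 holds base address 100, mem[108] holds 42
mem = {0: "lw $t0, 8($s0)", 108: 42}
regs = {"$s0": 100, "$t0": 0}
pc, regs = execute_lw(0, regs, mem, base_reg="$s0", dest_reg="$t0", offset=8)
print(regs["$t0"])  # 42
```

Note how one memory serves both the fetch in State 0 and the data access in State 3, and one ALU serves both the PC increment and the address calculation, in different cycles.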
Other instructions are implemented similarly, using different sequences of states and control signals:
- Store Word (SW): Shares Fetch, Decode/Register Fetch, and Address Calculation (States 0-2) with LW. In State 4, the data to be written (from Reg2, latched back in State 1) is written to memory at the address latched in State 2 (ALUOut). Unlike LW, there is no write back to the register file.
- R-Type (Add, Sub, etc.): Fetch, Decode/Reg Fetch (States 0 & 1). Then Execute (State 2 for R-type): the ALU performs the operation on the values latched in State 1, and the result is latched in the ALUOut register. Then Write Back (State 3 for R-type): write ALUOut to the destination register (specified by the rd field in the IR, requiring a multiplexer to select rd vs. rt).
- Branch Equal (beq): Fetch, Decode/Reg Fetch (States 0 & 1). Then Execute (State 2 for Branch): the ALU subtracts the two register values latched in State 1 and generates the Zero flag; in parallel, an adder (or the ALU in a different cycle, if further optimized) calculates the branch target address (PC+4 + sign-extended, shifted immediate). Then Branch (State 3 for beq): check the Zero flag and the Branch control signal. If true, update the PC with the target address calculated in State 2; otherwise, the PC keeps the PC+4 value written during Fetch.
Complete Multi-Cycle MIPS Datapath
The full multi-cycle MIPS datapath combines the components and interconnections needed for all instruction types. It includes the intermediate registers (IR, Reg1, Reg2, ALUOut, MDR) and multiplexers to route data appropriately in each state.
Compared to the single-cycle datapath, we have:
- Added intermediate registers.
- Reduced the number of separate memories to one.
- Reduced the number of separate ALUs/adders to one.
- Added multiplexers to reuse components.
Constructing the Multi-Cycle Control Logic (FSM)
The control logic for a multi-cycle processor is a Finite State Machine (FSM). The FSM transitions from state to state, and in each state, it asserts the specific control signals required by the datapath for that stage of instruction processing. The next state is determined by the current state, the instruction opcode (and function code), and potentially some datapath conditions (like ALU Zero for branches, or memory Ready).
Main Controller FSM States
Let’s trace the FSM states for a few instruction types:
- Fetch (State 0):
- Actions: Put PC on memory address bus (IorD=0), Read memory (MemRead=1), Write to IR (IRWrite=1). Increment PC (using ALU, ALUSrcA=0, ALUSrcB=01, ALUOp=00, PCWrite=1).
- Next State: Unconditionally transition to Decode (State 1).
- Decode (State 1):
- Actions: Read registers specified by IR fields (RegRead=1, sources determined by muxes controlled by IR fields). Sign-extend immediate.
- Next State: Transition based on instruction opcode (multi-way branch). For example, go to Address Calculation State (State 2) for LW/SW, Execute State (State 6) for R-type, Branch State (State 8) for beq, Jump State (State 11) for j.
- Address Calculation (State 2 for LW/SW):
- Actions: ALU performs addition (ALUSrcA=1 (from Reg1), ALUSrcB=10 (sign-extended immediate), ALUOp=00). Result latched in ALUOut.
- Next State: Transition to Memory Read (State 3) for LW, or Memory Write (State 4) for SW.
- Memory Read (State 3 for LW):
- Actions: Put ALUOut (the address) on memory address bus (IorD=1), Read memory (MemRead=1). Result latched in MDR.
- Next State: Transition to Write Back (State 4).
- Memory Write (State 4 for SW):
- Actions: Put ALUOut (the address) on memory address bus (IorD=1), Put data from Reg2 (latched in State 1) on memory write data bus (via multiplexer not shown in FSM diagrams). Write to memory (MemWrite=1).
- Next State: Transition back to Fetch (State 0) to process the next instruction.
- R-Type Execute (State 6):
- Actions: ALU performs operation specified by function code (ALUSrcA=1 (from Reg1), ALUSrcB=00 (from Reg2), ALUOp determined by function code). Result latched in ALUOut.
- Next State: Transition to R-Type Write Back (State 7).
- R-Type Write Back (State 7):
- Actions: Write data from ALUOut to destination register (RegWrite=1, RegDst=1 (select rd field), MemtoReg=0 (select ALUOut)).
- Next State: Transition back to Fetch (State 0).
- Branch Equal (State 8):
- Actions: ALU subtracts the two register values latched in State 1 (ALUSrcA=1, ALUSrcB=00, ALUOp=01) and generates the Zero flag; the branch target address (PC+4 + sign-extended, shifted immediate) is also available.
- Next State: Transition back to Fetch (State 0) in either case. If Zero is true and the Branch signal is asserted, the PC is updated with the branch target address (PCSrc=1, conditional PCWrite); if Zero is false, the PC keeps the PC+4 value that was written during Fetch.
This finite state machine completely defines the execution of each instruction in the multi-cycle microarchitecture by sequencing through the appropriate states and asserting the necessary control signals in each state.
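Putting the transitions above together, the main-controller FSM can be sketched as a next-state table. State numbers follow this section, opcode names are symbolic, and control-signal details are omitted; note that State 4 doubles as LW's write back and SW's memory write in this numbering, and that beq resolves entirely in State 8 here.

```python
# Next-state table: (current_state, opcode) -> next state; None means the
# transition does not depend on the opcode.
NEXT_STATE = {
    (0, None): 1,                # Fetch -> Decode
    (1, "lw"): 2, (1, "sw"): 2,  # Decode -> Address Calculation
    (1, "rtype"): 6,             # Decode -> R-type Execute
    (1, "beq"): 8,               # Decode -> Branch
    (1, "j"): 11,                # Decode -> Jump
    (2, "lw"): 3,                # Address Calc -> Memory Read
    (2, "sw"): 4,                # Address Calc -> Memory Write
    (3, None): 4,                # Memory Read -> Write Back
    (4, None): 0,                # Write Back / Memory Write -> Fetch
    (6, None): 7,                # Execute -> R-type Write Back
    (7, None): 0,                # R-type Write Back -> Fetch
    (8, None): 0,                # Branch resolved -> Fetch
    (11, None): 0,               # Jump taken -> Fetch
}

def trace(opcode):
    """Return the sequence of FSM states one instruction passes through."""
    state, visited = 0, [0]
    while True:
        state = NEXT_STATE.get((state, opcode), NEXT_STATE.get((state, None)))
        if state == 0:
            return visited
        visited.append(state)

print(trace("lw"))  # [0, 1, 2, 3, 4] - five cycles, as constructed earlier
```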
Review: Complete Multi-Cycle MIPS Microarchitecture
This diagram shows the complete multi-cycle MIPS processor based on the design in our readings, integrating the datapath and the FSM control logic.
This MIPS multi-cycle FSM has 12 states (0-11) to implement the core instructions. Instructions take a variable number of cycles:
- R-type: Fetch (1) + Decode (1) + Execute (1) + Write Back (1) = 4 cycles
- LW: Fetch (1) + Decode (1) + Addr Calc (1) + Mem Read (1) + Write Back (1) = 5 cycles
- SW: Fetch (1) + Decode (1) + Addr Calc (1) + Mem Write (1) = 4 cycles
- Beq: Fetch (1) + Decode (1) + Execute (1) + Branch (1) = 4 cycles
- Jump: Fetch (1) + Decode (1) + Jump (1) = 3 cycles
(Note: CPI calculations here might slightly differ based on how states are grouped or counted, but the relative cycle counts show variability).
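To make the trade-off concrete, we can compute the average CPI for a hypothetical (assumed, not measured) instruction mix and compare time per instruction against the 600 ps single-cycle clock, taking the slowest multi-cycle stage to be a 200 ps memory access.

```python
# Hypothetical dynamic instruction mix; fractions sum to 1
MIX = {"lw": 0.25, "sw": 0.10, "rtype": 0.45, "beq": 0.15, "j": 0.05}
CYCLES = {"lw": 5, "sw": 4, "rtype": 4, "beq": 4, "j": 3}  # per the tally above

avg_cpi = sum(MIX[i] * CYCLES[i] for i in MIX)  # 4.2 for this mix

single_cycle_ps = 1 * 600        # CPI = 1, T = 600 ps
multi_cycle_ps = avg_cpi * 200   # CPI = 4.2, T = 200 ps -> 840 ps
```

Notice that with these particular numbers the multi-cycle design actually comes out slower per instruction (840 ps vs. 600 ps): every stage is padded to the 200 ps memory delay, so the CPI increase outweighs the faster clock. The multi-cycle approach wins when stages are better balanced, when common instructions skip states, or when memory latency is long and variable.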
Handling Variable-Latency Memory
A true multi-cycle design, like the LC-3 FSM, handles variable-latency memory by allowing the FSM to stay in the memory access state for multiple clock cycles until the memory indicates that the data is ready (e.g., via a “Memory Ready?” signal, R).
This “Memory Ready?” signal is an input to the control logic (microsequencer) that dictates the FSM’s next state transition. Until R is asserted, the FSM remains in the memory access state. This allows the clock cycle time to be fast, independent of the potentially long and variable memory latency.
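A sketch of this wait loop, using a toy memory model whose fixed latency and Ready (R) behavior are assumptions for illustration:

```python
class ToyMemory:
    """Memory that asserts the Ready (R) signal after a fixed latency."""
    def __init__(self, latency_cycles):
        self.latency = latency_cycles
        self.elapsed = 0

    def tick(self):
        """Advance one clock cycle; return the Ready (R) signal."""
        self.elapsed += 1
        return self.elapsed >= self.latency

def cycles_in_memory_state(mem):
    """The FSM stays in the memory-access state until R is asserted."""
    cycles = 0
    while True:
        cycles += 1      # spend one clock cycle in this state
        if mem.tick():   # sample R; leave the state once memory is ready
            return cycles

print(cycles_in_memory_state(ToyMemory(3)))  # 3
```

The clock period stays short regardless of how long the memory takes; a slow access simply keeps the FSM in the same state for more cycles.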
Microprogrammed Multi-Cycle Microarchitecture: A Deeper Look
The concept of a microprogrammed control unit provides an elegant way to realize the FSM control logic for a multi-cycle processor.
Microprogrammed Control Terminology
- Microinstruction: The set of control signals for a specific state in the FSM.
- Control Store: A memory that stores the microinstructions for all states.
- Microsequencer: Logic that determines the address of the next microinstruction (the next state) based on the current state, instruction bits, and datapath conditions.
- Microsequencing: The process of transitioning from one microinstruction (state) to the next.
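These pieces fit together as follows; the control store contents and field names below are illustrative stand-ins, not any real machine's microinstruction format.

```python
# Toy control store: each microinstruction bundles the control signals for one
# state with its sequencing information.
CONTROL_STORE = {
    0: {"signals": "fetch",   "next": 1},
    1: {"signals": "decode",  "next": "dispatch"},  # multi-way branch on opcode
    2: {"signals": "memaddr", "next": 3},
    3: {"signals": "memread", "next": 4},
    4: {"signals": "lw_wb",   "next": 0},
    6: {"signals": "execute", "next": 7},
    7: {"signals": "alu_wb",  "next": 0},
}
DISPATCH = {"lw": 2, "rtype": 6}  # opcode -> start of its micro-routine

def microsequence(opcode):
    """Microsequencer loop: fetch a microinstruction from the control store,
    emit its control signals, then compute the next control-store address."""
    addr, emitted = 0, []
    while True:
        micro = CONTROL_STORE[addr]
        emitted.append(micro["signals"])
        nxt = micro["next"]
        addr = DISPATCH[opcode] if nxt == "dispatch" else nxt
        if addr == 0:
            return emitted

print(microsequence("lw"))  # ['fetch', 'decode', 'memaddr', 'memread', 'lw_wb']
```

A real microsequencer would also take datapath conditions (Zero, Memory Ready) as inputs when choosing the next address; here only the opcode dispatch is modeled.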
Example Microprogrammed Control Structure
The microsequencer calculates the address of the next microinstruction, which is then fetched from the control store and applied as control signals to the datapath for the current cycle.
How the Microsequencer Determines the Next State
The microsequencer is the heart of the control flow in a microprogrammed processor. It determines the address of the next microinstruction (next state) based on inputs such as:
- Bits from the current microinstruction (e.g., jump address bits, branch enable bits).
- Bits from the Instruction Register (e.g., opcode, function code) to decode the instruction type and select the starting state sequence.
- Condition codes or status bits from the datapath (e.g., ALU Zero, Carry, Negative, Memory Ready) for conditional sequencing.
This flexible microsequencing allows the FSM to follow complex paths, including multi-way branches based on opcodes, conditional branches based on data results, and waiting for external events like memory readiness.
The Power of Abstraction (Microprogramming)
Microprogramming provides a powerful abstraction layer for the microarchitect. The ISA’s instructions are translated into sequences of microinstructions stored in the control store. This allows a relatively simple datapath to implement a complex ISA by sequencing through different micro-routines.
This view of microinstructions as a “user-invisible ISA” allows the hardware designer to implement new features or fix bugs by simply changing the microcode in the control store, potentially even after the processor has been shipped (if the control store is writable or patchable). This flexibility was a major advantage of early microprogrammed processors.
Multi-Cycle vs. Single-Cycle uArch: Final Summary
In summary, multi-cycle microarchitectures offer significant advantages over single-cycle designs by allowing for faster clock speeds (determined by stage delay), better hardware utilization (component reuse), and easier implementation of complex instructions. They achieve this by breaking instruction processing into multiple clock cycles, orchestrated by a Finite State Machine (often implemented with microprogrammed control), at the cost of added intermediate registers and sequencing overhead.
Segue into Pipelining
While multi-cycle designs improve upon single-cycle by reducing clock cycle time and allowing variable execution cycles, they still execute only one instruction at a time. In the next lecture, we will introduce Pipelining, a technique that significantly boosts performance by allowing multiple instructions to be processed concurrently in different stages of their execution.
Continue here: 12 Pipelining, Data Dependencies