Lecture from: 24.03.2023 | Video: YT
Today, we are shifting our focus significantly from the Instruction Set Architecture (ISA) to the Microarchitecture. If ISA is the contract between hardware and software, microarchitecture is the underlying implementation that fulfills that contract. We will explore the fundamental principles and design of microarchitectures, starting with the simplest forms and building towards more complex and performant designs.
Agenda for Today & Next Few Lectures
Our agenda continues to move down the “stack” from the higher levels of abstraction towards the physical implementation:
- Instruction Set Architectures (ISA): LC-3 and MIPS (Covered)
- Assembly programming: LC-3 and MIPS (Covered in Lecture 9b, relevant for labs)
- Microarchitecture (principles & single-cycle uarch) (Today)
- Multi-cycle microarchitecture (Today/Next)
- Pipelining (Next)
- Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery (Next)
- Out-of-Order Execution (Future)
Recall: ISA vs. Microarchitecture
Let’s briefly reiterate the core distinction between ISA and Microarchitecture:
- What is part of ISA vs. Uarch?
- Gas pedal: interface for “acceleration” (This is analogous to the ISA - the programmer-visible contract).
- Internals of the engine: implement “acceleration” (This is analogous to the Microarchitecture - the underlying implementation details, often hidden).
- The implementation (uarch) can vary as long as it satisfies the specification (ISA). Many different microarchitectures can implement the same ISA.
- Microarchitecture usually changes faster than ISA. This is because microarchitectural innovations drive performance improvements, while maintaining a stable ISA is crucial for software compatibility.
Microarchitecture
Microarchitecture is the implementation of the ISA under specific design constraints and goals. It encompasses anything done in hardware without exposure to software, although as discussed earlier, the boundary can sometimes be blurred depending on the ISA definition.
Examples of microarchitectural concepts include:
- Pipelining
- In-order versus out-of-order instruction execution
- Memory access scheduling policy
- Speculative execution
- Superscalar processing (executing multiple instructions per cycle)
- Clock gating
- Caching (Levels, size, associativity, replacement policy)
- Prefetching
- Voltage/frequency scaling
- Error correction
These are design decisions hardware architects make to achieve performance, power efficiency, reliability, etc., based on the ISA specification.
Property of ISA vs. Uarch? (Revisited)
Let’s quickly review which level these properties typically belong to:
- ADD instruction’s opcode: ISA (Programmer needs to know it)
- Type of adder used in the ALU (Bit-serial vs. Ripple-carry): Uarch (Implementation detail)
- Number of general purpose registers: ISA (Programmer needs to know how many registers they have)
- Number of cycles to execute the MUL instruction: Typically Uarch, unless the ISA exposes pipeline details (like VLIW or some historical ISAs).
- Number of ports to the register file: Uarch (Implementation detail of how registers are accessed)
- Whether or not the machine employs pipelined instruction execution: Typically Uarch, unless the ISA exposes pipeline details.
- Program counter: ISA (Programmer can manipulate it, PC-relative addressing depends on it).
Remember: Microarchitecture is the implementation of the ISA under specific design constraints and goals.
Design Point
A design point is defined by a set of design considerations and their importance. This point is determined by the “Problem” space (application space) and the intended users/market. The design point leads to trade-offs in both ISA and Uarch.
Example considerations include:
- Cost
- Performance
- Maximum power consumption, thermal limits
- Energy consumption (battery life)
- Availability
- Reliability and Correctness
- Time to Market
- Security, safety, predictability
Application Space
The application space is the set of tasks or workloads the computing system is designed for. As applications push boundaries, computing platforms become increasingly strained, driving the need for microarchitectural innovation.
- Dream, and they will appear… (Applications)
- Many other workloads: Genome analysis, Machine learning, Robotics, Web search, Graph analytics, etc.
This evolving application space constantly presents new challenges and dictates the required performance, power, and cost characteristics that microarchitects must strive for.
Tradeoffs: Soul of Computer Architecture
Computer architecture is the science and art of making the appropriate trade-offs to meet a design point. These trade-offs occur at multiple levels:
- ISA-level tradeoffs
- Microarchitecture-level tradeoffs
- System and Task-level tradeoffs (e.g., How to divide the labor between hardware and software)
Why is it (somewhat) art? Because we do not (fully) know the future (applications, users, market). The future is not constant; it changes! Changing demands from the top and changing issues and capabilities at the bottom constantly shift the landscape.
This is analogous to macro-architecture (like buildings). A mill originally built for one purpose can be later used as a theater + restaurant + conference room, or an electric works becomes a coffee shop, or a church becomes a brewery. The original design provides constraints and possibilities for future, unforeseen uses.
Implementing the ISA: Microarchitecture Basics
Now that we have established the context, let’s think about the practicalities. How do we implement an ISA? How do we design a system that obeys the hardware/software interface?
We will assume a predominantly “completely hardware” implementation for most lectures, focusing on how hardware units execute instructions.
How Does a Machine Process Instructions?
What does processing an instruction mean? It means transforming the Architectural State (AS) (the programmer-visible state like registers and memory) before an instruction is processed into the Architectural State Prime (AS’) after the instruction is processed. This transformation must strictly adhere to the ISA specification.
Recall: The Von Neumann Model/Architecture & Architectural State
Let’s quickly recall the Von Neumann model’s key properties: stored program and sequential instruction processing. The Architectural State consists of the programmer-visible components: Memory (array of storage locations indexed by address), Registers (given special names in the ISA), and the Program Counter (memory address of the current or next instruction).
Instructions (and programs) specify how to transform the values of this programmer-visible state.
The “Process Instruction” Step: ISA vs. Microarchitecture View
- ISA View: Specifies abstractly what AS’ should be, given an instruction and AS. It defines an abstract finite state machine where State = programmer-visible state and Next-state logic = instruction execution specification. From the ISA point of view, there are no “intermediate states” between AS and AS’ during instruction execution. One state transition per instruction.
- Microarchitecture View: Implements how AS is transformed to AS’. There are many choices in implementation. We can have programmer-invisible state to optimize speed. This involves multiple state transitions per instruction using intermediate, programmer-invisible states.
- Choice 1: AS → AS’ (Transform in a single clock cycle).
- Choice 2: AS → AS+MS1 → AS+MS2 → AS+MS3 → AS’ (Takes multiple clock cycles, using microarchitectural states MS1, MS2, MS3).
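To make the distinction concrete, here is a minimal Python sketch (an illustration, not part of the lecture) of an ADD instruction seen both ways: the ISA only requires that AS become AS', while a microarchitecture may pass through programmer-invisible intermediate state (MS) on the way.

```python
# Minimal sketch: the ISA specifies only the overall AS -> AS' transformation;
# how many internal steps it takes is a microarchitectural choice.

# Architectural state (AS): programmer-visible registers, memory, and PC.
AS = {"regs": [0] * 32, "mem": {}, "pc": 0}

def isa_view_add(state, rd, rs, rt):
    """ISA view: one atomic state transition AS -> AS' for ADD rd, rs, rt."""
    new_state = dict(state, regs=list(state["regs"]))
    new_state["regs"][rd] = (state["regs"][rs] + state["regs"][rt]) & 0xFFFFFFFF
    new_state["pc"] = state["pc"] + 4
    return new_state

def uarch_view_add(state, rd, rs, rt):
    """Uarch view: same AS', but computed via programmer-invisible
    microarchitectural state (latched operands, an ALU output register)."""
    ms1 = {"op_a": state["regs"][rs], "op_b": state["regs"][rt]}   # operand fetch
    ms2 = {"alu_out": (ms1["op_a"] + ms1["op_b"]) & 0xFFFFFFFF}    # execute
    new_state = dict(state, regs=list(state["regs"]))
    new_state["regs"][rd] = ms2["alu_out"]                          # writeback
    new_state["pc"] = state["pc"] + 4
    return new_state

AS["regs"][9], AS["regs"][10] = 3, 4
assert isa_view_add(AS, 8, 9, 10) == uarch_view_add(AS, 8, 9, 10)   # identical AS'
```

Both functions produce exactly the same AS', which is all the ISA demands; only the number of internal steps differs.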
A Very Basic Instruction Processing Engine: Single-Cycle Microarchitecture
The simplest microarchitectural approach is the Single-Cycle Machine. In this design:
- Each instruction takes a single clock cycle to execute.
- Only combinational logic is used to implement instruction execution.
- There are no intermediate, programmer-invisible state updates between the start and end of the instruction’s execution in that cycle.
The Architectural State (AS) at the beginning of a clock cycle is transformed by combinational logic to produce AS’ at the end of the clock cycle. At the rising edge of the clock, AS is updated to AS’.
What determines the clock cycle time in this single-cycle machine? The critical path (the longest delay path) of the combinational logic. What determines the critical path? The longest delay required by any instruction through the combinational logic.
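As a back-of-the-envelope illustration (with made-up delay numbers, not values from the lecture), the cycle time of a single-cycle machine is simply the maximum combinational delay over all instructions:

```python
# Hypothetical combinational delays (ps) per instruction class in a single-cycle design.
instruction_delay_ps = {"R-type": 400, "lw": 600, "sw": 550, "beq": 350, "j": 200}

# The clock period must cover the critical path, i.e., the slowest instruction.
clock_cycle_ps = max(instruction_delay_ps.values())   # 600 ps, set by lw
clock_freq_ghz = 1000 / clock_cycle_ps                # ~1.67 GHz

print(f"cycle time = {clock_cycle_ps} ps, frequency ~ {clock_freq_ghz:.2f} GHz")
```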
Single-cycle vs. Multi-cycle Machines (Comparison)
- Single-cycle machine:
- Each instruction takes a single clock cycle.
- All state updates made at the end of an instruction’s execution.
- Big disadvantage: The slowest instruction determines cycle time → long clock cycle time.
- Multi-cycle machine:
- Instruction processing broken into multiple cycles/stages.
- State updates can be made during an instruction’s execution (to microarchitectural state).
- Architectural state updates made at the end of an instruction’s execution (after all stages complete).
- Advantage over single-cycle: The slowest “stage” determines cycle time (not the slowest instruction overall) → short clock cycle time.
Both single-cycle and multi-cycle machines literally follow the Von Neumann model at the microarchitecture level (instructions finish sequentially).
Instruction Processing “Cycle” vs. Machine Clock Cycle
Let’s clarify terminology:
- Instruction Processing “Cycle”: The sequence of conceptual steps to process an instruction (Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store Result; these are the six phases in P&P). These are phases, not necessarily tied to a clock.
- Machine Clock Cycle: The fundamental time unit of the microarchitecture, determined by the clock frequency.
In a single-cycle machine, all phases of the instruction processing cycle take a single machine clock cycle to complete. In a multi-cycle machine, all phases can take multiple machine clock cycles to complete. In fact, each phase can take multiple clock cycles.
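To illustrate the difference with made-up per-phase delays (not values from the lecture): in a single-cycle machine the clock period must cover the sum of all phase delays, whereas in a simple multi-cycle machine with one phase per clock cycle it only needs to cover the slowest phase.

```python
# Hypothetical per-phase delays (ps) for the instruction processing phases.
phase_delay_ps = {"IF": 200, "ID/RF": 100, "EX": 200, "MEM": 200, "WB": 100}

# Single-cycle: every phase completes within ONE machine clock cycle,
# so the clock period is at least the sum of all phase delays.
single_cycle_period = sum(phase_delay_ps.values())   # 800 ps

# Multi-cycle (simplest split, one phase per machine clock cycle):
# the clock period only needs to cover the slowest phase.
multi_cycle_period = max(phase_delay_ps.values())    # 200 ps

# An instruction that needs all five phases then takes 5 x 200 = 1000 ps in the
# multi-cycle machine vs. 800 ps in the single-cycle machine; the payoff comes
# from instructions that skip phases (e.g., R-type skips MEM) and, later, from
# pipelining the short cycle.
print(single_cycle_period, 5 * multi_cycle_period)   # 800 vs 1000
```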
Instruction Processing Viewed Another Way: Datapath and Control
An instruction processing engine fundamentally consists of two components:
- Datapath: Hardware elements that deal with and transform data signals. Includes functional units (ALU), storage units (registers, memory), and hardware structures (wires, muxes, decoders, tri-state buffers) for data flow.
- Control Logic: Hardware elements that determine control signals. These signals specify what the datapath elements should do to the data, orchestrating the data flow according to the ISA.
Single-cycle vs. Multi-cycle: Control & Data Flow
- Single-cycle machine: Control signals are generated in the same clock cycle as the one during which data signals are operated on. Everything related to an instruction happens in one clock cycle (serialized processing within the cycle).
- Multi-cycle machine: Control signals needed in the next cycle can be generated in the current cycle. Latency of control processing can be overlapped with latency of datapath operation (more parallelism between control and data flow across cycles).
Many Ways of Datapath and Control Design
There are many ways to design the datapath and control logic, leading to different microarchitectural styles:
- Single-cycle, multi-cycle, pipelined datapath and control.
- Single-bus vs. multi-bus datapaths.
- Hardwired/combinational control vs. microcoded/microprogrammed control (control signals generated by logic vs. stored in memory).
Control signals and their structure depend heavily on the datapath design.
Flash-Forward: Performance Analysis
We measure processor performance (execution time) using the following formula:
- Execution time of a single instruction: {CPI} x {clock cycle time}, where CPI = Cycles Per Instruction.
- Execution time of an entire program: Sum over all instructions of [{CPI} x {clock cycle time}]
- Equivalent to: {# of instructions} x {Average CPI} x {clock cycle time}
For a single-cycle microarchitecture:
- CPI = 1 (strictly).
- Clock cycle time = long (determined by the slowest instruction).
For a multi-cycle microarchitecture:
- CPI = different for each instruction.
- Average CPI → hopefully small.
- Clock cycle time = short (determined by the slowest stage).
In multi-cycle, we have two degrees of freedom (CPI and clock cycle time) to optimize independently, unlike single-cycle (only clock cycle time optimization possible, but it’s tied to the slowest instruction).
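As a worked example with invented numbers (the instruction mix, delays, and per-instruction CPIs below are assumptions, not values from the lecture), here is the formula applied to both designs:

```python
# Execution time = {# of instructions} x {Average CPI} x {clock cycle time}.
# All numbers below are made up for illustration.
n_instructions = 1_000_000
instruction_mix = {"R-type": 0.45, "lw": 0.25, "sw": 0.10, "beq": 0.15, "j": 0.05}

# Single-cycle: CPI is exactly 1, cycle time set by the slowest instruction (say 600 ps).
t_single_ps = n_instructions * 1 * 600

# Multi-cycle: short cycle (say 150 ps, the slowest stage), but CPI varies per instruction.
cpi = {"R-type": 4, "lw": 5, "sw": 4, "beq": 3, "j": 3}
avg_cpi = sum(frac * cpi[i] for i, frac in instruction_mix.items())   # 4.05 here
t_multi_ps = n_instructions * avg_cpi * 150

print(f"average CPI (multi-cycle) = {avg_cpi:.2f}")
print(f"single-cycle: {t_single_ps/1e6:.0f} us, multi-cycle: {t_multi_ps/1e6:.1f} us")
```

With these particular numbers the two designs come out nearly even; the point is that the multi-cycle design gives two independent knobs (CPI and cycle time), and the short cycle time really pays off once the stages are pipelined.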
A Single-Cycle Microarchitecture From the Ground Up
Let’s build a single-cycle MIPS processor microarchitecture.
Let’s Start with the State Elements (MIPS)
The fundamental state elements we need are the Program Counter (PC), Instruction Memory, Register File, and Data Memory.
- Program counter: 32-bit register, gets updated every cycle.
- Instruction memory: Takes 32-bit address, outputs 32-bit instruction (Combinational read assumed for now).
- Register file: 32 registers, each 32-bit. Has 2 read ports and 1 write port. (Combinational read, synchronous write assumed for now).
- Data memory: Takes 32-bit address, reads/writes 32-bit data. Has a write enable (WE). (Combinational read, synchronous write assumed for now).
Assumption: For the initial design, we assume an ultra-fast memory and register file, with combinational reads (outputs change immediately with the address) and synchronous writes (updates happen on the clock edge). We are not using a “Memory Ready?” signal for now.
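A tiny Python model of these timing assumptions (an illustration, not a real simulator) for the register file with two combinational read ports and one synchronous write port might look like this:

```python
# Minimal sketch of the register file timing assumptions: combinational reads
# and a synchronous write that takes effect only on the clock edge.

class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32                       # 32 registers, 32 bits each

    def read(self, rs, rt):
        """Combinational read: 2 read ports, no clock involved."""
        return self.regs[rs], self.regs[rt]

    def clock_edge(self, write_enable, rd, write_data):
        """Synchronous write: 1 write port, updated only at the rising edge.
        Register 0 is hardwired to zero in MIPS."""
        if write_enable and rd != 0:
            self.regs[rd] = write_data & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_edge(write_enable=True, rd=8, write_data=42)   # $t0 <- 42
print(rf.read(8, 0))                                     # (42, 0)
```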
Instruction Processing: 5 Generic Steps (P&H)
We break down instruction processing into 5 conceptual steps (combining Decode and Register Fetch):
- Instruction fetch (IF)
- Instruction decode and register operand fetch (ID/RF)
- Execute/Evaluate memory address (EX/AG)
- Memory operand fetch (MEM)
- Store/writeback result (WB)
We need to design a datapath that supports the data flow for all instructions across these steps in a single clock cycle.
Designing the Datapath and Control Logic
We need to provide the Datapath and Control Logic to Execute All ISA Instructions. The control logic will generate the necessary control signals based on the instruction being executed to orchestrate the datapath.
What Is To Come: Single-Cycle MIPS Processor (Overall Diagram)
This is the complete single-cycle MIPS processor we will build towards. It includes the datapath elements and control signals (shown in orange).
Instructions
Single-Cycle Datapath for Arithmetic and Logical Instructions (R-Type & I-Type)
Let’s design the datapath to support R-Type and I-Type ALU instructions (like `add`, `sub`, `and`, `or`, `addi`, `andi`, `ori`).
- R-type: 3 register operands (`add $rd, $rs, $rt`). Reads two registers ($rs, $rt), performs an ALU operation, and writes the result to a destination register ($rd).
- I-type: 2 register operands and 1 immediate (`addi $rt, $rs, imm`). Reads one register ($rs), uses the sign-extended immediate, performs an ALU operation, and writes the result to a destination register ($rt).
MIPS Instruction Types
In MIPS, there are three main types of instructions based on their format and operation: R-type, I-type, and J-type. Each type serves a specific purpose in processing data and controlling program flow. Below is a breakdown of these types:
- R-Type (Register-type) Instructions:
  - Operate between registers, with results typically stored in a register.
  - Common operations: `add`, `sub`, `and`, `or`, `slt`, etc.
  - Format: `opcode | rs | rt | rd | shamt | funct`
  - Example: `add $t0, $t1, $t2` (adds the contents of `$t1` and `$t2` and stores the result in `$t0`).
- I-Type (Immediate-type) Instructions:
  - Involve a register and an immediate value (constant).
  - Common operations: `addi`, `andi`, `lw`, `sw`, `beq`, etc.
  - Format: `opcode | rs | rt | immediate`
  - Example: `addi $t0, $t1, 10` (adds 10 to `$t1` and stores the result in `$t0`).
- J-Type (Jump-type) Instructions:
  - Used for jumps in program flow.
  - Common operations: `j` (jump), `jal` (jump and link).
  - Format: `opcode | address`
  - Example: `j 0x00400000` (jumps to the instruction at address `0x00400000`).
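As a small illustration of these formats, the following sketch extracts the fields from a 32-bit instruction word (the bit positions are the standard MIPS ones; the function name is ours):

```python
# Sketch of extracting the MIPS fields from a 32-bit instruction word,
# following the R/I/J formats listed above.

def decode_fields(instr: int) -> dict:
    return {
        "opcode": (instr >> 26) & 0x3F,   # bits 31:26
        "rs":     (instr >> 21) & 0x1F,   # bits 25:21
        "rt":     (instr >> 16) & 0x1F,   # bits 20:16
        "rd":     (instr >> 11) & 0x1F,   # bits 15:11 (R-type)
        "shamt":  (instr >> 6)  & 0x1F,   # bits 10:6  (R-type)
        "funct":  instr & 0x3F,           # bits 5:0   (R-type)
        "imm":    instr & 0xFFFF,         # bits 15:0  (I-type)
        "addr":   instr & 0x03FFFFFF,     # bits 25:0  (J-type)
    }

# add $t0, $t1, $t2 -> opcode=0, rs=9 ($t1), rt=10 ($t2), rd=8 ($t0), funct=0x20
print(decode_fields(0x012A4020))
```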
The datapath needs:
- PC to fetch instructions.
- Instruction Memory to read instructions.
- Register File to read source operands ($rs, $rt) and write results ($rd or $rt).
- Sign Extension unit for I-type immediate.
- ALU to perform operations.
- Multiplexers to select ALU inputs (register or sign-extended immediate) and the register write data source.
- Control signals to direct these elements (RegWrite, ALUSrc, ALUOp, RegDst).
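A rough sketch of the two multiplexers and the sign-extension unit from the list above (the function names are ours; the signal names mirror the control signals but the encoding is a simplifying assumption):

```python
# Sketch of the ALUSrc and RegDst multiplexers plus sign extension.

def sign_extend_16(imm16: int) -> int:
    """Sign-extend a 16-bit immediate to a (signed) 32-bit value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def alu_src_mux(reg_rt_value: int, imm16: int, ALUSrc: int) -> int:
    """Second ALU operand: register value (R-type) or sign-extended immediate (I-type)."""
    return sign_extend_16(imm16) if ALUSrc else reg_rt_value

def reg_dst_mux(rt: int, rd: int, RegDst: int) -> int:
    """Destination register number: rd (R-type, bits 15:11) or rt (I-type, bits 20:16)."""
    return rd if RegDst else rt

# R-type add: ALUSrc=0 (use register), RegDst=1 (write $rd)
# addi:       ALUSrc=1 (use immediate), RegDst=0 (write $rt)
print(alu_src_mux(reg_rt_value=7, imm16=0xFFFE, ALUSrc=1))  # -2 after sign extension
print(reg_dst_mux(rt=10, rd=8, RegDst=1))                   # 8
```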
Single-Cycle Datapath for Data Movement Instructions (lw, sw)
Now let’s add support for load and store instructions.
- lw (Load Word): I-type (`lw $rt, offset($base)`). Reads the base register (`$base`), adds the sign-extended offset (the ALU performs the addition for address calculation), reads Data Memory at that address, and writes the memory data to the destination register (`$rt`).
- sw (Store Word): I-type (`sw $rt, offset($base)`). Reads the base register (`$base`), adds the sign-extended offset (the ALU performs the addition for address calculation), reads the source register (`$rt`) for data, and writes that data to Data Memory at the calculated address.
The datapath needs:
- The existing R/I-type datapath components.
- ALU output needs to connect to Data Memory address input.
- Data Memory read output needs to connect to the Register File write data input (via a multiplexer).
- Register File read output (for data in sw) needs to connect to the Data Memory write data input.
- Control signals (MemRead, MemWrite, MemtoReg, RegWrite, ALUSrc, ALUOp, RegDst).
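Putting the load/store pieces together, here is a hedged sketch of the data flow (memory is modeled as a Python dict of word addresses; all names are illustrative):

```python
# Sketch of the load/store data flow: the ALU computes base + sign-extended
# offset, data memory is accessed, and MemtoReg selects what is written back.

def sign_extend_16(imm16: int) -> int:
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def lw_sw_datapath(regs, mem, base, rt, offset, MemRead, MemWrite, MemtoReg):
    address = (regs[base] + sign_extend_16(offset)) & 0xFFFFFFFF  # ALU: address calc
    read_data = mem.get(address, 0) if MemRead else None          # data memory read
    if MemWrite:                                                   # sw: store register rt
        mem[address] = regs[rt]
    if MemtoReg and read_data is not None:                         # lw: write memory data back
        regs[rt] = read_data

regs = [0] * 32
mem = {}
regs[9] = 0x1000                      # $t1 holds the base address
regs[10] = 0xDEADBEEF                 # $t2 holds the data to store
lw_sw_datapath(regs, mem, base=9, rt=10, offset=4, MemRead=0, MemWrite=1, MemtoReg=0)  # sw $t2, 4($t1)
lw_sw_datapath(regs, mem, base=9, rt=8,  offset=4, MemRead=1, MemWrite=0, MemtoReg=1)  # lw $t0, 4($t1)
print(hex(regs[8]))                   # 0xdeadbeef
```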
Single-Cycle Datapath for Control Flow Instructions (j, beq)
Finally, let’s add control flow instructions.
- j (Jump): J-type (`j target`). Computes the jump target address (the upper bits of PC+4 concatenated with the 26-bit jump immediate shifted left by 2) and updates the PC to that target.
- beq (Branch if Equal): I-type (`beq $rs, $rt, offset`). Reads two registers (`$rs`, `$rt`), compares them (the ALU performs a subtraction/comparison), and computes the branch target address (PC+4 plus the sign-extended offset shifted left by 2). If the registers are equal (the ALU Zero output is asserted), the PC is updated to the branch target; otherwise, it is updated to PC+4.
The datapath needs:
- Jump address calculation logic.
- Branch target address calculation logic (using the ALU for comparison and an Adder for target calculation).
- Multiplexer at the PC input to select the next PC source (PC+4, Jump target, Branch target).
- Control signals (PCSrc, Branch).
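A sketch of the next-PC selection logic implied by this list (the jump and branch target arithmetic follows the standard MIPS definitions; the explicit Jump signal and the function name are illustrative):

```python
# Sketch of next-PC selection for the single-cycle datapath.

def next_pc(pc, imm16, jump_addr26, Branch, Zero, Jump):
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    # Branch target: PC+4 plus the sign-extended offset shifted left by 2.
    offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    branch_target = (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF
    # Jump target: upper 4 bits of PC+4 concatenated with the 26-bit address << 2.
    jump_target = (pc_plus_4 & 0xF0000000) | (jump_addr26 << 2)

    if Jump:
        return jump_target
    if Branch and Zero:          # beq taken only when the ALU says the operands are equal
        return branch_target
    return pc_plus_4

print(hex(next_pc(0x00400000, imm16=0x0003, jump_addr26=0, Branch=1, Zero=1, Jump=0)))
# 0x00400010: PC+4 (0x00400004) + (3 << 2)
```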
Putting It All Together: Complete Single-Cycle Datapath
Combining all these instruction types requires building a unified datapath with appropriate multiplexers and control signals.
Single-Cycle Control Logic: Hardwired Control
Now that we have the datapath, we need to design the Control Logic. In a single-cycle hardwired design, the control signals are generated as a combinational function of the instruction bits (primarily the opcode) and potentially some datapath status signals (like the ALU Zero output for branches).
We need to determine the value of each control signal for each instruction type.
Single-Bit Control Signals (Tables)
We can create a table mapping opcodes (and relevant function codes for R-type, etc.) to the required value for each control signal (asserted=1, de-asserted=0, or value).
- RegWrite: Asserted for instructions that write to the register file (R-type, lw, addi, etc.), de-asserted otherwise (sw, beq, j, etc.).
- RegDst: Selects the destination register ID source (R-type uses bits 15:11, I-type uses bits 20:16).
- ALUSrc: Selects the second ALU input (Register file output for R-type, Sign-extended immediate for I-type ALU and Load/Store address calculation).
- Branch: Asserted for branch instructions.
- MemWrite: Asserted for store instructions (sw).
- MemtoReg: Selects the source for register write data (ALU result or Data Memory read data). Asserted for load instructions (lw), de-asserted otherwise (R-type, I-type ALU, addi, etc.).
- ALUOp: Determines the specific ALU operation (Add, Subtract, AND, OR, etc.). This can be a direct mapping for I-type ALU/Load/Store address calculation, or dependent on the R-type function code.
- MemRead: Asserted for load instructions (lw).
- PCSrc: Selects the next PC source (PC+4, Branch target, Jump target). Depends on instruction type and branch condition.
This hardwired control unit can be complex as the number of instructions and control signals grows, but for simple ISAs, it is feasible.
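As a sketch of what such a hardwired main decoder boils down to, here is an opcode-to-control-signal lookup for a handful of instructions (the opcodes are the real MIPS values; encoding ALUOp as a small string and setting don't-care signals to 0 are simplifying assumptions):

```python
# Sketch of a hardwired main decoder: a combinational mapping from opcode
# to the control signals discussed above.

CONTROL = {
    0x00: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemRead=0, MemWrite=0, Branch=0, ALUOp="funct"),  # R-type
    0x23: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1, MemWrite=0, Branch=0, ALUOp="add"),    # lw
    0x2B: dict(RegDst=0, ALUSrc=1, MemtoReg=0, RegWrite=0, MemRead=0, MemWrite=1, Branch=0, ALUOp="add"),    # sw
    0x04: dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0, MemRead=0, MemWrite=0, Branch=1, ALUOp="sub"),    # beq
    0x08: dict(RegDst=0, ALUSrc=1, MemtoReg=0, RegWrite=1, MemRead=0, MemWrite=0, Branch=0, ALUOp="add"),    # addi
}

def control_signals(opcode: int) -> dict:
    """Combinational function: instruction opcode -> control signal values."""
    return CONTROL[opcode]

print(control_signals(0x23))  # control signals for lw
```

For R-type instructions the ALU control is further refined by the funct field, which is why ALUOp is shown as "funct" rather than a concrete operation.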
Continue here: 11 Multi-Cycle Microarchitecture Design