Lecture from: 11.03.2024 | Video: Video ETHZ

Digital Character AI

The goal of Digital Character AI is to create intelligent, interactive figures capable of engaging in conversations and exhibiting realistic behaviors. These figures can range from photorealistic representations of real people to stylized artistic interpretations.

Historical Context: Eliza

The journey towards intelligent digital characters began with Eliza.

Eliza, one of the earliest such systems, used pattern-matching techniques to simulate conversation.

  • Explanation of the pattern-matching techniques that Eliza employed.

Modern Embodiments: Digital Einstein

A modern example of a sophisticated Digital Character AI is the Digital Einstein project.

This project provides a real-time VR system to simulate conversations with Albert Einstein.

Key Components of a Digital Character AI

Creating a truly interactive digital character requires a complex interplay of several components:

  • Display: The technology used to render the character (e.g., VR headset, screen).
  • 3D Character and Animation/Visualization: Creating a realistic or stylized 3D model and animating its movements.
  • Audio and Language Processing: Handling speech recognition, text-to-speech, and natural language understanding.
  • Dialogue Tree and Advanced Algorithms: Constructing a dialogue structure and implementing algorithms to guide the conversation (similar to ChatGPT).
  • Intention Detection: Interpreting non-verbal cues such as gestures and facial expressions to understand the user’s intent.

Okay, let’s continue our exploration of parallel architectures, building upon the concepts of locality and Instruction Level Parallelism (ILP) that we touched on previously. Remember, the overarching goal is to understand how hardware is designed to execute programs faster.

Recap: Locality and ILP

Briefly, let’s recap two key ideas:

  1. Locality and Caches: Jumps in memory access patterns are detrimental to cache performance. Caches work best when data is accessed sequentially or in close proximity. Large, unpredictable jumps force the cache to constantly fetch new data from main memory, slowing things down considerably. Think of it like reading a book – it’s much faster if you read pages sequentially than if you constantly flip back and forth between distant chapters.

  2. Locality and ILP: Jumps in the code itself (like conditional branches) hinder Instruction Level Parallelism (ILP). Modern CPUs try to execute multiple instructions concurrently. However, a branch instruction creates uncertainty. The CPU doesn’t know which path the code will take (the “if” branch or the “else” branch) until the branch condition is evaluated. This uncertainty can stall the pipeline, preventing the CPU from achieving maximum parallelism. It’s like a factory assembly line – if you suddenly change the instructions, the line has to pause while it figures out what to do next.

Reordering Instructions

Now, let’s introduce another powerful technique that CPUs use to improve performance: instruction reordering.

Consider this simple C code snippet: inside a function bar, the variable result is assigned the sum of a variable something and a variable done; afterwards, done is set to 1.

At first glance, it seems like these instructions must be executed in the order they appear in the code. However, a clever compiler (especially with optimizations enabled, like the -O3 flag in GCC) might realize that it can rearrange things.

Looking carefully at the compiled output, we can see that the compiler has reordered the instructions:

  1. Load done: The value of done is loaded into a register before something is even touched.
  2. Add something: The value of something is added to the register (which now holds the initial value of done).
  3. Set done to 1: The done variable is set to 1.
  4. Store to result: The final result is stored in the result variable.

Why is this beneficial? The key is dependency analysis. The compiler (or the CPU itself, in some cases) recognizes that the initial value of done and the value of something are independent. They don’t rely on each other. Therefore, the CPU can start working on loading done while it’s also fetching something. This overlap hides some of the latency associated with memory access.

This reordering is a form of Out-of-Order Execution (OoOE). The CPU executes instructions not strictly in the order they appear in the program, but in an order that maximizes efficiency, as long as the final result is the same.

The Importance of Memory Models

However, this freedom to reorder instructions introduces a critical challenge, especially in the context of multithreaded programs.

Imagine another thread is also accessing and modifying the done variable. If we reorder instructions carelessly, we might get incorrect results due to race conditions and unexpected interleavings of operations from different threads.
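To make the danger concrete, here is a minimal Java sketch (my own illustration, not from the lecture) that mirrors the reordered example above: one thread publishes a result and then raises a done flag, while another thread spins on the flag. With plain fields, the compiler or CPU is allowed to reorder or delay the writes; declaring the flag volatile (one of Java’s synchronization primitives) restores the intended ordering and visibility.

```java
// Minimal sketch (assumed example, not the lecture's code): publishing a result
// via a "done" flag. Without `volatile`, the write to `done` may become visible
// before the write to `result`, so the reader could observe a stale result.
class ReorderingDemo {
    static int result = 0;
    static volatile boolean done = false; // remove `volatile` and the program may misbehave

    static void publish(int something) {
        result = something + 42; // (1) compute and store the value
        done = true;             // (2) signal that the value is ready
    }

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!done) { /* busy-wait until the flag is raised */ }
            System.out.println("result = " + result); // with volatile: always 49
        });
        reader.start();
        publish(7);
        reader.join();
    }
}
```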

This is where memory models come into play. A memory model is a formal specification that defines precisely what reorderings are allowed and how memory operations from different threads interact.

  • Hardware Designers: Create the memory model, defining the rules of the game for how the hardware can reorder memory accesses.
  • Compiler Designers: Use this memory model to generate correct code, ensuring that the compiler’s optimizations don’t violate the hardware’s rules.
  • Programmers: (Ideally) Understand the memory model (at least at a high level) to write correct multithreaded code, using synchronization primitives (like mutexes or atomic operations) when necessary to prevent data races.

The memory model essentially provides a contract between the hardware and the software. It allows for optimization (reordering) while guaranteeing correctness.

By allowing reordering, we unlock a large “space” of possible execution orderings. We have moved away from the very strict model of sequential consistency and toward weaker (“relaxed”) memory models.

Pipelining: The Assembly Line of Computation

Let’s shift gears and talk about another fundamental technique for achieving parallelism: pipelining.

Think of an assembly line in a factory. Instead of building one car from start to finish before starting the next, the assembly line breaks the process down into stages. One station might install the engine, another the wheels, and so on. Multiple cars are in different stages of completion simultaneously.

Pipelining in a CPU is analogous to this. Instead of executing one instruction completely before starting the next, the CPU breaks down instruction execution into stages:

  1. Fetch: Get the next instruction from memory.
  2. Decode: Figure out what the instruction means (e.g., add, subtract, load, store).
  3. Execute: Perform the actual operation (e.g., add two numbers).
  4. Memory Access: If the instruction involves memory, access it (load or store data).
  5. Write Back: Write the result back to a register.

Each of these stages is handled by a separate piece of hardware within the CPU. While one instruction is being fetched, another can be decoded, another executed, and so on.

The Washing Clothes Analogy

Let’s illustrate this with a simple analogy: washing clothes.

We have four stages:

  1. Wash: Put the clothes in the washing machine.
  2. Dry: Transfer them to the dryer.
  3. Fold: Fold the clean clothes.
  4. Store: Put them away in the closet.

If we do one load of laundry at a time, we have to wait for each stage to complete before starting the next. But if we have multiple loads, we can pipeline the process:

As soon as the first load is finished washing, we can move it to the dryer and start washing the second load. While the second load is washing, the first load is drying. We’re utilizing our resources (washing machine, dryer) more efficiently.

Balanced Pipelines

A balanced pipeline is one where each stage takes roughly the same amount of time. This is ideal because it avoids bottlenecks. If one stage is much slower than the others, it will hold up the entire pipeline.

This graph shows a balanced pipeline. Each row represents a different “input” (like a load of laundry), and each column represents a time unit. Notice how the stages overlap, and the pipeline is fully utilized after an initial “lead-in” period.

Throughput and Latency

Two key metrics characterize pipeline performance:

  • Throughput: The rate at which work is completed. In our laundry example, it’s the number of loads of laundry finished per hour. In a CPU, it’s the number of instructions completed per second.

  • Latency: The time it takes for a single task to complete. In our laundry example, it’s the time it takes to wash, dry, fold, and store one load of laundry. In a CPU, it’s the time it takes to execute a single instruction.

The throughput bound is determined by the slowest stage in the pipeline. No matter how fast the other stages are, the overall throughput cannot exceed the rate at which the slowest stage can process work:

$$\text{throughput bound} = \frac{1}{\max(\text{computationtime}(\text{stages}))}$$

Here, $\max(\text{computationtime}(\text{stages}))$ is the time taken by the slowest stage.

The latency bound is the sum of the times taken by all the stages. Even though the pipeline processes multiple tasks concurrently, each individual task still has to go through every stage:

$$\text{latency bound} = \sum \text{computationtime}(\text{stages})$$

Unbalanced Pipelines and Optimization

Let’s revisit our laundry example, but this time with unbalanced stage times:

  1. Wash: 5 seconds
  2. Dry: 10 seconds
  3. Fold: 5 seconds
  4. Store: 10 seconds

If we simply pipeline this as is, the dry and store stages become bottlenecks. The washing machine and the folding station will be idle much of the time, waiting for the slower stages to finish. With these stage times, washing all of the example’s loads initially takes 70 seconds.

One way to “balance” the pipeline is to make all stages take as long as the slowest stage (10 seconds in this case). This will reduce the idle time, but it also increases the latency of each individual load.

Now the total time for all loads is 80 seconds.

A better approach is often to subdivide the slower stages. For example, instead of one 10-second dry stage we use two shorter drying sub-stages, d1 taking 4 seconds and d2 taking 6 seconds, and we split the 10-second store stage in the same way.

This gives us a new set of stage times: six (sub-)stages instead of four, none longer than 6 seconds.

For the latency analysis we take the maximum remaining stage time: every stage now advances at the pace of the slowest one (6 seconds), so a single load takes 6 × 6 = 36 seconds, and one load completes every 6 seconds once the pipeline is full. Updating the table accordingly:

Now the total time is 60 seconds.
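As a sanity check (my own arithmetic; it assumes the example pipelines five loads, which is consistent with the 70 s, 80 s and 60 s totals above), the completion time of a synchronous pipeline in which every stage is padded to the cycle time is:

$$T_{\text{total}} = (\#\text{stages} + \#\text{loads} - 1)\cdot t_{\text{cycle}}$$

$$\text{balanced: } (4 + 5 - 1)\cdot 10\,\text{s} = 80\,\text{s}, \qquad \text{subdivided: } (6 + 5 - 1)\cdot 6\,\text{s} = 60\,\text{s}$$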

This illustrates a key trade-off:

  • Increasing the number of pipeline stages (making them more fine-grained) can improve throughput, but it often increases latency.

The CPU Pipeline: A Real-World Example

Let’s connect this back to the CPU. A classic RISC (Reduced Instruction Set Computing) CPU pipeline typically has five stages:

  1. Instruction Fetch (IF): Get the next instruction from memory, using the program counter (instruction pointer). Modern CPUs might prefetch multiple instructions to anticipate future execution.
  2. Instruction Decode (ID): Decode the instruction’s bit pattern (e.g., 01101...) into a meaningful operation (e.g., ADD R1, R2). This stage also identifies the registers involved.
  3. Execute (EX): Perform the actual operation (e.g., add the contents of two registers). This stage might involve the Arithmetic Logic Unit (ALU).
  4. Memory Access (MEM): If the instruction involves memory (e.g., loading data from memory or storing data to memory), perform the memory access.
  5. Write Back (WB): Write the result of the operation back to a register. This might also involve updating status flags (e.g., carry, overflow).

By pipelining these stages, the CPU can achieve a significant speedup, even for sequential programs. While one instruction is being fetched, another can be decoded, and so on. This is a key example of how hardware parallelism is used to improve the performance of seemingly sequential code.
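A quick back-of-the-envelope calculation (mine, assuming one clock cycle per stage and no stalls) shows where the speedup comes from: without pipelining, $N$ instructions take $5N$ cycles on this 5-stage machine; with pipelining, the pipeline needs 4 cycles to fill and then completes one instruction per cycle:

$$\text{speedup} = \frac{5N}{N + 4} \;\xrightarrow{\;N \to \infty\;}\; 5$$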

Parallel Programming: Basic Concepts

We’re now moving beyond the hardware details and starting to think about how we, as programmers, can express parallelism in our code. How do we take a problem and break it down so that it can be solved by multiple processors (or cores) simultaneously?

Recap: The Big Picture (Part I)

Let’s briefly recall the high-level view of a parallel system. We have:

  • Physical Memory: The actual RAM in the computer.
  • Processes (e.g., JVM): Isolated execution environments. Each process has its own memory space.
  • JVM Scheduler: The Java Virtual Machine’s scheduler, which manages JVM threads.
  • JVM Threads: Threads within a single JVM process. These share the process’s memory space (making communication easier, but also requiring careful synchronization).
  • OS (Operating System): Manages resources and schedules processes and OS threads.
  • OS Scheduler: The operating system’s scheduler, which assigns OS threads to CPU cores.
  • CPU Cores: The actual processing units that execute instructions.
  • Virtual Threads: A lightweight mechanism that maps many virtual threads onto a few OS threads.

This hierarchy, from physical memory down to individual CPU cores, is what we’re working with. Our goal is to write programs that effectively utilize these resources.

Expressing Parallelism: Work Partitioning

The first key concept is work partitioning. This is the process of taking a single program’s workload and splitting it up into parallel tasks. A task is a unit of work that can be executed independently by a single processor (or thread).

There are two main approaches to work partitioning:

  1. Explicit/Manual (Task/Thread Parallelism): The programmer explicitly defines the tasks and/or threads in their code. This gives the programmer fine-grained control, but it also places the burden of managing parallelism on them.

    • Example: Creating multiple Thread objects in Java and assigning each thread a specific piece of work (see the sketch after this list).
  2. Implicit Parallelism: The system (e.g., the compiler, runtime library, or even the hardware) automatically identifies and exploits parallelism. The programmer expresses the desired operation at a higher level, and the system takes care of the details.

    • Example: Using a parallel for loop in a language like OpenMP, where the compiler automatically divides the loop iterations among multiple threads. Data parallelism (applying the same operation to every element of a collection) is another example.
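The following is a minimal Java sketch (my own illustration, not the lecture’s code) of explicit work partitioning: the array is split into one contiguous chunk per thread, each chunk is a task, and the main thread combines the partial results.

```java
import java.util.Arrays;

// Minimal sketch (assumed example): explicitly partitioning an array sum
// into one chunk per thread and combining the partial results afterwards.
class PartitionedSum {
    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);

        int nThreads = Runtime.getRuntime().availableProcessors();
        int chunk = (data.length + nThreads - 1) / nThreads; // size of one task
        long[] partial = new long[nThreads];                 // one result slot per thread
        Thread[] threads = new Thread[nThreads];

        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            final int from = id * chunk;
            final int to = Math.min(data.length, from + chunk);
            threads[t] = new Thread(() -> {
                long s = 0;
                for (int i = from; i < to; i++) s += data[i];
                partial[id] = s; // no shared writes: each thread owns its slot
            });
            threads[t].start();
        }

        long total = 0;
        for (int t = 0; t < nThreads; t++) {
            threads[t].join();   // wait until the task has finished
            total += partial[t];
        }
        System.out.println("sum = " + total); // prints 1000000
    }
}
```

Here the partitioning (chunking the array) is done by the programmer, while the scheduler decides when each thread actually runs on a core.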

Work Partitioning and Scheduling

Work partitioning is the what (what are the independent tasks?), and scheduling is the how (how are those tasks assigned to processors?).

  • Work Partitioning:

    • Splitting up work into parallel tasks/threads.
    • Typically done by the programmer (in the case of explicit parallelism).
    • A task represents a unit of work.
    • Also called task/thread decomposition.
  • Scheduling:

    • Assigning tasks to processors (or cores).
    • Typically done by the system (operating system or runtime library).
    • The goal is full utilization: keeping all processors busy as much as possible, to maximize performance. No processor should ever be idle.

Ideally, the number of chunks (tasks) created during work partitioning should be larger than the number of processors. This gives the scheduler more flexibility to distribute the work evenly and keep all processors busy.
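A hypothetical illustration of why more chunks than processors helps (numbers are mine): suppose 9 units of work are split over 8 cores. With exactly 8 chunks, one unlucky chunk of size 2 determines the finishing time; with many small chunks, the scheduler can rebalance and all cores finish almost simultaneously.

$$\text{8 coarse chunks: } T \approx 2 \text{ units}, \qquad \text{many fine chunks: } T \approx \tfrac{9}{8} \approx 1.13 \text{ units}$$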

Task/Thread Granularity

Granularity refers to the size of the tasks.

  • Coarse Granularity: A few large tasks. Think of dividing a large project among a small team of people.
  • Fine Granularity: Many small tasks. Think of dividing a large project among a large crowd of people, where each person does a very small piece.
    • More Portable: Can be executed more easily on machines with a larger number of processors. If you have many small tasks, it’s easier to distribute them across many cores.
    • Better for Scheduling: Gives the scheduler more flexibility to balance the load.
    • But: If the overhead of scheduling and managing a task is comparable to the time it takes to execute the task, then the overhead dominates, and performance suffers.

Task Granularity Guidelines:

  • Tasks should be as small as possible (to maximize parallelism).
  • But, they should be significantly larger than the scheduling overhead. System designers strive to make these overheads as small as possible, but they can never be completely eliminated.

The ideal granularity is a trade-off. Too coarse, and you don’t exploit all available parallelism. Too fine, and the overhead kills performance.
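A rough way to quantify this trade-off (my own illustration with hypothetical numbers): if each task carries a fixed scheduling/management overhead $o$ and performs useful work $w$, the useful fraction of the total time is $\frac{w}{w + o}$.

$$o = 1\,\mu\text{s}: \quad w = 100\,\mu\text{s} \Rightarrow \tfrac{100}{101} \approx 99\%\ \text{useful}, \qquad w = 1\,\mu\text{s} \Rightarrow \tfrac{1}{2} = 50\%\ \text{useful}$$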

Scalability

Scalability is a somewhat overloaded term, but in the context of parallel programming, it refers to how well a program’s performance improves as we increase the number of processors.

  • Speedup: We want to see a speedup when we add more processors.
  • Ideal: What happens if we have an infinite number of processors ($p \to \infty$)?
  • Linear Speedup: A program scales linearly if doubling the number of processors doubles the performance. This is the ideal, but rarely achievable in practice.

Parallel Performance Metrics

Let’s define some key metrics:

  • $T_1$: Sequential execution time (the time it takes to run the program on a single processor).
  • $T_p$: Execution time on $p$ processors.

Ideally, $T_p = T_1 / p$. This is perfect speedup – we divide the work perfectly among the processors.

In reality, $T_p > T_1 / p$. We always have some performance loss due to overhead (communication, synchronization, etc.).

It’s theoretically possible (though rare) to have $T_p < T_1 / p$. This is called super-linear speedup, and it usually indicates that something unusual is happening (e.g., the parallel version is exploiting cache effects more effectively than the sequential version).

  • Speedup ($S_p$): $S_p = T_1 / T_p$. This measures how much faster the parallel version is compared to the sequential version.
    • $S_p = p$: Linear speedup (perfection).
    • $S_p < p$: Sub-linear speedup (performance loss).
    • $S_p > p$: Super-linear speedup (sorcery!).
  • Efficiency ($E_p$): $E_p = S_p / p$. Perfect efficiency is $E_p = 1$; values below 1 indicate lost performance.
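A small numerical example (values are hypothetical): suppose the sequential program takes $T_1 = 12\,\text{s}$ and the parallel version takes $T_4 = 4\,\text{s}$ on $p = 4$ processors. Then:

$$S_4 = \frac{T_1}{T_4} = \frac{12}{4} = 3, \qquad E_4 = \frac{S_4}{p} = \frac{3}{4} = 0.75$$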

There’s a key distinction to be aware of:

  • Relative Speedup: Compares the parallel execution time ($T_p$) to the execution time of the serialized version of the same parallel algorithm ($T_1$). This is the most common definition.
  • Absolute Speedup: Compares the parallel execution time ($T_p$) to the execution time of the best known sequential algorithm for the problem. This is a fairer comparison, because sometimes the best sequential algorithm is different from the parallel algorithm. Using a poor baseline (a slow sequential algorithm) can artificially inflate the speedup and efficiency.

The above is a visual representation of linear, sequential and parallel speedup.

Why Don’t We Achieve Perfect Speedup ($S_p = p$)?

There are several reasons why we rarely achieve perfect linear speedup:

  1. Insufficient Parallelism: The program itself may not contain enough parallelism. Some parts of the program might be inherently sequential.
  2. Overheads: Parallelization introduces overheads, primarily due to:
    • Communication: Processors (or threads) need to communicate to exchange data and coordinate their work.
    • Synchronization: Mechanisms like mutexes and locks are needed to protect shared data, but they introduce overhead.
  3. Architectural Limitations: Even if the program has enough parallelism and the overheads are low, the hardware itself might have limitations, such as:
    • Memory Contention: Multiple processors trying to access the same memory location simultaneously can create bottlenecks.
    • Limited Bandwidth: The communication channels between processors (or between processors and memory) have limited bandwidth.

Amdahl’s Law: The Limit of Parallel Speedup

Let’s consider a parallel program where:

  • 20% of the program is sequential (cannot be parallelized).
  • 80% of the program is parallel (and we assume it scales perfectly).
  • $T_1$ is the total execution time on a single processor.

What is $T_8$ (the execution time on 8 processors)? What is the speedup $S_8$?

  • The sequential portion will always take 20% of the sequential time: $0.2\,T_1$.
  • The parallel portion is split across the 8 processors: $0.8\,T_1 / 8 = 0.1\,T_1$.
  • $T_8 = 0.2\,T_1 + 0.1\,T_1 = 0.3\,T_1$.
  • $S_8 = T_1 / T_8 = 1 / 0.3 \approx 3.33$.

Notice that even though 80% of the program scales perfectly, we only achieve a speedup of 3.33 on 8 processors. The sequential portion (20%) limits the overall speedup.

This leads us to Amdahl’s Law, a fundamental principle in parallel computing:

“…the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude.” - Gene Amdahl

Amdahl’s Law formalizes the observation that the sequential portion of a program limits the potential speedup achievable through parallelism.

Let:

  • $W_{\text{ser}}$: Time spent doing non-parallelizable (serial) work.
  • $W_{\text{par}}$: Time spent doing parallelizable work.

Then:

  • $T_1 = W_{\text{ser}} + W_{\text{par}}$ (total sequential execution time).
  • $T_p \ge W_{\text{ser}} + \dfrac{W_{\text{par}}}{p}$ (execution time on $p$ processors). The parallelizable work is divided among the $p$ processors, but the serial work remains unchanged.

From this, we can derive Amdahl’s Law:

Let $f$ be the fraction of the total work that is non-parallelizable (serial). Then:

  • $W_{\text{ser}} = f \cdot T_1$.
  • $W_{\text{par}} = (1 - f) \cdot T_1$.

Substituting these into the expression for $T_p$, we get Amdahl’s Law:

$$S_p = \frac{T_1}{T_p} \le \frac{1}{f + \frac{1 - f}{p}}$$
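Plugging in the numbers from the earlier example ($f = 0.2$, $p = 8$) reproduces the speedup we computed by hand, and letting $p \to \infty$ shows the hard ceiling imposed by the serial fraction:

$$S_8 \le \frac{1}{0.2 + \frac{0.8}{8}} = \frac{1}{0.3} \approx 3.33, \qquad \lim_{p \to \infty} S_p \le \frac{1}{f} = \frac{1}{0.2} = 5$$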

Continue here: 08 Divide & Conquer and Executor Service