Lecture from 05.03.2024 | Video: Video ETHZ

wait, notify, notifyAll

Example

Continuation…

Producer-Consumer: v3 (uses wait() and notifyAll()):

  • buffer.wait(): The consumer thread goes to sleep (releases the lock on buffer) and waits to be notified.
  • buffer.notifyAll(): Wakes up all the waiting threads.

This solution works well…
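
A minimal sketch of such a buffer, assuming a simple bounded queue (the class and method names are illustrative, not taken from the lecture code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative bounded buffer using wait()/notifyAll().
class BoundedBuffer<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int capacity;

    BoundedBuffer(int capacity) { this.capacity = capacity; }

    public synchronized void put(T item) throws InterruptedException {
        while (items.size() == capacity) {
            wait();               // releases the lock on this buffer while waiting
        }
        items.addLast(item);
        notifyAll();              // wake up all waiting threads (consumers and producers)
    }

    public synchronized T get() throws InterruptedException {
        while (items.isEmpty()) {
            wait();               // consumer sleeps until an item is available
        }
        T item = items.removeFirst();
        notifyAll();              // a slot became free: wake waiting producers
        return item;
    }
}
```

notifyAll() is used rather than notify() because producers and consumers wait on the same monitor, and waking a single arbitrary thread could wake the "wrong kind" of thread.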

Beyond Synchronization

So if we want to go beyond plain synchronization, we use wait(), notify(), and notifyAll():

  • wait(): Releases the object’s lock, and the thread waits on an internal queue.
  • notify(): Wakes up a single waiting thread; which one is chosen is arbitrary (up to the JVM), not necessarily the highest-priority one.
  • notifyAll(): Wakes up all waiting threads. The awakened threads then compete for the lock non-deterministically.

Note that wait(), notify(), and notifyAll() can only be called while the object's lock is held.

Pitfalls

Why the loop is needed: after wait() returns, the waited-for condition may no longer hold (another thread may have consumed the item in the meantime, and the JVM also allows spurious wakeups), so the condition must be re-checked before continuing.

Why synchronized is needed: wait() and notify() must be called while holding the object's monitor (otherwise an IllegalMonitorStateException is thrown), and without the lock a notification could arrive between checking the condition and calling wait(), so the wakeup would be lost. The sketch below shows both pitfalls.
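
A minimal sketch of the broken pattern, purely for illustration (the buffer field and method names are hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class WaitPitfalls {
    private final Deque<Integer> buffer = new ArrayDeque<>();

    // BROKEN: illustrates both pitfalls; do not use.
    void brokenGet() throws InterruptedException {
        if (buffer.isEmpty()) {    // (1) 'if' instead of 'while': by the time this
            buffer.wait();         //     thread wakes up and reacquires the lock,
        }                          //     another consumer may have emptied the
        buffer.removeFirst();      //     buffer again; wakeups can also be spurious.
    }
    // (2) brokenGet() does not hold buffer's monitor: buffer.wait() throws
    //     IllegalMonitorStateException, and a notify() could also be lost between
    //     the isEmpty() check and the call to wait().

    // Correct shape: hold the monitor and re-check the condition in a loop.
    int correctGet() throws InterruptedException {
        synchronized (buffer) {
            while (buffer.isEmpty()) {
                buffer.wait();
            }
            return buffer.removeFirst();
        }
    }
}
```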

Nested Lockout Problem

If a synchronized method contains blocking code (e.g., a wait() on some other object, or another blocking call), the outer lock is not released while the thread blocks, which can lead to deadlock: the thread that would unblock it can never acquire that outer lock. There’s no simple, universal solution to this problem. You may need to avoid blocking calls within synchronized methods or provide non-synchronized alternatives.
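
A hedged sketch of how this can happen (all names here are illustrative): a thread blocks on an inner lock while still holding the outer monitor, so the thread that could wake it up can never get in.

```java
// Illustrative sketch of the nested (monitor) lockout problem.
class NestedLockout {
    private final Object inner = new Object();
    private boolean ready = false;

    // Holds the monitor of 'this' while waiting on 'inner':
    // wait() releases only 'inner', never 'this'.
    public synchronized void awaitReady() throws InterruptedException {
        synchronized (inner) {
            while (!ready) {
                inner.wait();   // blocks while still holding 'this'
            }
        }
    }

    // Can never be entered while awaitReady() is blocked, because it
    // also needs the monitor of 'this' -> the threads are stuck forever.
    public synchronized void setReady() {
        synchronized (inner) {
            ready = true;
            inner.notifyAll();
        }
    }
}
```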

Parallel Architectures

This part delves into the hardware aspects that enable parallel processing. We will explore how computer architecture has evolved to facilitate the execution of multiple tasks simultaneously, leading to improved performance. We’ll also discuss how these hardware choices influence software development.

Parallelism vs. Concurrency (Recap)

It’s important to clarify the often-interchanged terms “parallelism” and “concurrency”:

  • Parallelism: This is about using extra hardware resources to speed up a specific computation. The goal is faster execution.
  • Concurrency: This is about correctly and efficiently managing access to shared resources among different tasks. The goal is to handle multiple tasks seemingly at the same time, even if not truly simultaneously.

While these terms are often used interchangeably, understanding the nuance helps to frame the design and architectural choices we will explore.

Parallel and Concurrent vs. Distributed (Preview)

We’ll briefly touch upon the differences between parallel/concurrent computing and distributed computing:

  • Parallel and Concurrent Computing: Typically operate under the assumption of a single, unified “system,” focusing on managing cores and memory within a single computer.
  • Distributed Computing: Involves physically separate systems, administrative boundaries, and different domains, resulting in significant challenges regarding reliability, fault tolerance, and inter-system communication.

Motivation for Learning about Parallel Architectures

Why should software developers care about the underlying hardware? The lecture is designed to give you a high-level intuition for:

  • Architectural Challenges & Choices: An understanding of the constraints and trade-offs faced by hardware architects.
  • Why Multicores? Understand the limitations of single-core performance improvements and why multicores are important.
  • Performance Implications: Understanding the importance of concepts like caches and data locality. How should the architecture influence the software?
  • Transfer of Challenges: Some of the challenges of hardware design also manifest in software and influence how we need to design a parallel program.

A Glimpse Inside: From Von Neumann to Memory Hierarchies

Today’s computers come in many forms: laptops, PCs, phones, or even game consoles.

However, underneath it all, they are often based on a similar design. Here, a motherboard with its many connectors:

We’ll explore the basic principles upon which most computers are built and see how the architecture has adapted to enable parallelism.

Basic Principles: Von Neumann Architecture

Most computers are based on the Von Neumann architecture, in which program data and program instructions share the same memory space. While a cornerstone of modern computing, this architecture also has limitations.

  • The ENIAC used the Harvard architecture and represented a vastly different approach to general-purpose computing.
  • The Von Neumann architecture, however, is much simpler: it uses a single address space (for data and code) and requires only one bus.

It’s also worth noting that imperative programming languages align with Von Neumann:

  • Imperative programming languages use statements to dictate a specific sequential series of steps, which “matches” how the hardware operates.

  • One must then question whether it is “natural”, or whether it is simply tailored to the hardware (as opposed to functional languages).

  • A graphic depicting the memory and CPU:

  • John Backus, the inventor of FORTRAN, also famously asked whether programming “can be liberated from the von Neumann style.”

CPU Caches: Bridging the Memory Gap

CPUs became increasingly fast, while memory got much bigger but not correspondingly faster, so accessing memory becomes a bottleneck.

To address this, systems incorporate CPU Caches:

  • Locality: Data accesses tend to be related both spatially and temporally (spatial and temporal locality); caches exploit this, and being aware of it also helps when optimizing code.
  • Encapsulation: Well-encapsulated code tends to reason about and access its data locally, which in turn helps each thread keep its working set in the cache.

CPU and Memory Hierarchies

A typical CPU and memory hierarchy, without a multi-core architecture:

Because speed and capacity trade off against each other, caches and memory are organized in a hierarchy:

  • Higher caches are smaller: Fast memory is expensive, so the levels closer to the CPU have less capacity.
  • Higher caches are faster: Speed is the priority near the CPU, and it comes at the expense of capacity.

Multi-Core CPUs and Hierarchies

A typical CPU and memory hierarchy, with a multi-core architecture:

Modern multi-core CPUs have caches per core, which deepens the memory hierarchy. Keeping track of where the up-to-date copy of a memory location lives across all of these caches becomes considerably more complicated.

General View

Caches are faster and smaller than main memory, and they are organized in multi-level hierarchies.

Cache Coherency Protocol (with Code examples)

It can be useful to visualize where copies of a variable may live across main memory, the caches, and the CPUs.

The cache coherency protocol ensures data consistency across the cores’ caches. However, this also means that:

  • The CPU can postpone writes (e.g., in store buffers) and use prefetchers; these optimizations can change the order in which memory operations become visible to other cores.

To control this reordering, memory barriers are placed:

  • In Java: synchronized blocks (and volatile variables) insert the required memory barriers.
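
A minimal Java sketch of what these barriers buy us (class and field names are illustrative): without volatile or synchronized access, the reader thread might never observe the writer’s update; the volatile flag inserts the necessary barriers and establishes a happens-before edge.

```java
class VisibilityExample {
    // Without 'volatile' (or synchronized access), the JIT/CPU may keep 'done'
    // in a register or reorder accesses, so the reader might never see the update.
    private volatile boolean done = false;
    private int result = 0;

    void writer() {
        result = 42;        // ordinary write
        done = true;        // volatile write: acts as a barrier, making the
    }                       // earlier write to 'result' visible to readers

    void reader() {
        while (!done) {     // volatile read: guaranteed to eventually see 'true'
            Thread.onSpinWait();
        }
        System.out.println(result);   // prints 42 (happens-before via 'done')
    }
}
```

A synchronized block around the accesses would achieve the same effect, since releasing and acquiring a monitor also act as barriers.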

Cache Misses

To speed up access, data is stored in the cache in fixed-size blocks called cache lines, which hold recently accessed memory for later reuse.

A cache miss occurs when the program needs data that is not currently available in the cache. The entire cache line containing that data must then be loaded from a slower level of the hierarchy, which is a significant performance hit, also for Java applications.
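
A small sketch of the effect (sizes and names are assumed, not from the lecture): both loops do the same work, but the row-wise traversal walks memory sequentially and mostly hits already-loaded cache lines, while the column-wise traversal jumps around and triggers far more cache misses.

```java
class CacheTraversal {
    static final int N = 4096;
    static final int[][] matrix = new int[N][N];

    // Row-wise: consecutive elements of a row are adjacent in memory,
    // so most accesses hit an already-loaded cache line.
    static long sumRowWise() {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += matrix[i][j];
        return sum;
    }

    // Column-wise: each access jumps to a different row (a different
    // array object in Java), causing many more cache misses.
    static long sumColumnWise() {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += matrix[i][j];
        return sum;
    }

    public static void main(String[] args) {
        long t1 = System.nanoTime();
        sumRowWise();
        long t2 = System.nanoTime();
        sumColumnWise();
        long t3 = System.nanoTime();
        System.out.printf("row-wise: %d ms, column-wise: %d ms%n",
                (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);
    }
}
```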

Beyond Single-Core: Exploiting Parallelism

Approaches to increase parallelism

How can computations be made faster at the hardware level? This requires parallel execution and additional execution units. There are typically three approaches.

Vectorization

Vectorization applies a single instruction to multiple data elements at once; this is called SIMD (Single Instruction, Multiple Data). For example, adding two vectors:

  1. Load register
  2. Operation
  3. Store register

The old (scalar) way processes one element at a time; the new (vectorized) way processes N elements at a time.

  • Element-wise operations like this are embarrassingly parallel.
  • They are backed by hardware vector units, so the loop can be vectorized (see the sketch below).
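
A minimal sketch of such a loop in plain Java (array names are illustrative): every iteration is independent, so the JIT compiler may, although it is not guaranteed to, turn it into SIMD instructions that process several elements per operation.

```java
class VectorAdd {
    // Adds two vectors element by element. Each iteration is independent,
    // so the hardware can apply one SIMD instruction to several elements
    // at a time (load registers, operate, store), instead of one-at-a-time.
    static void add(float[] a, float[] b, float[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {10, 20, 30, 40};
        float[] c = new float[4];
        add(a, b, c);
        System.out.println(java.util.Arrays.toString(c)); // [11.0, 22.0, 33.0, 44.0]
    }
}
```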

Instruction Level Parallelism (ILP)

Here, we are talking about parallelism within a single CPU core: the instruction stream of a sequential program often contains independent instructions that can be executed in parallel.

Independent instructions can be dispatched to different hardware units:

  1. Superscalar CPUs
  2. Speculative Execution
  3. Out-of-Order Execution
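
As a small illustrative sketch (not the code from the slides), consider two independent statements that an out-of-order, superscalar core can issue to different execution units in the same cycle, followed by one that must wait for both:

```java
class IlpExample {
    // Illustrative only: shows the kind of independence ILP exploits.
    static int compute(int x, int y, int u, int v) {
        int a = x * y;   // independent of the next statement:
        int b = u + v;   // a superscalar, out-of-order CPU can issue both
                         // to different execution units in the same cycle
        return a - b;    // depends on both 'a' and 'b', so it must wait
    }

    public static void main(String[] args) {
        System.out.println(compute(3, 4, 1, 2)); // 12 - 3 = 9
    }
}
```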

Example

Shows the ordering of the execution process:

Given this code, this is what happens: