Lecture from 28.05.2025 | Video: Video ETHZ

Welcome to our final lecture on Parallel Programming. We will conclude our discussion on Transactional Memory (TM) implementation details and then shift focus to Consensus, a fundamental theoretical concept that helps us understand the power hierarchy of atomic operations. Finally, we’ll briefly introduce Distributed Memory programming via Message Passing as an alternative paradigm to shared memory concurrency, touching upon the Message Passing Interface (MPI) standard.

Last Time

In our previous session, we covered:

  • Consensus (Introduction): We motivated the need to classify the power of atomic operations and introduced the concept of the Consensus Number. We stated the key results that atomic registers have consensus number 1, while Compare-and-Swap (CAS) has consensus number ∞ (infinity).
  • Transactions (Introduction): We motivated Transactional Memory (TM) as a potential solution to the difficulties of both lock-based and lock-free programming. We discussed the core goals (atomicity, isolation), potential benefits (simplicity, composability, optimism), and introduced the basic concepts using Scala-STM as an example, including the retry mechanism for conditional waiting. We also began exploring a conceptual clock-based STM implementation.

Learning Goals for Today

Building directly on last time, our goals are:

  1. Transactional Memory:
    • Finish outlining the conceptual clock-based STM implementation (commit phase).
    • Illustrate TM’s utility with the Dining Philosophers example.
    • Discuss practical issues and limitations of TM.
  2. Consensus (Hierarchy):
    • Formally define the wait-free Consensus problem and Consensus Number.
    • Provide the proof sketches establishing the consensus numbers for atomic registers and CAS.
    • Demonstrate the importance of the consensus hierarchy using an impossibility proof (wait-free queue from atomic registers).
  3. Distributed Memory & Message Passing:
    • Contrast shared memory and distributed memory architectures and programming models.
    • Introduce the core ideas of message passing: isolated state and explicit communication.
    • Briefly mention programming models (CSP, Actors, Go) and the MPI standard.
    • (Note: Due to time constraints, the MPI coverage will be introductory).

Transactional Memory Implementation (Continued)

We previously outlined a conceptual clock-based Software Transactional Memory (STM) system involving transaction birthdates, object timestamps, read-sets, and write-sets. Let’s complete the picture by looking at the commit phase.

Recall the validation requirement: for a transaction T to commit successfully, all objects it read (in its read-set) must still have timestamps less than or equal to T’s birthdate at the moment of commit validation. This ensures that no conflicting write by another transaction was committed after T started but before T attempts to commit. The timeline shows T reading Y, X, and Z: if Z’s timestamp (Z.date) is advanced beyond T’s birthdate by another committing transaction before T commits, T must abort upon validation (indicated by the lightning bolt).

Commit Phase (Conceptual Clock-Based STM)

When a transaction T attempts to commit, a typical blocking, clock-based STM performs the following steps:

  1. Acquire Locks: Lock all objects in the transaction’s read-set and write-set. Locks must be acquired in a globally consistent order (e.g., by object address or ID) to avoid deadlock between committing transactions. This locking ensures that no other transaction can interfere during validation and the write-back phase.
  2. Validate Read-Set: Check that for every object obj in the read-set, obj.timestamp <= T.birthdate. If this check fails for any object, it means a conflicting write has occurred. The transaction must abort: release all acquired locks and discard the write-set.
  3. Get Commit Timestamp: If validation succeeds, increment the global version clock and get its new value, T_commit. This will be the timestamp for the changes made by this transaction.
  4. Apply Write-Set: For each object obj in the write-set, copy the locally buffered changes back to the actual object in global memory and update obj.timestamp = T_commit.
  5. Release Locks: Release all locks acquired in step 1. The transaction has successfully committed.

This timeline illustrates a successful commit. Transaction T reads Y and X and also writes to Y and X (creating local copies in its write-set). At commit time, it locks Y and X, validates that their timestamps are still at most T’s birthdate, obtains a new commit timestamp from the global clock, writes its local copies back to Y and X with the new timestamp, and releases the locks.

This timeline illustrates an aborted commit. T reads Y and X, writes locally to Y and Z. Before T attempts to commit, another transaction commits and updates the timestamp of Z (Z.date) to be greater than T’s birthdate. When T attempts to commit, it locks Y and Z. During validation (step 2), it detects that Z.timestamp > T.birthdate. The validation fails, T releases the locks, discards its write-set (local copies of Y and Z), and aborts.
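To make the five steps concrete, here is a minimal Go-flavored sketch of such a commit routine. All names (TxObject, Transaction, globalClock, ErrAbort) are hypothetical and not taken from any real STM library; the sketch ignores many practical concerns (contention management, invisible readers, nesting) and only mirrors the lock, validate, timestamp, write-back, unlock sequence described above.

```go
// Hypothetical sketch of the commit phase of a clock-based STM.
package stm

import (
	"errors"
	"sort"
	"sync"
	"sync/atomic"
)

var globalClock atomic.Int64 // global version clock

type TxObject struct {
	id        int
	mu        sync.Mutex
	value     any   // committed value
	timestamp int64 // version of the last committed write
}

type Transaction struct {
	birthdate int64
	readSet   map[*TxObject]struct{}
	writeSet  map[*TxObject]any // object -> locally buffered new value
}

var ErrAbort = errors.New("transaction aborted: read-set validation failed")

func (t *Transaction) Commit() error {
	// 1. Collect all objects in the read-set and write-set and lock them
	//    in a globally consistent order (here: by object id) to avoid deadlock.
	var objs []*TxObject
	for o := range t.readSet {
		objs = append(objs, o)
	}
	for o := range t.writeSet {
		if _, alsoRead := t.readSet[o]; !alsoRead {
			objs = append(objs, o)
		}
	}
	sort.Slice(objs, func(i, j int) bool { return objs[i].id < objs[j].id })
	for _, o := range objs {
		o.mu.Lock()
	}
	// 5. Release all locks when we are done, whether we commit or abort.
	defer func() {
		for _, o := range objs {
			o.mu.Unlock()
		}
	}()

	// 2. Validate the read-set: every object read must still carry a timestamp
	//    no newer than the birthdate, otherwise a conflicting commit happened.
	for o := range t.readSet {
		if o.timestamp > t.birthdate {
			return ErrAbort // write-set is discarded; caller may retry
		}
	}

	// 3. Obtain a fresh commit timestamp from the global version clock.
	commitTime := globalClock.Add(1)

	// 4. Write the buffered values back and stamp them with the commit time.
	for o, newValue := range t.writeSet {
		o.value = newValue
		o.timestamp = commitTime
	}
	return nil
}
```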

Example: Dining Philosophers Using TM

Transactional Memory aims to simplify classic concurrency problems. Let’s revisit the Dining Philosophers.

5 philosophers sit at a round table with 5 forks placed between them. Each philosopher alternates between thinking and eating. To eat, a philosopher needs exclusive access to both the fork on their left and the fork on their right. Forks cannot be shared simultaneously.

A naive solution where each philosopher locks their left fork then their right fork leads to deadlock if all philosophers pick up their left fork simultaneously. Solutions involve complex strategies like lock ordering or breaking symmetry.

Now, let’s solve it using a reference-based STM (like Scala-STM).

  1. Represent Forks: Each Fork object contains a transactional reference Ref.View<Boolean> called inUse, initialized to false. STM.newRef(false) creates this transactional variable.
  2. Philosopher State: Each PhilosopherThread holds references to its left and right Fork objects.

  • We create an array of Fork objects.
  • We create an array of PhilosopherThread objects, assigning adjacent forks (with wrap-around for the last philosopher) to each thread.

  • The run method is simple: it loops for a fixed number of meals, each iteration consisting of thinking, pickUpBothForks(), eating, and putDownForks().

  • pickUpBothForks(): This is where the magic happens. The entire action of checking and acquiring both forks is wrapped in STM.atomic { ... }.
    • Inside the transaction: It reads left.inUse.get() and right.inUse.get().
    • Condition Check: if (left.inUse.get() || right.inUse.get()) checks if either fork is already in use.
    • Retry: If either fork is busy, it calls STM.retry(). This aborts the current transaction attempt and puts the philosopher thread to sleep. The STM system will automatically wake and retry the transaction later when one of the variables read in this transaction (i.e., left.inUse or right.inUse) is modified by another committing transaction (specifically, when a fork is put down).
    • Acquire: If both forks are free (get() returned false for both), the transaction proceeds to set left.inUse.set(true) and right.inUse.set(true).
    • Commit: If the transaction reaches the end of the atomic block without retrying or aborting due to external conflicts, the STM system commits the changes (setting both inUse flags to true) atomically.

  • putDownForks(): This is also wrapped in STM.atomic. It simply sets both left.inUse and right.inUse back to false. This commit will wake up any neighboring philosophers who had previously called retry() because they were waiting for one of these specific forks.

TM Advantage: The programmer simply stated the atomic requirement: “check both forks and acquire both if available, otherwise wait”. The complexities of acquiring multiple resources without deadlock, and the conditional waiting, are handled automatically by the STM system’s atomicity guarantee and the retry mechanism. Compare this to the manual lock ordering needed in the traditional solution.

Issues and Challenges with Transactions

While conceptually appealing, TM is not a silver bullet and faces challenges:

  • Semantics: There’s no universal agreement on the “best” semantics, especially for features like nesting (flattened vs. closed vs. open) and isolation levels (strong vs. weak, interaction with non-transactional code).
  • Performance: Achieving good performance can be very difficult. STM systems introduce overhead for tracking read/write sets, version management, conflict detection, logging, and potential rollbacks. HTM has resource limitations. Performance is often highly dependent on the conflict rate.
  • I/O Operations: How do you handle operations with irreversible external side effects (like printing to the screen, sending a network packet, launching a missile) inside a transaction that might need to be aborted and rolled back? This is a major difficulty. Solutions often involve buffering I/O until commit (complex) or simply disallowing I/O within transactions.
  • Summary: Locks are hard; lock-free is often harder. TM attempts to simplify things by abstracting synchronization. However, TM implementations (STM/HTM) are still relatively immature, performance is challenging, and fundamental issues like I/O remain. It has not yet become a mainstream, widely adopted programming model, though it sees use in certain language communities (e.g., Haskell, Clojure) and remains an active research area.

Distributed Memory & Message Passing

We now shift gears from shared memory concurrency (threads, locks, atomics, TM operating on shared variables) to an alternative model: Distributed Memory programming using Message Passing. Note that this is only a very short introduction to the topic.

Recap: Shared Memory Model

So far, we’ve considered parallel/concurrent execution primarily through threads (fork-join) operating within a shared memory space, using various techniques (locking, lock-free, transactional, semaphores, monitors) to coordinate access to shared mutable state.

The core difficulty in shared memory programming stems from managing access to this shared mutable state: locks are complex, and race conditions are easy to introduce.

Question: Can we design parallel programs by avoiding direct sharing of mutable state?

Two Main Alternatives:

  1. Functional Programming: Primarily uses immutable state. If data structures cannot change, concurrent reads require no synchronization. Parallelism often comes from applying functions in parallel over immutable data collections.
  2. Message Passing: Allows mutable state, but this state is isolated and private to each concurrent task (often called a process, actor, or task). Tasks do not directly access each other’s memory. Instead, they cooperate and exchange data by explicitly sending and receiving messages.

Concurrent Message Passing Models/Systems:

  • Models: CSP (Communicating Sequential Processes), Actor Model (popularized by Erlang, used in libraries like Akka). The Go language’s concurrency model (goroutines and channels) is heavily inspired by these.
  • Library Standard: MPI (Message Passing Interface) is the dominant standard library for high-performance computing on distributed memory systems.

Shared vs. Distributed Memory Architectures

  • Shared Memory: Multiple CPUs share access to a common pool of memory. This is typical of multi-core processors in a single machine (SMP - Symmetric Multiprocessing).
  • Distributed Memory: Consists of multiple nodes, each with its own private CPU(s) and memory. Nodes are connected by an interconnect network. Accessing data on another node requires sending a message over the network. This is the architecture of clusters and supercomputers.

Message passing is the natural programming model for distributed memory hardware, but it can also be used effectively within a shared memory machine (where messages are passed via shared memory buffers instead of a network).

Isolated Mutable State via Message Passing

This diagram illustrates the core concept. Each circle represents a task (process/thread/actor) with its own private internal state. Tasks interact only by sending messages along the communication channels (arrows). There is no direct access or modification of another task’s state.

Example: Shared vs. Isolated Counting

In the shared state model, multiple threads trying to increment a counter need to synchronize access (lock or atomic) to the single shared counter variable.

In the isolated state model, each of the four tasks maintains its own private local counter (Local cnt in the diagram). Increment operations are purely local and require no inter-task synchronization.

To get the global sum in the isolated model, communication is required. A task might send “get count” messages to all others, collect replies, and sum them. Or a more structured collective communication pattern (like a reduction) could be used. Reading the “global” state involves explicit message exchange.
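A minimal Go sketch of this idea (the names worker and results are illustrative): each goroutine increments a private counter without any synchronization, and the global sum only comes into existence through explicit communication over a channel.

```go
// Isolated counting: private per-task counters, explicit reduction via messages.
package main

import "fmt"

func worker(id, increments int, results chan<- int) {
	localCnt := 0 // private state: no other goroutine can touch this
	for i := 0; i < increments; i++ {
		localCnt++ // purely local, no locks or atomics needed
	}
	results <- localCnt // communicate the partial count explicitly
}

func main() {
	const numWorkers = 4
	results := make(chan int)
	for id := 0; id < numWorkers; id++ {
		go worker(id, 1000, results)
	}
	total := 0
	for i := 0; i < numWorkers; i++ {
		total += <-results // collect one partial count per worker
	}
	fmt.Println("global count:", total) // 4000
}
```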

Example: Rethinking the Bank Account

  • Sequential: One balance.
  • Shared Parallel: One shared balance requiring locks/atomics (potential bottleneck).
  • Distributed Parallel (Message Passing): Each task (e.g., representing a branch or user session) holds a local balance or budget. Operations primarily affect the local balance. Transfers between tasks require sending messages (e.g., Task A debits its local balance and sends a “credit $X” message to Task B). Communication is explicit and happens only when necessary for cross-task interaction.

This depicts three tasks, each managing its local balance independently, interacting only via messages (not shown) when needed.
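A possible Go sketch of this style (the deposit message type and the channel names are illustrative, not from the lecture): a single goroutine owns the balance, and other tasks interact with it only by sending commands.

```go
// Message-passing bank account: one goroutine owns the balance exclusively.
package main

import "fmt"

type deposit struct {
	amount int
	done   chan struct{} // reply channel, closed once the deposit is applied
}

func account(initial int, deposits chan deposit, balanceReq chan chan int) {
	balance := initial // isolated mutable state, owned by this goroutine only
	for {
		select {
		case d := <-deposits:
			balance += d.amount
			close(d.done)
		case reply := <-balanceReq:
			reply <- balance
		}
	}
}

func main() {
	deposits := make(chan deposit)
	balanceReq := make(chan chan int)
	go account(100, deposits, balanceReq)

	d := deposit{amount: 50, done: make(chan struct{})}
	deposits <- d
	<-d.done // wait until the account task has applied the deposit

	reply := make(chan int)
	balanceReq <- reply
	fmt.Println("balance:", <-reply) // 150
}
```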

Synchronous vs. Asynchronous Messages

Message passing can be:

  • Synchronous: send blocks until the receiver accepts the message. Provides tight synchronization.
  • Asynchronous: send returns immediately after placing the message in a buffer (e.g., mailbox). Sender and receiver are decoupled. Requires buffer management.
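In Go terms (anticipating the next subsection), the two flavors correspond to unbuffered and buffered channels. A small illustrative sketch, not taken from the lecture slides:

```go
// Unbuffered channel = synchronous send; buffered channel = asynchronous send
// up to the buffer capacity.
package main

import "fmt"

func main() {
	// Asynchronous: capacity 2, sends return immediately while space remains.
	mailbox := make(chan string, 2)
	mailbox <- "first"  // does not block
	mailbox <- "second" // does not block; a third send would block
	fmt.Println(<-mailbox, <-mailbox)

	// Synchronous: capacity 0, a send blocks until a receiver takes the value.
	rendezvous := make(chan string)
	go func() { rendezvous <- "hello" }() // blocks until main receives below
	fmt.Println(<-rendezvous)             // sender and receiver meet here
}
```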

Go Language Concurrency Model

Go provides first-class language support for message-passing style concurrency:

  • Goroutines: Lightweight concurrent functions managed by the Go runtime.
  • Channels: Typed communication pipes between goroutines.
    • Unbuffered (default): Synchronous sends/receives.
    • Buffered: Asynchronous sends/receives up to buffer capacity.

This example shows creating channels, starting a goroutine (go hello(...)), sending messages (msgs <- "Hello"), receiving messages (msg := <-msgs), and synchronizing completion (<-done). The use of synchronous channels dictates the execution flow.
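The original slide code is not reproduced in these notes; the following is a reconstruction in the same spirit (the function and channel names are guesses):

```go
// main sends messages to a goroutine over one channel and waits for
// completion on a second one; unbuffered channels make every send a rendezvous.
package main

import "fmt"

func hello(msgs chan string, done chan bool) {
	for msg := range msgs { // receive until the channel is closed
		fmt.Println("got:", msg)
	}
	done <- true // signal completion
}

func main() {
	msgs := make(chan string) // unbuffered: sends synchronize with receives
	done := make(chan bool)

	go hello(msgs, done) // start the goroutine

	msgs <- "Hello" // blocks until hello receives
	msgs <- "World"
	close(msgs) // no more messages

	<-done // wait for hello to finish
}
```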

This second example demonstrates how misuse of synchronous channels (waiting to send when the receiver is waiting to receive from a different channel) can lead to deadlock.
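Again, a reconstructed sketch rather than the original slide code: each goroutine blocks on a send while its partner is blocked on the other channel, so neither makes progress.

```go
// Deadlock from misused synchronous channels.
package main

func main() {
	a := make(chan int) // unbuffered
	b := make(chan int) // unbuffered

	go func() {
		a <- 1 // blocks: main is not receiving from a yet
		<-b    // never reached
	}()

	b <- 2 // blocks: the goroutine is stuck sending on a, never receives from b
	<-a    // never reached; the runtime reports "all goroutines are asleep - deadlock!"
}
```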

Example: Concurrent Prime Sieve Pipeline

A pipeline is a natural fit for message passing. The concurrent prime sieve works by chaining filter tasks: Generator → Filter(2) → Filter(3) → Filter(5) → …

The Go code implements this using channels. Generate sends numbers. Filter receives, filters multiples of its prime, and sends remaining numbers to the out channel. main dynamically creates the pipeline stage by stage.
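The code below follows the classic concurrent prime sieve from the Go documentation, which matches the structure just described (Generate, Filter, and a main that grows the pipeline); it may differ in detail from the slide version.

```go
// Concurrent prime sieve: a pipeline of filter goroutines connected by channels.
package main

import "fmt"

// Generate sends the sequence 2, 3, 4, ... into channel ch.
func Generate(ch chan<- int) {
	for i := 2; ; i++ {
		ch <- i
	}
}

// Filter copies values from in to out, dropping multiples of prime.
func Filter(in <-chan int, out chan<- int, prime int) {
	for {
		i := <-in
		if i%prime != 0 {
			out <- i
		}
	}
}

func main() {
	ch := make(chan int)
	go Generate(ch)
	for i := 0; i < 10; i++ {
		prime := <-ch // the first value out of the current pipeline is prime
		fmt.Println(prime)
		ch1 := make(chan int)
		go Filter(ch, ch1, prime) // extend the pipeline with a new filter stage
		ch = ch1
	}
}
```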

Message Passing Interface (MPI)

While Go integrates message passing into the language, the standard for high-performance, large-scale parallel computing on distributed memory systems is the Message Passing Interface (MPI) library standard.

  • Standard API: Defines functions for sending/receiving messages, collective operations, process management, etc.
  • Library: Implemented as libraries for C/C++/Fortran etc.
  • Portability: Hides hardware details, runs over TCP/IP or specialized networks.

Core MPI Concepts

  • Processes & Groups: An MPI execution consists of multiple independent processes. Processes can be organized into groups.
  • Communicators: A communicator object defines a communication context, encompassing a specific group of processes that are allowed to communicate with each other using that communicator. MPI_COMM_WORLD is the initial communicator containing all processes.
  • Rank: Within a communicator, each process has a unique integer ID from 0 to size-1, called its rank.

Processes use their rank to differentiate their behavior in the SPMD model.

MPI programs typically follow an Init -> Get Size/Rank -> Compute/Communicate -> Finalize structure.

SPMD Model

Most MPI programs use the Single Program, Multiple Data (SPMD) model. The same compiled code runs on all processes, but if (rank == ...) statements control execution paths based on the process rank.

Basic Communication

MPI provides functions like MPI_Send and MPI_Recv for point-to-point communication. Key parameters include the data buffer, count, datatype, destination/source rank, tag, and communicator.

Example: Parallel Sort

A simple parallel sort can use MPI_Send and MPI_Recv to distribute data, sort locally, and merge results, illustrating basic message passing coordination.

Course Conclusion and Final Remarks

As we conclude this course on Parallel Programming, let’s reflect on the broader context and significance of what we’ve learned, and look ahead.

Why Does Performance Matter? The relentless pursuit of computational speed isn’t just an academic exercise. High-performance computing (HPC) has firmly established itself as a cornerstone of modern scientific discovery and technological innovation, often referred to as the “third pillar of science” alongside theory and experimentation.

From simulating the intricate folding of proteins for drug discovery (like understanding the HIV capsid) or accelerating vaccine research using AI, to modeling complex physical phenomena…

…like detailed weather forecasting crucial for daily life and disaster preparedness (where ETH and CSCS have made significant contributions using GPU acceleration)…

…or tackling grand challenges like building accurate digital twins of our entire planet to understand and mitigate climate change, the ability to compute faster and at larger scales directly translates into deeper insights and faster progress.

Indeed, the rapid advancements, particularly in AI – with models now passing medical licensing exams, bar exams, and coding interviews – underscore the transformative power unlocked by massive computation, prompting ongoing discussions about the future of human roles alongside increasingly capable machines.

The Exponential Pace (and Challenges) of Performance: The history of computing performance is staggering.

Achieving one Teraflop (10^12 floating-point operations per second) required a $67 Million supercomputer, ASCI Red, occupying a large room in 1997.

Just 17 years later, in 2014, Intel’s Xeon Phi co-processor offered similar performance on a single PCIe card for a few hundred dollars. And today (Update 2024), a single NVIDIA H100 GPU delivers nearly 1000 TFlop/s in TF32 precision for AI tasks, and almost 4000 TFlop/s in lower FP8 precision – an astonishing increase.

By 2017, the promise of a Teraflop was even reaching mainstream desktop CPUs.

The prediction from 2015 that we’d see a Teraflop in mobile devices within about 7 years (by 2022) turned out to be remarkably accurate, with high-end smartphone chips achieving this level of performance by 2023.

However, this incredible exponential progress (“Moore’s Law” and related scaling) is facing significant headwinds. While we might have once extrapolated towards Petaflops (10^15) on single devices in the near future, reality is proving more complex. Performance scaling is becoming interesting again because the easy paths are ending.

Dennard scaling (reducing voltage/power with transistor size) has largely ended, and fundamental physical limits (like the Landauer limit on energy per bit flip) are becoming relevant. The energy cost of computation, particularly moving data, is a critical bottleneck. This leads to the “Three Ls of modern computing”: optimizing for Spatial Locality (using data nearby), Temporal Locality (reusing data recently accessed), and Control Locality (efficient instruction flow, relevant for architectures like dataflow). Addressing these locality challenges is key to future performance gains.

Architectural shifts, like exploring dataflow models (which promise significantly lower energy per operation by routing data directly between functional units) compared to traditional load-store Von Neumann architectures (where much energy is spent moving data between memory and registers), reflect attempts to overcome these bottlenecks.

Ultimately, with massively parallel processing elements (SIMD units, GPU cores), High-Performance Computing has truly become a data management challenge. Efficiently feeding the compute units and minimizing data movement is paramount.

Modern HPC systems are complex integrations of many technologies: fast CPUs/Multicore/SMP, accelerators (GPUs/FPGAs), high-speed networking (RDMA), efficient floating-point units, and sophisticated software stacks, all working together like a high-performance racing car.

Measuring and Comparing Performance:

The Top500 list provides a long-standing (though sometimes criticized) benchmark for the world’s fastest supercomputers, primarily based on solving a large dense linear system (Ax=b) using the Linpack benchmark. While not perfectly representative of all applications, it offers valuable historical data and drives competition and innovation in the HPC community.

As of the latest list (June 2024), machines like Frontier remain at the top, while new systems like the ALPS supercomputer at the Swiss National Supercomputing Centre (CSCS) represent the state-of-the-art, combining traditional HPC capabilities with massive AI power using advanced hardware like NVIDIA’s Grace CPU and Hopper GPUs. Access to such machines enables cutting-edge research.

Programming these large machines often involves using MPI to coordinate work across thousands of nodes, as demonstrated by this example of a parallel Pi computation running on a supercomputer, scaling the problem across increasing numbers of processors.

Getting Involved:

For students interested in gaining hands-on HPC experience, the Student Cluster Competition (SCC), held at major conferences like ISC (Germany) and SC (USA), is an incredible opportunity. Teams of undergraduate students design, build, and operate a small cluster, running real scientific applications under challenging constraints (like power limits) over a non-stop 48-hour period. ETH has a successful team, “Racklette” (https://racklette.ethz.ch/), and CSCS provides support. Participating offers invaluable practical skills and exposure to the HPC community.

This was the last lecture…