Lecture from: 25.03.2024 | Video: Video ETHZ

Faces4Med: Digital Faces in Medical Applications

While we’ve explored digital faces and avatars primarily through the lens of entertainment and general interaction, this section delves into their specific and often complex applications within the medical domain. The use of digital 3D models and visualization of faces for medical purposes is not new; research in this area dates back 30-40 years.

Early Work and Datasets

Early pioneers recognized the potential of using 3D visualizations derived from medical data for analysis and planning.

The images above showcase some of this early work, demonstrating a 3D head model reconstructed from medical imaging data. Notably, one of the first widely available datasets for this type of research emerged around 1996, providing a crucial resource for the community. These early models even attempted to simulate the elastic properties of the soft tissues between the skin and the skull. However, progress during this era was significantly constrained by limitations in both data acquisition techniques and available computing power.

Evolution Towards Modern Techniques

This foundational research paved the way for the sophisticated face technologies we see today, including high-definition scanning, generative AI for face creation, physics-based animation, and audio-driven facial movements.

Medical vs. Entertainment: The Importance of Internal Structures

A key distinction arises between entertainment and medical applications. While the entertainment industry primarily focuses on accurately modeling the outer appearance (skin texture, wrinkles, expressions), medical applications often require detailed modeling of the internal structures – muscles, fat, and crucially, the underlying bone. This internal fidelity is essential for accurate surgical simulation and planning.

Here is an example illustrating the modeling of these distinct anatomical layers:

Inferring and Modeling Anatomy

In many medical scenarios, it’s beneficial to infer the bone structure from external surface scans (like optical 3D scans) rather than relying solely on CT or MRI data, which involve radiation exposure (CT) or higher cost (MRI). Research explores methods to predict the underlying skeletal shape from facial surface data:

Discretization for Simulation

To apply physics-based simulations (e.g., Finite Element Method - FEM) that predict how tissues deform or how surgery will affect appearance, the continuous anatomical models must be converted into discrete representations, typically meshes (like tetrahedral or hexahedral meshes):

This discretization allows computers to simulate physical forces and predict outcomes.

Medical Use Cases

1. Dental Applications (e.g., Aligners): The image above demonstrates how these techniques can be applied in orthodontics. By modeling the teeth, jaw, and soft tissues, simulations can predict how treatments like dental aligners will not only straighten teeth but also affect the patient’s overall facial appearance.

2. Surgical Simulation (e.g., Jaw Surgery): Predicting the outcome of procedures like corrective jaw surgery (orthognathic surgery) is another vital application. Simulating changes to the jaw bone structure allows surgeons and patients to visualize the potential impact on facial aesthetics and function before the operation takes place.

These simulations provide valuable tools for diagnosis, treatment planning, and patient communication in various medical fields.

ForkJoin Framework & Task Parallel Algorithms

Analyzing Parallel Algorithms

Just like sequential algorithms, parallel algorithms should be both correct and efficient. We’ll focus on analyzing efficiency, assuming correctness.

  • Asymptotic Bounds: We want to analyze the algorithm’s performance in terms of big-O notation, independent of the specific number of processors.
  • ForkJoin Framework Guarantee: The ForkJoin Framework provides an expected-time guarantee that is asymptotically optimal, given the available number of processors and assuming the programmer follows certain guidelines.

Work and Span (Recap)

  • $T_P$: Running time on $P$ processors.
  • $T_1$ (Work): Total amount of work (time to execute on a single processor). Sum of the run-times of all nodes in the DAG.
  • $T_\infty$ (Span/Critical Path): Time to execute on an infinite number of processors. Length of the longest path (critical path) in the DAG.

The DAG (Directed Acyclic Graph)

A program execution using fork and join can be visualized as a DAG:

  • Nodes: Represent pieces of work (tasks).
  • Edges: Represent dependencies. An edge from node A to node B means that B cannot start until A finishes.
    • fork creates two outgoing edges (new thread and continuation of current thread).
    • join creates a node with two incoming edges: one from the last node of the current thread and one from the last node of the task being joined on.

The DAG of a ForkJoin computation therefore has a very simple, nested structure.

Connecting back to the main points:

  • Work $= T_1$
  • Span $= T_\infty$

Speedup:

  • The speedup on $P$ processors is $T_1 / T_P$. If the speedup equals $P$, we have perfect linear speedup.
  • Parallelism: The maximum possible speedup, $T_1 / T_\infty$.

Optimal (Thanks to the ForkJoin Framework)

Ideally, we’d like the execution time on $P$ processors ($T_P$) to be as close as possible to the theoretical lower bounds. The ForkJoin Framework provides a strong guarantee (expected time): $T_P \in O\left(\frac{T_1}{P} + T_\infty\right)$.

  • $\frac{T_1}{P}$: The “work” term. If we have few processors (small $P$), this term dominates.
  • $T_\infty$: The “span” term. If we have many processors (large $P$), this term dominates.

This guarantee is asymptotically optimal. The ForkJoin Framework’s scheduler uses a technique called work-stealing to achieve this.

Division of Responsibility

  • Programmer:
    • Choose a good parallel algorithm.
    • Structure the program using fork and join (or higher-level abstractions like map and reduce).
    • Make sure the tasks are reasonably small and roughly equal in size; the scheduler cannot figure out on its own how long a task will take (see image below).
  • Framework:
    • Schedule tasks onto available processors efficiently (using work-stealing).
    • Keep overhead low.
    • Provide the expected-time guarantee: $T_P \in O(T_1/P + T_\infty)$.

Examples:

Let’s apply this to some examples:

  • Summing an array (or other map/reduce operations); a ForkJoin sketch follows after this list:

    • Work: $T_1 \in O(n)$ (linear work)
    • Span: $T_\infty \in O(\log n)$ (logarithmic span)
  • Hypothetical Example:
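
For the array-summing example, here is a minimal ForkJoin sketch. The class name SumTask, the cutoff value, and the demo in main are illustrative choices, not code from the lecture.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal sketch: sum an int array using the ForkJoin Framework.
class SumTask extends RecursiveTask<Long> {
    static final int CUTOFF = 5_000;   // sequential cutoff; the exact value is a tuning choice
    private final int[] arr;
    private final int lo, hi;          // half-open range [lo, hi)

    SumTask(int[] arr, int lo, int hi) { this.arr = arr; this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= CUTOFF) {                 // small range: sum sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        SumTask left  = new SumTask(arr, lo, mid);
        SumTask right = new SumTask(arr, mid, hi);
        left.fork();                             // run the left half in a new task
        long rightSum = right.compute();         // current thread handles the right half
        long leftSum  = left.join();             // wait for the forked task
        return leftSum + rightSum;
    }
}

class SumDemo {
    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);
        long total = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total);               // prints 1000000
    }
}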

Amdahl’s Law (Revisited)

Amdahl’s Law still applies, even with the ForkJoin Framework. If a program has a significant sequential portion, that portion will limit the overall speedup, no matter how many processors you have.

However, Amdahl’s Law doesn’t mean that parallelism is useless!

  • We can often find new parallel algorithms for problems that seem inherently sequential.
  • We can change the problem or find new applications where parallelism is beneficial.

The Prefix-Sum Problem: A Classic Example

Let’s consider a problem that, at first glance, seems inherently sequential: the prefix-sum problem.

Problem: Given an input array input, produce an output array output where each element output[i] is the sum of all elements in input from index 0 up to i.

output[i] = input[0] + input[1] + ... + input[i]

Sequential Solution:

int[] prefix_sum(int[] input) {
    int[] output = new int[input.length];
    output[0] = input[0];
    for (int i = 1; i < input.length; i++) {
        output[i] = output[i - 1] + input[i];
    }
    return output;
}

This code is inherently sequential. To calculate output[i], you must know the value of output[i-1]. It seems like we can’t parallelize this.

  • Work: $O(n)$
  • Span: $O(n)$

A Parallel Solution: Two Passes

Surprisingly, there’s a clever parallel algorithm for prefix-sum that achieves $O(n)$ work and $O(\log n)$ span. It uses a divide-and-conquer approach with two passes:

  1. Up Pass (Build Tree): We build a binary tree bottom-up. Each node in the tree stores the sum of a range of elements in the input array.

    • The root stores the sum of the entire range [0, n).
    • Each internal node stores the sum of the range covered by its children.
    • Leaves store the values of individual input elements.
  2. Down Pass (Propagate fromLeft): We traverse the tree top-down, passing a value called fromLeft to each node.

    • The root receives a fromLeft value of 0.
    • Each node passes fromLeft to its left child.
    • Each node passes (fromLeft + sum of left child) to its right child.
    • At each leaf, output[i] = fromLeft + input[i].

Up Pass (Algorithm)

A divide-and-conquer calculation of the range sums: each node’s sum is computed from its children’s sums, and the two recursive calls can run in parallel. This pass does $O(n)$ work and runs in $O(\log n)$ span.

Down Pass (Algorithm)

The root node starts with fromLeft = 0. Every other node’s fromLeft is its parent’s fromLeft plus, if it is a right child, the sum stored in its left sibling (a left child simply inherits its parent’s fromLeft). This pass, too, runs in $O(\log n)$ span.
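
Putting the two passes together, here is a compact Java sketch. The class and method names (PrefixSum, buildUp, downPass) are illustrative; the code uses plain recursion for readability, with comments marking where a parallel version would fork the independent recursive calls.

// Sketch of the two-pass parallel-prefix algorithm (recursive, sequential form).
class PrefixSum {
    // Tree node storing the sum of input[lo, hi)
    static class Node {
        int lo, hi;
        int sum;
        Node left, right;
        Node(int lo, int hi) { this.lo = lo; this.hi = hi; }
    }

    // Up pass: build the sum tree bottom-up.
    static Node buildUp(int[] input, int lo, int hi) {
        Node node = new Node(lo, hi);
        if (hi - lo == 1) {                       // leaf: a single element
            node.sum = input[lo];
        } else {
            int mid = lo + (hi - lo) / 2;
            node.left = buildUp(input, lo, mid);  // these two calls are independent:
            node.right = buildUp(input, mid, hi); // a parallel version forks one of them
            node.sum = node.left.sum + node.right.sum;
        }
        return node;
    }

    // Down pass: propagate fromLeft and write the output at the leaves.
    static void downPass(Node node, int fromLeft, int[] input, int[] output) {
        if (node.hi - node.lo == 1) {
            output[node.lo] = fromLeft + input[node.lo];
        } else {
            downPass(node.left, fromLeft, input, output);                  // left child: same fromLeft
            downPass(node.right, fromLeft + node.left.sum, input, output); // right child: add left sum
        }
    }

    static int[] prefixSum(int[] input) {
        int[] output = new int[input.length];
        if (input.length == 0) return output;
        Node root = buildUp(input, 0, input.length);
        downPass(root, 0, input, output);
        return output;
    }
}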

Sequential Cutoff

As always, we can add a sequential cutoff to improve performance by reducing the overhead of task creation for small subproblems.

Generalized Parallel Prefix

The prefix-sum algorithm is a specific instance of a more general pattern called parallel prefix. We can use the same two-pass approach to compute other “prefix” operations:

  • Minimum/maximum of all elements to the left of i.
  • Is there an element to the left of i satisfying a given property?
  • Count of elements to the left of i satisfying a given property.

Pack (Filter): Another Useful Pattern

The pack operation (sometimes called “filter”) is another important parallel pattern.

Problem: Given an input array input and a predicate (a boolean function) f, produce an output array output containing only the elements from input for which f(element) is true. The elements in output should be in the same order as they appear in input.

Example:

  • input: [17, 4, 6, 8, 11, 5, 13, 19, 0, 24]
  • f(x): x > 10
  • output: [17, 11, 13, 19, 24]

This seems hard to parallelize, since we don’t know in advance where each kept element should go in the output array.

Parallel Pack with Prefix-Sum

We can implement an efficient parallel pack using a combination of map, prefix-sum, and another map:

  1. Map (Bit Vector): Apply the predicate f to each element of the input array, producing a bit vector. A 1 in the bit vector indicates that the corresponding element satisfies the predicate (should be included in the output), and a 0 indicates that it should not.

    • input: [17, 4, 6, 8, 11, 5, 13, 19, 0, 24]
    • bits: [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
  2. Prefix-Sum (Indices): Perform a parallel prefix-sum on the bit vector. This gives us the indices where the “true” elements should be placed in the output array.

    • bitsum: [1, 1, 1, 1, 2, 2, 3, 4, 4, 5]
  3. Map (Output): Create the output array. For each element in the input array, if its corresponding bit in the bit vector is 1, copy the element to the output array at the index indicated by the bitsum array (minus 1, because of 0-based indexing).

    output = new array of size bitsum[n-1]
    FORALL(i=0; i < input.length; i++){
      if(bits[i]==1)
        output[bitsum[i]-1] = input[i];
    }

Analysis:

  • Work: $O(n)$ (dominated by the map and prefix-sum operations)
  • Span: $O(\log n)$ (dominated by the prefix-sum operation)
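
To make the three steps concrete, here is a small Java sketch. The class name Pack and the use of IntPredicate are illustrative choices; the plain loops stand in for the parallel map and parallel prefix-sum described above and are written sequentially only to keep the example short.

import java.util.function.IntPredicate;

// Sketch of pack built from the three steps above.
class Pack {
    static int[] pack(int[] input, IntPredicate f) {
        int n = input.length;

        // Step 1: map input to a bit vector (a parallel map in the real algorithm).
        int[] bits = new int[n];
        for (int i = 0; i < n; i++) bits[i] = f.test(input[i]) ? 1 : 0;

        // Step 2: prefix-sum over the bit vector (the parallel prefix-sum in the real algorithm).
        int[] bitsum = new int[n];
        int running = 0;
        for (int i = 0; i < n; i++) { running += bits[i]; bitsum[i] = running; }

        // Step 3: scatter the kept elements to their final positions (a parallel map).
        int[] output = new int[n == 0 ? 0 : bitsum[n - 1]];
        for (int i = 0; i < n; i++) {
            if (bits[i] == 1) output[bitsum[i] - 1] = input[i];
        }
        return output;
    }

    public static void main(String[] args) {
        int[] input = {17, 4, 6, 8, 11, 5, 13, 19, 0, 24};
        int[] out = pack(input, x -> x > 10);
        System.out.println(java.util.Arrays.toString(out)); // [17, 11, 13, 19, 24]
    }
}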

Parallel Quicksort

Finally, let’s revisit Quicksort and see how we can parallelize it. Recall the basic steps of Quicksort:

  1. Pick a pivot: Choose an element from the array.
  2. Partition: Rearrange the array so that elements less than the pivot come before it, and elements greater than the pivot come after it.
  3. Recurse: Recursively sort the subarrays to the left and right of the pivot.

Easy Parallelism

The easiest way to parallelize Quicksort is to perform the two recursive calls in parallel (using fork and join).

  • Work: Remains $O(n \log n)$ (on average).
  • Span: $O(n)$. The partition step is still sequential, so the span recurrence is $T_\infty(n) = O(n) + T_\infty(n/2)$, which is linear.
  • Parallelism: $O(\log n)$. This is not very good.
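
A minimal sketch of this “easy” parallelization: a standard sequential in-place partition, with only the two recursive calls forked. The class name, cutoff value, and pivot choice are illustrative, not prescribed by the lecture.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// "Easy" parallel Quicksort: sequential partition, parallel recursive calls.
class ParallelQuicksort extends RecursiveAction {
    static final int CUTOFF = 1_000;
    private final int[] a;
    private final int lo, hi;   // sorts the half-open range a[lo, hi)

    ParallelQuicksort(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override
    protected void compute() {
        if (hi - lo <= CUTOFF) {            // small range: fall back to sequential sort
            java.util.Arrays.sort(a, lo, hi);
            return;
        }
        int p = partition(a, lo, hi);       // sequential partition: O(n) work and O(n) span
        ParallelQuicksort left  = new ParallelQuicksort(a, lo, p);
        ParallelQuicksort right = new ParallelQuicksort(a, p + 1, hi);
        invokeAll(left, right);             // fork both recursive calls and join them
    }

    // Standard Lomuto partition using the last element as pivot; returns the pivot's index.
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi - 1];
        int i = lo;
        for (int j = lo; j < hi - 1; j++) {
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        }
        int t = a[i]; a[i] = a[hi - 1]; a[hi - 1] = t;
        return i;
    }

    static void sort(int[] a) {
        ForkJoinPool.commonPool().invoke(new ParallelQuicksort(a, 0, a.length));
    }
}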

It turns out, however, that the partition step can be parallelized using the techniques we’ve learned (specifically, pack).

Parallel Partition

We can perform the partition in parallel using two pack operations:

  1. Pack elements less than the pivot: Create a new array containing only the elements from the input array that are less than the pivot.
  2. Pack elements greater than the pivot: Create another new array containing only the elements from the input array that are greater than the pivot.
  3. Place pivot in between.

Each of these pack operations has $O(n)$ work and $O(\log n)$ span. Since we do them in sequence, the total work for the partition is $O(n)$ and the total span is $O(\log n)$.

With a parallel partition, the recurrence for the span of Quicksort becomes:

$T_\infty(n) = O(\log n) + T_\infty(n/2) \in O(\log^2 n)$

This gives us a parallelism of $T_1 / T_\infty \in O(n / \log n)$, which is much better than $O(\log n)$.
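
A sketch of Quicksort with a pack-based partition follows. It is not in place (it copies into fresh arrays), it reuses the hypothetical Pack.pack helper from the earlier sketch, and it adds an "equal" pack to handle pivot duplicates; the two packs and the two recursive calls would all run in parallel in a real implementation.

// Quicksort whose partition is built from pack operations.
class PackQuicksort {
    static int[] sort(int[] a) {
        if (a.length <= 1) return a;
        int pivot = a[a.length / 2];                       // simple pivot choice for the sketch

        int[] less    = Pack.pack(a, x -> x < pivot);      // pack: elements smaller than the pivot
        int[] equal   = Pack.pack(a, x -> x == pivot);     // copies of the pivot value
        int[] greater = Pack.pack(a, x -> x > pivot);      // pack: elements larger than the pivot

        int[] sortedLess    = sort(less);                  // these two recursive calls are
        int[] sortedGreater = sort(greater);               // independent and would be forked

        // Concatenate: sorted smaller part, pivot copies, sorted larger part.
        int[] out = new int[a.length];
        System.arraycopy(sortedLess, 0, out, 0, sortedLess.length);
        System.arraycopy(equal, 0, out, sortedLess.length, equal.length);
        System.arraycopy(sortedGreater, 0, out, sortedLess.length + equal.length, sortedGreater.length);
        return out;
    }
}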

Summary

  • Task Parallelism: A powerful paradigm for expressing parallel computations, based on the idea of dividing work into independent tasks.
  • Divide and Conquer: A natural fit for task parallelism, where problems are recursively broken down into smaller subproblems.
  • ForkJoin Framework: Java’s library for efficient task parallelism, using a work-stealing scheduler.
  • Maps and Reductions: Fundamental building blocks for many parallel algorithms.
  • Parallel Prefix-Sum and Pack: Clever algorithms that demonstrate how seemingly sequential problems can be parallelized.
  • Parallel Quicksort: Achieves $O(\log^2 n)$ span by using a parallel partition based on pack.

Shared Memory Concurrency, Locks, and Data Races

We’ve explored how to achieve parallelism using techniques like divide-and-conquer, often leveraging frameworks like ForkJoin. Now, we shift our focus to a fundamental challenge in parallel programming: managing concurrent access to shared resources, particularly shared memory. This involves understanding critical sections, using locks for mutual exclusion, and identifying potential pitfalls like data races and bad interleavings.

Big Picture (Part I)

Let’s keep the overall system architecture in mind: physical memory, processes (like the JVM), threads (JVM and OS), schedulers, and CPU cores. Our goal now is to understand how threads within the same process can safely interact when they access the same data in memory.

Topics

  • Shared Memory: Critical Sections, Mutual Exclusion, Concept of Locks
  • Locks in Java: Understanding Java’s built-in synchronization mechanisms.
  • Race Conditions: Distinguishing between Data Races and Bad Interleavings.
  • Guidelines for Concurrent Programming: Best practices for writing correct and efficient concurrent code.

Toward Sharing Resources (Memory)

So far, our study of parallel algorithms using fork-join focused on lowering the span (critical path length) via parallel tasks.

These algorithms employed a simple structure to avoid race conditions (problems arising from concurrent access to shared resources):

  • Memory Partitioning: Each thread typically operated on its own distinct piece of memory (e.g., a sub-range of an array).
  • Memory Loaning: When a thread “forked” a new task (the “forkee”), the parent thread would temporarily “loan” memory to the forkee and wouldn’t access that memory again until after “joining” with the forkee.

However, this simple strategy isn’t sufficient in many real-world scenarios:

  • Overlapping Access: Memory access patterns might be unpredictable or inherently overlapping.
  • Independent Tasks, Shared Resources: Threads might be performing different, independent tasks but still need access to the same shared resources (e.g., a shared database connection, a common configuration object).

Managing State: The Core Challenge

The main challenge in parallel programming boils down to managing state, especially when that state can be modified (mutable). There are three primary approaches:

  1. Immutability:
    • The data never changes after creation.
    • This is the best option whenever possible, as it completely eliminates concurrency issues related to modification. If data doesn’t change, multiple threads can read it concurrently without any problems.
  2. Isolated Mutability:
    • The data can change, but only one thread or task is ever allowed to access it.
    • This avoids conflicts because there’s no concurrent access. Think of thread-local data.
  3. Mutable/Shared Data:
    • The data can change, and multiple threads or tasks can potentially access it concurrently.
    • This is the most complex scenario and requires careful synchronization mechanisms to ensure correctness.
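
A brief Java illustration of the first two approaches. The class names and the use of ThreadLocal are illustrative choices, not examples from the lecture.

// 1. Immutability: all fields are final and there are no setters, so any number of threads
//    can read a Point concurrently without synchronization.
final class Point {
    final double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }
    Point translated(double dx, double dy) { return new Point(x + dx, y + dy); }  // returns a new object
}

// 2. Isolated mutability: each thread gets its own mutable counter via ThreadLocal, so the
//    same instance is never accessed concurrently.
class PerThreadCounter {
    private static final ThreadLocal<int[]> count = ThreadLocal.withInitial(() -> new int[1]);
    static void increment() { count.get()[0]++; }   // touches only the current thread's copy
    static int value()      { return count.get()[0]; }
}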

Mutable/Shared Data: The Need for Protection

Mutable shared data is common in shared-memory architectures (like typical multi-core processors).

  • Problem: Concurrent accesses (reads and writes) by multiple threads to the same mutable data can lead to inconsistencies and incorrect program behavior.
  • Solution: We must protect the shared state by ensuring that only one task or thread can access it at any given time.

Dealing with Mutable/Shared State

When state is both mutable and shared, it needs to be protected to maintain consistency. This generally means ensuring:

  • Exclusive Access: Only one thread can modify the state at a time.
  • Atomicity: Operations on the shared state appear to happen instantaneously (“all or nothing”). Intermediate, potentially inconsistent states resulting from partial operations should not be observable by other threads.

Methods for Protection:

  • Locks: Mechanisms provided by programming languages or libraries to ensure exclusive access to a section of code (the critical section).
    • Challenge: Ensuring correctness and good performance with locks can be difficult, especially in large, complex programs.
  • Transactional Memory (TM): A more advanced approach where the programmer declares a sequence of actions that need to be atomic. The underlying system (hardware or software) attempts to execute these actions atomically, often using optimistic concurrency and rollback mechanisms.
    • Benefit: Can be easier for the programmer.
    • Challenge: Achieving good performance can be difficult. (Locks are the more common approach we’ll focus on.)
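
As a preview of protecting mutable shared state with locks, here is a minimal sketch using Java’s built-in synchronization. The BankAccount class and its method names are illustrative; the details of Java locks follow in the next chapter.

// Mutable state shared by several threads, protected with Java's built-in lock.
class BankAccount {
    private int balance = 0;   // mutable state shared by all threads using this account

    // Without synchronization, balance += amount is a read-modify-write that two threads can
    // interleave, losing updates. 'synchronized' makes the method body a critical section:
    // only one thread at a time may execute it on a given BankAccount instance.
    synchronized void deposit(int amount) {
        balance += amount;
    }

    synchronized int getBalance() {
        return balance;
    }
}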

Continue here: 12 Shared Memory Concurrency, Locks and Data Races