Lecture from: 12.03.2024 | Video: Video ETHZ

Parallel Programming: Basic Concepts

Amdahl’s Law: The Limit of Parallel Speedup

Continuation…

What happens if we have an infinite number of processors ($P \to \infty$)?

This is a crucial result. With sequential fraction $f$, Amdahl's Law gives a speedup of $S_P = \frac{1}{f + \frac{1-f}{P}}$ on $P$ processors, so even with an infinite number of processors the speedup is limited by the inverse of the sequential fraction: $S_\infty = \frac{1}{f}$. If 10% of the program is sequential ($f = 0.1$), the maximum speedup is 10, no matter how many processors you have.
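As a quick numeric check of the diminishing returns (computed from the formula above, not taken from the slides): with $f = 0.1$,

$$S_{10} = \frac{1}{0.1 + 0.9/10} \approx 5.3, \qquad S_{100} = \frac{1}{0.1 + 0.9/100} \approx 9.2, \qquad S_\infty = \frac{1}{0.1} = 10.$$

Going from 10 to 100 processors buys less than a factor of two.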

(Figures: speedup $S_P$ versus number of processors for various sequential fractions, showing the diminishing returns of adding processors when a portion of the program is inherently sequential, and a companion graph of the parallel efficiency $S_P / P$.)

Remarks about Amdahl’s Law:

  • It concerns the maximum speedup. Real-world performance will likely be worse due to overheads and architectural limitations.
  • It’s often considered “bad news” because it highlights the limitations of parallelism.
  • The key takeaway: All non-parallel parts of a program, no matter how small, can cause problems and limit scalability.
  • Amdahl’s Law emphasizes the importance of minimizing the sequential fraction of a program. Even small reductions in the sequential portion can lead to significant performance gains.
  • It also suggests that hardware improvements that reduce the execution time of sequential code can be very valuable.

Gustafson’s Law: A More Optimistic View

Gustafson’s Law provides an alternative, more optimistic perspective on parallel speedup. It’s based on different assumptions:

  • Amdahl’s Law: Assumes a fixed problem size. The total amount of work $W$ is constant.
  • Gustafson’s Law: Assumes a fixed execution time. We’re willing to solve larger problems if we have more processors, keeping the execution time constant.

Observations:

  • We consider the problem size, not just the execution time.
  • We assume the run-time, not the problem size, is constant.
  • More processors allow us to solve larger problems in the same amount of time.
  • The parallel part of a program often scales with the problem size. For example, if you double the size of an image to be processed, the amount of parallel work (processing individual pixels) also roughly doubles.

These graphs illustrate Gustafson’s Law. Notice that as we increase the number of processors, we also increase the size of the parallelizable work, keeping the overall execution time roughly constant. The serial portion remains the same size.

Let:

  • $f$: the fraction of time spent on the sequential part (no speedup from this part).
  • $W$: the total work. With $P$ processors running for a fixed time $T$, the work done is $W = fT + P\,(1-f)\,T$, since the parallel part scales with $P$.

A single processor would need time $fT + P(1-f)T$ to do that same work, so the scaled speedup is

$$S_P = \frac{fT + P\,(1-f)\,T}{T} = f + P\,(1-f),$$

which grows linearly in $P$ instead of saturating at $1/f$. This is the distinction between the two laws.
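A worked comparison (computed from the two formulas; note that each law defines $f$ against a different baseline, so this is an illustration rather than an exact apples-to-apples comparison): with $f = 0.1$ and $P = 100$,

$$S_{100}^{\text{Gustafson}} = 0.1 + 100 \cdot 0.9 = 90.1, \qquad S_{100}^{\text{Amdahl}} = \frac{1}{0.1 + 0.9/100} \approx 9.2.$$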

Summary

  • Parallel Speedup: We aim to achieve speedup by using multiple processors.
  • Amdahl’s Law and Gustafson’s Law: Two fundamental laws that provide different perspectives on the limits and potential of parallel speedup. Amdahl’s Law highlights the limitations imposed by the sequential fraction, while Gustafson’s Law emphasizes the ability to solve larger problems with more processors.
  • Parallelism: Task/Thread Granularity: The size of the tasks in a parallel program is a crucial design consideration, balancing the benefits of fine-grained parallelism with the overheads of task management.

Divide and Conquer, Cilk-style Bounds

We’ll be focusing on how to structure our parallel algorithms effectively and how to analyze their performance using a task-based model.

Recap: The Big Picture (Again)

Just to keep us grounded, let’s remember the overall system architecture. We have the physical memory, processes (like the JVM), threads within those processes, the operating system managing everything, and the CPU cores doing the actual work. Our goal is to write programs that utilize these resources efficiently.

A Concrete Example: Summing Array Elements

To make things concrete, we’ll work through a simple example: summing the elements of an array. This might seem trivial, but it illustrates the core principles of parallel algorithm design.

The Sequential Version: A Baseline

Before we even think about parallelism, it’s crucial to have a correct sequential version of our program. Why?

  1. Validation: The sequential version serves as a reference point. We can compare the results of our parallel program to the sequential version to ensure correctness. If they don’t match, we know we have a bug somewhere.
  2. Performance Baseline: We need a baseline to measure the performance improvement (speedup) achieved by our parallel program. We write parallel programs to make things faster, so we need to know how fast the sequential version is.

Here’s the simple, sequential code for summing an array in Java:

public static int sum(int[] input) {
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}

This is straightforward, iterative code. It processes each element of the array one by one, accumulating the sum.

A First (Naive) Attempt at Parallelism

Let’s try a first, intuitive approach to parallelizing this. We could divide the array into four equal parts and have four threads, each summing one part. Then, we combine the four partial sums to get the final result.

  • Create 4 Thread objects: Each thread is responsible for summing a portion of the array.
  • start() the threads: This initiates the parallel execution of the threads.
  • join() the threads: The main thread waits for all the worker threads to finish. This is crucial for correctness.
  • Combine the results: Add up the partial sums calculated by each thread.

Important Warning: This initial approach, while seemingly straightforward, has significant drawbacks. We’ll address these shortly.

Code Walkthrough (First Attempt)

Let’s look at the code (part 1):

class SumThread extends java.lang.Thread {
    int lo; // Starting index (inclusive)
    int hi; // Ending index (exclusive)
    int[] arr; // The array to be summed
    int ans = 0; // The partial sum calculated by this thread
 
    SumThread(int[] a, int l, int h) {
        lo = l;
        hi = h;
        arr = a;
    }
 
    public void run() {
        for (int i = lo; i < hi; i++) {
            ans += arr[i];
        }
    }
}
  • SumThread class: This class extends java.lang.Thread, representing a single thread of execution.
  • lo, hi, arr: These fields store the input data for the thread: the portion of the array it’s responsible for (lo and hi define the range) and a reference to the array itself.
  • ans: This field stores the result of the thread’s computation – the partial sum.
  • SumThread(int[] a, int l, int h) (Constructor): Initializes the thread with its input data.
  • run() method: This is the core of the thread’s work. It’s automatically called when the thread starts. It iterates through its assigned portion of the array and calculates the partial sum.

Key Point: Because the run() method in java.lang.Thread takes no arguments and returns no value, we use fields (lo, hi, arr, ans) to communicate between the main thread and the worker threads.

Now, let’s look at the main part of the program (part 2):

int sum(int[] arr) {
    int len = arr.length;
    int ans = 0;
    SumThread[] ts = new SumThread[4]; // Create an array of 4 threads

    for (int i = 0; i < 4; i++) { // Create the threads
        ts[i] = new SumThread(arr, i * len / 4, (i + 1) * len / 4);
        // BUG: we never call ts[i].start(), so run() is never executed
    }

    for (int i = 0; i < 4; i++) {
        ans += ts[i].ans; // BUG: reads results from threads that never ran
    }

    return ans;
}
  • sum(int[] arr): This is our main function that orchestrates the parallel summation.
  • SumThread[] ts = new SumThread[4];: We create an array to hold our four SumThread objects.
  • First for loop: This loop creates the four threads, assigning each thread a quarter of the array. The starting and ending indices (lo and hi) are calculated to divide the array evenly.
  • Second for loop: Combines the partial results.

Major Problem: This code is incorrect. We never call start() on the threads, so their run() methods never execute and every ts[i].ans is still 0. Merely constructing a Thread object does not run it.

Let's fix that by actually starting the threads:

int sum(int[] arr) {
    int len = arr.length;
    int ans = 0;
    SumThread[] ts = new SumThread[4]; // Create an array of 4 threads

    for (int i = 0; i < 4; i++) { // Create and start the threads
        ts[i] = new SumThread(arr, i * len / 4, (i + 1) * len / 4);
        ts[i].start(); // Actually starts the thread
    }

    for (int i = 0; i < 4; i++) {
        ans += ts[i].ans; // STILL WRONG: the threads may not have finished yet
    }

    return ans;
}

However, we now have a different problem: the second loop runs immediately after the threads are started, so the main thread may read ts[i].ans before thread i has finished (or even begun) its computation. This is a race condition.

Let’s add the join:

int sum(int[] arr) {
    int len = arr.length;
    int ans = 0;
    SumThread[] ts = new SumThread[4];
 
    for (int i = 0; i < 4; i++) {
        ts[i] = new SumThread(arr, i * len / 4, (i + 1) * len / 4);
        ts[i].start();
    }
 
    for (int i = 0; i < 4; i++) {
        ts[i].join(); // Wait for thread i to finish
        ans += ts[i].ans;
    }
 
    return ans;
}
  • ts[i].join();: This line is crucial. The join() method makes the current thread (in this case, the main thread) wait until the thread represented by ts[i] has terminated. This ensures that the ans value is fully computed before we try to read it.

Discussion:

  • The Thread class in Java provides methods like start() and join() that you couldn’t implement yourself. start() creates a new thread of execution, and join() provides a mechanism for synchronization.
  • Race Condition: Without join(), we would have a race condition: the main thread reads each thread’s ans field while that worker thread may still be writing it, leading to unpredictable and incorrect results.
  • Fork/Join: This style of parallel programming, where we create (fork) threads and then wait for them to finish (join), is called fork/join parallelism.
  • Exception: join() can throw InterruptedException; strictly, the calls above need a try/catch block (we return to this below).

Shared Memory and Race Conditions

Although fork/join programming in Java doesn’t require us to think too much about shared memory initially, it’s important to understand that shared memory is involved.

  • lo, hi, arr: These fields are written by the main thread and read by the worker threads. This is shared memory access.
  • ans: This field is written by the worker threads and read by the main thread. This is also shared memory access.

Whenever multiple threads access and modify shared memory, there’s a potential for race conditions. We need mechanisms (like join() in this case, or more generally, locks and other synchronization primitives) to ensure that these accesses happen in a controlled and predictable way.

Problems with Our First Attempt

Even with join(), our first attempt at parallelizing the array summation has several serious drawbacks:

  1. Not Reusable/Efficient: The code is hardcoded to use exactly four threads. This is not portable. If we run the code on a machine with more or fewer cores, it won’t adapt. We want code that can scale with the available hardware.

We can improve this slightly by parameterizing the number of threads:

int sum(int[] arr, int numThreads) {
    int ans = 0;
    SumThread[] ts = new SumThread[numThreads];
    for (int i = 0; i < numThreads; i++) {
        ts[i] = new SumThread(arr, (i * arr.length) / numThreads,
                             ((i + 1) * arr.length) / numThreads);
        ts[i].start();
    }
    for (int i = 0; i < numThreads; i++) {
        ts[i].join();
        ans += ts[i].ans;
    }
    return ans;
}
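For example, a caller might default to the number of logical cores the JVM currently sees (a usage sketch; availableProcessors() reports the logical core count, which can change over the program’s lifetime):

int ans = sum(arr, Runtime.getRuntime().availableProcessors());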

Now, at least, we can control the number of threads. But this still has problems.

  2. Processor Utilization: We want to use the processors that are available to us now. Other programs might be running, or other parts of our own program might be using threads. We might even be running inside another parallel computation! The number of available cores can even change during the execution of our program. Hardcoding the number of threads is inflexible.

  3. Load Imbalance: Although unlikely in the simple case of array summation, in general, different subproblems might take different amounts of time to complete.

For example, imagine we’re applying a complex function f to each element of the array. If f is much slower for some elements than others (e.g., checking if a large number is prime), then dividing the array into equal-sized chunks might lead to some threads finishing much earlier than others. This is called load imbalance, and it reduces efficiency.
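For instance (an illustrative sketch, not code from the lecture), consider a per-element primality test by trial division, whose cost depends heavily on the value:

// Hypothetical per-element work: cost varies wildly between elements.
// Even numbers fail immediately; a large prime n costs ~sqrt(n) iterations.
static boolean isPrime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; d++) {
        if (n % d == 0) return false; // cheap early exit for composites
    }
    return true;
}

A chunk that happens to contain many large primes takes far longer than a chunk full of even numbers, so equal-sized chunks do not mean equal amounts of work.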


A Better Approach: Lots of Small Tasks

The counterintuitive solution to these problems is to use many more threads than the number of processors. This seems like it would be less efficient, but it actually helps us address the problems we’ve identified:

  1. Forward Portability: If we have many small tasks, we can easily adapt to different numbers of processors.
  2. Processor Availability: The system (operating system or runtime library) can dynamically schedule the small tasks onto the available processors.
  3. Load Imbalance: If some tasks take longer than others, the scheduler can redistribute the remaining tasks to keep all processors busy. If the tasks are small enough, the variation in execution time is less likely to cause significant imbalance.

However, this approach requires us to change our algorithm. We can’t just divide the array into a fixed number of large chunks. We need a way to create many small, independent tasks.

Divide and Conquer to the Rescue

This is where the divide and conquer paradigm comes in. It’s a powerful and natural way to express parallelism, especially for problems that can be broken down recursively.

The basic idea of divide and conquer is:

  1. Base Case: If the problem is small enough (cannot be divided further), solve it directly (the “unitary solution”).
  2. Divide: Divide the problem into two (or more) smaller subproblems.
  3. Conquer: Solve the subproblems recursively.
  4. Combine: Combine the solutions to the subproblems to obtain the solution to the original problem.

This is also called recursive splitting.

Sequential Recursive Sum

Let’s rewrite our array summation using a divide-and-conquer approach, but still sequentially:

public static int do_sum_rec(int[] xs, int l, int h) {
    int size = h - l;
    if (size == 1) { // Base case: single element
        return xs[l];
    }
    int mid = size / 2; // Divide: split the array in half
    int sum1 = do_sum_rec(xs, l, l + mid); // Conquer: solve the first half
    int sum2 = do_sum_rec(xs, l + mid, h); // Conquer: solve the second half
    return sum1 + sum2; // Combine: add the results
}

This code does the same thing as the iterative version, but it does it recursively. It might seem less efficient, but it sets the stage for parallelization.
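The initial call covers the whole array, with l inclusive and h exclusive (a usage sketch):

int total = do_sum_rec(xs, 0, xs.length);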

Parallel Recursive Sum (Almost)

Now, let’s try to make this parallel. The key idea is to make the recursive calls concurrently.

Here’s a first attempt at the SumThread class:

public class SumThread extends Thread {
    int[] xs;
    int h, l;
    int result;
 
    public SumThread(int[] xs, int l, int h) {
        super();
        this.xs = xs;
        this.h = h;
        this.l = l;
    }
    public void run() {
        // Do the computation and write the result into the 'result' field
        return;
    }
}

And here’s the run method:

public void run() {
    int size = h - l;
    if (size == 1) {
        result = xs[l];
        return;
    }
    int mid = size / 2;
    SumThread t1 = new SumThread(xs, l, l + mid);
    SumThread t2 = new SumThread(xs, l + mid, h);
    t1.start();
    t2.start();

    t1.join();
    t2.join();
 
    result = t1.result + t2.result;
    return;
}
  • Base Case: If the size of the array segment is 1, we just return the single element.
  • Divide: We split the array segment in half.
  • Conquer: We create two SumThread objects, t1 and t2, to handle the two halves. We start() each thread, making them run concurrently. We use the join calls to synchronize.
  • Combine: We join() both threads (wait for them to finish) and then add their results (t1.result and t2.result) to get the final result for this segment.

Note: the version above doesn’t compile as written, because join() can throw the checked InterruptedException, and run() (which overrides Thread.run()) cannot declare it, so we need a try/catch block.
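A minimal fix looks like this (a sketch; how to react to the interrupt is a design decision, and rethrowing as an unchecked exception is a common shortcut in example code):

try {
    t1.join();
    t2.join();
} catch (InterruptedException e) {
    throw new RuntimeException(e); // simplest option for example code
}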

Problem

Problem: This code, while conceptually correct, is incredibly inefficient. It creates a huge number of threads, one for each recursive call. For a large array, this will quickly overwhelm the system.

We will likely encounter java.lang.OutOfMemoryError: unable to create new native thread.

Why is this so bad?

  • Thread Overhead: Java threads are relatively “heavyweight.” Creating and managing a thread has significant overhead (memory allocation, context switching, etc.).
  • OS Threads: In many Java implementations (including the Oracle JVM), Java threads are mapped to operating system threads. OS threads are even more heavyweight than Java threads. Creating too many OS threads can cripple the system.


In general, using one thread per small task is highly inefficient.

One remedy is to increase the cutoff, i.e., the subproblem size below which we treat the problem as a base case and compute sequentially:

Making it Practical: Sequential Cutoff and Reducing Thread Creation

To make divide-and-conquer practical, we need to address the excessive thread creation. We do this with two key optimizations:

  1. Sequential Cutoff: When the size of the subproblem (the array segment) is small enough, we switch to a sequential computation. We don’t create new threads for very small tasks. A typical cutoff value is around 500-1000 elements. This eliminates most of the thread creation overhead, as the vast majority of the recursive calls will be at the bottom levels of the recursion tree.

  2. Reduce Thread Creation: Instead of creating two threads for each recursive step, we create one thread and do the other subproblem ourselves (in the current thread). This cuts the number of threads created in half.

Here’s the run method with the sequential cutoff:

public void run() {
    int size = h - l;
    if (size < SEQ_CUTOFF) { // Sequential cutoff (e.g., SEQ_CUTOFF = 500)
        for (int i = l; i < h; i++) {
            result += xs[i];
        }
    } else {
        int mid = size / 2;
        SumThread t1 = new SumThread(xs, l, l + mid);
        SumThread t2 = new SumThread(xs, l + mid, h);
        t1.start();
        t2.start();
        t1.join(); // try/catch for InterruptedException omitted for brevity
        t2.join();
        result = t1.result + t2.result;
    }
}

The SEQ_CUTOFF constant determines the threshold below which we switch to sequential execution.

Now, let’s reduce the thread creation:

// Wasteful: two new threads per split
SumThread t1 = ...;
SumThread t2 = ...;
t1.start();
t2.start();
t1.join();
t2.join();
result = t1.result + t2.result;

// Better: only one new thread per split
SumThread t1 = ...;
SumThread t2 = ...;
t1.start();   // fork only t1 onto a new thread
t2.run();     // an ordinary method call: t2's half runs in the current thread
t1.join();    // wait only for the thread we actually started
// no t2.join(): t2 was never started as a thread; run() has already returned
result = t1.result + t2.result;

This version is better than before: the current thread does half of the work itself instead of sitting idle in join(), and only half as many threads are created.

  • We create only one new thread (t1).
  • We handle the other half of the array in the current thread, by calling run() directly as an ordinary method (or by calling an equivalent sequential helper).
  • We join() only the thread we actually started (t1).

This optimization cuts the number of threads created by a factor of two. The key point: fork one subproblem onto a new thread and compute the other one yourself.
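Putting both optimizations together, the run() method might look like this (a sketch; the cutoff value and the exception handling are choices the lecture leaves open):

public void run() {
    int size = h - l;
    if (size < SEQ_CUTOFF) {
        for (int i = l; i < h; i++) { // small enough: just sum sequentially
            result += xs[i];
        }
    } else {
        int mid = size / 2;
        SumThread t1 = new SumThread(xs, l, l + mid);
        SumThread t2 = new SumThread(xs, l + mid, h);
        t1.start();   // fork one half onto a new thread
        t2.run();     // compute the other half in the current thread
        try {
            t1.join(); // wait only for the thread we started
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        result = t1.result + t2.result;
    }
}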

Divide and Conquer: It Really Works!

With these optimizations (sequential cutoff and reducing thread creation), divide-and-conquer becomes a very effective technique for parallel programming.

  • Key Idea: Divide-and-conquer parallelizes the result-combining phase. The recursive calls happen in parallel, and the results are combined as the threads finish.
  • Time Complexity: If you have enough processors, the total execution time is proportional to the height of the recursion tree, which is $O(\log n)$ for a balanced division. This is exponentially faster than the sequential $O(n)$ time.
  • Associativity: This approach often relies on the operation being associative (like addition). Associativity means that the order in which we combine the results doesn’t matter: (a + b) + c is the same as a + (b + c).

We will write many of our parallel algorithms in this divide-and-conquer style.

Important: While we’ve been using Java threads directly for illustration, in practice, we’ll use special libraries designed for fork/join parallelism. These libraries take care of many of the low-level details, such as scheduling tasks and managing threads efficiently.

Recap of problems of threads

We’ll quickly recap. Java threads are actually quite heavyweight:

  • They are typically mapped one-to-one to OS threads.
  • Creating one thread per small task is therefore very inefficient.

Thread Pools: A Better Way to Manage Threads

Instead of creating a new thread for every task, a more efficient approach is to use a thread pool.

  • Thread Pool: A collection of pre-created worker threads that are ready to execute tasks.
  • Tasks: Small units of work that can be executed independently.
  • Scheduling: We submit tasks to the thread pool, and the pool’s scheduler assigns tasks to available threads.

This avoids the overhead of constantly creating and destroying threads.

Java’s Executor Service

Java provides the ExecutorService interface for managing asynchronous tasks using a thread pool.

  • ExecutorService: An interface that represents an asynchronous execution mechanism.
  • ThreadPoolExecutor: A concrete implementation of ExecutorService that uses a thread pool.
  • Tasks: Units of work submitted to the pool; the pool’s scheduler distributes them across the worker threads.

  • submit(Callable<T> task): Submits a task that returns a result (of type T). Returns a Future<T> object.
  • submit(Runnable task): Submits a task that doesn’t return a result. Returns a Future<?> object.
  • Future<T>: A handle representing a submitted task. The caller can use it to check whether the task has completed and to retrieve its result via get(), which blocks until the result is available.

We call this async work: the caller submits a task now and collects the result later.

Callable vs. Runnable

  • Runnable: Represents a task that doesn’t return a value. Its run() method has a void return type.
  • Callable<T>: Represents a task that does return a value (of type T). Its call() method returns a value of type T.
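A minimal Callable/Future sketch (illustrative, not from the lecture; get() can throw checked exceptions, which main declares away here):

import java.util.concurrent.*;

public class CallableDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exs = Executors.newFixedThreadPool(2);
        Future<Integer> f = exs.submit(new Callable<Integer>() {
            public Integer call() {
                return 6 * 7; // computed on a pool thread
            }
        });
        System.out.println(f.get()); // blocks until the result is ready: prints 42
        exs.shutdown();
    }
}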

Example: Hello World with ExecutorService

int ntasks = 1000;
ExecutorService exs = Executors.newFixedThreadPool(4); // Create a pool with 4 threads
 
for (int i = 0; i < ntasks; i++) {
    HelloTask t = new HelloTask("Hello from task " + i);
    exs.submit(t); // Submit the task to the pool
}
 
exs.shutdown(); // Initiate shutdown (doesn't wait for tasks to complete)
  • Executors: A factory class with static methods for creating thread pools.
  • Executors.newFixedThreadPool(4): Creates a thread pool with a fixed number of threads (4 in this case).
  • exs.submit(t): Submits a HelloTask to the thread pool. The pool will assign the task to an available thread.
  • exs.shutdown(): Initiates a graceful shutdown of the ExecutorService. It doesn’t wait for currently executing tasks to finish, but it prevents new tasks from being submitted.

Here’s the HelloTask class:

static class HelloTask implements Runnable {
    String msg;
 
    public HelloTask(String msg) {
        this.msg = msg;
    }
 
    public void run() {
        long id = Thread.currentThread().getId(); // Get the ID of the current thread
        System.out.println(msg + " from thread:" + id);
    }
}

The output shows that the tasks are executed by different threads from the pool.
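Tying the two threads of this lecture together: the array-summation example can be written with a thread pool and Callable tasks (a sketch under this section's assumptions — many small tasks on a few pooled threads — not code from the lecture):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class PoolSum {
    public static long sum(int[] arr, int ntasks) throws Exception {
        // Pool sized to the available cores; ntasks can be much larger.
        ExecutorService exs = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Long>> partials = new ArrayList<>();
        int len = arr.length;
        for (int i = 0; i < ntasks; i++) {
            final int lo = i * len / ntasks;
            final int hi = (i + 1) * len / ntasks;
            partials.add(exs.submit(() -> { // Callable<Long>: returns a partial sum
                long s = 0;
                for (int j = lo; j < hi; j++) s += arr[j];
                return s;
            }));
        }
        long ans = 0;
        for (Future<Long> f : partials) {
            ans += f.get(); // blocks until that partial sum is ready
        }
        exs.shutdown();
        return ans;
    }
}

Because the pool's scheduler hands tasks to threads as they become free, submitting many small tasks here is exactly the "lots of small tasks" strategy from earlier, without the cost of one OS thread per task.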

Continue here: 09 DAG and ForkJoin Framework