Lecture from: 12.03.2024 | Video: Video ETHZ
Parallel Programming: Basic Concepts
Amdahl’s Law: The Limit of Parallel Speedup
Continuation…
What happens if we have an infinite number of processors ($P \to \infty$)?

This is a crucial result: even with infinitely many processors, the speedup is bounded by the inverse of the sequential fraction $f$, since $S_\infty = \lim_{P \to \infty} \frac{1}{f + (1-f)/P} = \frac{1}{f}$. If 10% of the program is sequential ($f = 0.1$), the maximum speedup is 10, no matter how many processors you have.
Plotting the potential speedup against the number of processors for various sequential fractions visually shows the diminishing returns of adding more processors when a portion of the program is inherently sequential; a companion plot shows efficiency dropping as processors are added.
Remarks about Amdahl’s Law:
- It concerns the maximum speedup. Real-world performance will likely be worse due to overheads and architectural limitations.
- It’s often considered “bad news” because it highlights the limitations of parallelism.
- The key takeaway: All non-parallel parts of a program, no matter how small, can cause problems and limit scalability.
- Amdahl’s Law emphasizes the importance of minimizing the sequential fraction of a program. Even small reductions in the sequential portion can lead to significant performance gains.
- It also suggests that hardware improvements that reduce the execution time of sequential code can be very valuable.
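To make the plateau concrete, here is a small sketch (my own illustration, not lecture code) that evaluates the Amdahl speedup $S_P = \frac{1}{f + (1-f)/P}$ for a 10% sequential fraction:

```java
// Sketch: Amdahl speedup for sequential fraction f and processor count p,
// illustrating the plateau at 1/f.
public class AmdahlDemo {
    static double speedup(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    public static void main(String[] args) {
        double f = 0.1; // 10% of the program is sequential
        for (int p : new int[]{1, 2, 4, 8, 16, 64, 1024}) {
            System.out.printf("P=%4d  speedup=%.2f%n", p, speedup(f, p));
        }
        // The printed speedup approaches 10 = 1/f but never exceeds it.
    }
}
```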
Gustafson’s Law: A More Optimistic View
Gustafson’s Law provides an alternative, more optimistic perspective on parallel speedup. It’s based on different assumptions:
- Amdahl’s Law: Assumes a fixed problem size. The total amount of work ($W$) is constant.
- Gustafson’s Law: Assumes a fixed execution time. We’re willing to solve larger problems if we have more processors, keeping the execution time constant.
Observations:
- We consider the problem size, not just the execution time.
- We assume the run-time, not the problem size, is constant.
- More processors allow us to solve larger problems in the same amount of time.
- The parallel part of a program often scales with the problem size. For example, if you double the size of an image to be processed, the amount of parallel work (processing individual pixels) also roughly doubles.
These graphs illustrate Gustafson’s Law. Notice that as we increase the number of processors, we also increase the size of the parallelizable work, keeping the overall execution time roughly constant. The serial portion remains the same size.
Let:

- $f$: the fraction of time spent on the sequential part (this part gets no speedup).
- $W$: the total work.

If the parallel part scales with the number of processors, then in the same fixed time $P$ processors complete $fW + P(1-f)W$ work, giving the scaled speedup $S_P = f + P(1-f)$, which grows linearly in $P$. This is the distinction the figures make visual: Amdahl fixes the work, Gustafson fixes the time.
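A quick worked comparison (illustrative numbers, not from the lecture): take $f = 0.1$ and $P = 100$. Amdahl's law gives $S_{100} = \frac{1}{0.1 + 0.9/100} \approx 9.2$, and at most $10$ for any $P$, while Gustafson's scaled speedup is $S_{100} = 0.1 + 100 \cdot 0.9 = 90.1$: the same machine looks far more useful once the problem is allowed to grow.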
Summary
- Parallel Speedup: We aim to achieve speedup by using multiple processors.
- Amdahl’s Law and Gustafson’s Law: Two fundamental laws that provide different perspectives on the limits and potential of parallel speedup. Amdahl’s Law highlights the limitations imposed by the sequential fraction, while Gustafson’s Law emphasizes the ability to solve larger problems with more processors.
- Parallelism: Task/Thread Granularity: The size of the tasks in a parallel program is a crucial design consideration, balancing the benefits of fine-grained parallelism with the overheads of task management.
Divide and Conquer, Cilk-style Bounds
We’ll be focusing on how to structure our parallel algorithms effectively and how to analyze their performance using a task-based model.
Recap: The Big Picture (Again)
Just to keep us grounded, let’s remember the overall system architecture. We have the physical memory, processes (like the JVM), threads within those processes, the operating system managing everything, and the CPU cores doing the actual work. Our goal is to write programs that utilize these resources efficiently.
A Concrete Example: Summing Array Elements
To make things concrete, we’ll work through a simple example: summing the elements of an array. This might seem trivial, but it illustrates the core principles of parallel algorithm design.
The Sequential Version: A Baseline
Before we even think about parallelism, it’s crucial to have a correct sequential version of our program. Why?
- Validation: The sequential version serves as a reference point. We can compare the results of our parallel program to the sequential version to ensure correctness. If they don’t match, we know we have a bug somewhere.
- Performance Baseline: We need a baseline to measure the performance improvement (speedup) achieved by our parallel program. We write parallel programs to make things faster, so we need to know how fast the sequential version is.
Here’s the simple, sequential code for summing an array in Java:
public static int sum(int[] input) {
int sum = 0;
for (int i = 0; i < input.length; i++) {
sum += input[i];
}
return sum;
}
This is straightforward, iterative code. It processes each element of the array one by one, accumulating the sum.
A First (Naive) Attempt at Parallelism
Let’s try a first, intuitive approach to parallelizing this. We could divide the array into four equal parts and have four threads, each summing one part. Then, we combine the four partial sums to get the final result.
- Create 4 `Thread` objects: each thread is responsible for summing one portion of the array.
- `start()` the threads: this initiates the parallel execution of the threads.
- `join()` the threads: the main thread waits for all the worker threads to finish. This is crucial for correctness.
- Combine the results: add up the partial sums calculated by each thread.
Important Warning: This initial approach, while seemingly straightforward, has significant drawbacks. We’ll address these shortly.
Code Walkthrough (First Attempt)
Let’s look at the code (part 1):
class SumThread extends java.lang.Thread {
int lo; // Starting index (inclusive)
int hi; // Ending index (exclusive)
int[] arr; // The array to be summed
int ans = 0; // The partial sum calculated by this thread
SumThread(int[] a, int l, int h) {
lo = l;
hi = h;
arr = a;
}
public void run() {
for (int i = lo; i < hi; i++) {
ans += arr[i];
}
}
}
- `SumThread` class: This class extends `java.lang.Thread`, representing a single thread of execution.
- `lo`, `hi`, `arr`: These fields store the input data for the thread: the portion of the array it’s responsible for (`lo` and `hi` define the range) and a reference to the array itself.
- `ans`: This field stores the result of the thread’s computation, the partial sum.
- `SumThread(int[] a, int l, int h)` (constructor): Initializes the thread with its input data.
- `run()` method: This is the core of the thread’s work, called automatically when the thread starts. It iterates through its assigned portion of the array and calculates the partial sum.

Key Point: Because the `run()` method in `java.lang.Thread` takes no arguments and returns no value, we use fields (`lo`, `hi`, `arr`, `ans`) to communicate between the main thread and the worker threads.
Now, let’s look at the main part of the program (part 2):
int sum(int[] arr) {
int len = arr.length;
int ans = 0;
SumThread[] ts = new SumThread[4]; // Create an array of 4 threads
for (int i = 0; i < 4; i++) { // Create and start the threads
ts[i] = new SumThread(arr, i * len / 4, (i + 1) * len / 4);
//This is WRONG: the threads are never started, so run() never executes.
}
for (int i = 0; i < 4; i++) { // WRONG: reads ans values that were never computed
ans += ts[i].ans;
}
return ans;
}
- `sum(int[] arr)`: This is our main function that orchestrates the parallel summation.
- `SumThread[] ts = new SumThread[4];`: We create an array to hold our four `SumThread` objects.
- First `for` loop: Creates the four threads, assigning each thread a quarter of the array. The starting and ending indices (`lo` and `hi`) are calculated to divide the array evenly.
- Second `for` loop: Combines the results.
Major Problem: This code is incorrect. The threads are constructed but never started, so no computation happens at all, and the `ans` values the main thread reads are just their initial zeros. And even once the threads are started, there is a race condition: the main thread may read the `ans` values before the worker threads have finished computing them.

The first step of the fix is to actually start the threads with the `start()` method.
int sum(int[] arr) {
int len = arr.length;
int ans = 0;
SumThread[] ts = new SumThread[4]; // Create an array of 4 threads
for (int i = 0; i < 4; i++) { // Create and start the threads
ts[i] = new SumThread(arr, i * len / 4, (i + 1) * len / 4);
ts[i].start(); //Actually starts the thread
}
for (int i = 0; i < 4; i++) { // WRONG: the threads may not have finished yet (race condition)
ans += ts[i].ans;
}
return ans;
}
However, we now have a different problem: the second loop runs immediately after the threads are started, so the main thread can still read `ans` before a worker has finished. We still have a race condition.
Let’s add the join:
int sum(int[] arr) {
int len = arr.length;
int ans = 0;
SumThread[] ts = new SumThread[4];
for (int i = 0; i < 4; i++) {
ts[i] = new SumThread(arr, i * len / 4, (i + 1) * len / 4);
ts[i].start();
}
for (int i = 0; i < 4; i++) {
ts[i].join(); // Wait for thread i to finish
ans += ts[i].ans;
}
return ans;
}
`ts[i].join();`: This line is crucial. The `join()` method makes the calling thread (here, the main thread) wait until the thread represented by `ts[i]` has terminated. This ensures that the `ans` value is fully computed before we try to read it.
Discussion:

- The `Thread` class in Java provides methods like `start()` and `join()` that you couldn’t implement yourself: `start()` creates a new thread of execution, and `join()` provides a mechanism for synchronization.
- Race Condition: Without `join()`, we would have a race condition: the main thread could read a worker’s `ans` field while that worker is still writing it, leading to unpredictable and incorrect results.
- Fork/Join: This style of parallel programming, where we create (fork) threads and then wait for them to finish (join), is called fork/join parallelism.
- Exception: `join()` can also throw `InterruptedException`, which the caller must catch or declare.
Shared Memory and Race Conditions
Although fork/join programming in Java doesn’t require us to think too much about shared memory initially, it’s important to understand that shared memory is involved.
- `lo`, `hi`, `arr`: These fields are written by the main thread and read by the worker threads. This is shared memory access.
- `ans`: This field is written by a worker thread and read by the main thread. This is also shared memory access.
Whenever multiple threads access and modify shared memory, there’s a potential for race conditions. We need mechanisms (like join()
in this case, or more generally, locks and other synchronization primitives) to ensure that these accesses happen in a controlled and predictable way.
Problems with Our First Attempt
Even with join()
, our first attempt at parallelizing the array summation has several serious drawbacks:
- Not Reusable/Efficient: The code is hardcoded to use exactly four threads. This is not portable. If we run the code on a machine with more or fewer cores, it won’t adapt. We want code that can scale with the available hardware.
We can improve this slightly by parameterizing the number of threads:
int sum(int[] arr, int numThreads) {
int ans = 0;
SumThread[] ts = new SumThread[numThreads];
for (int i = 0; i < numThreads; i++) {
ts[i] = new SumThread(arr, (i * arr.length) / numThreads,
((i + 1) * arr.length) / numThreads);
ts[i].start();
}
for (int i = 0; i < numThreads; i++) {
ts[i].join();
ans += ts[i].ans;
}
return ans;
}
Now, at least, we can control the number of threads. But this still has problems.
- Processor Utilization: We want to use the processors that are available to us now. Other programs might be running, or other parts of our own program might be using threads. We might even be running inside another parallel computation! The number of available cores can even change during the execution of our program. Hardcoding the number of threads is inflexible (though, as the sketch after this list shows, we can at least query how many cores are available).
- Load Imbalance: Although unlikely in the simple case of array summation, in general, different subproblems might take different amounts of time to complete. For example, imagine we’re applying a complex function `f` to each element of the array. If `f` is much slower for some elements than others (e.g., checking if a large number is prime), then dividing the array into equal-sized chunks might leave some threads finishing much earlier than others. This is called load imbalance, and it reduces efficiency.
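As a partial remedy for the hardcoded thread count, we can at least query the number of cores the JVM currently sees. `Runtime.getRuntime().availableProcessors()` is a standard JDK method; the surrounding usage is only a sketch that reuses the parameterized `sum(...)` above (whose `join()` calls need the exception handling mentioned earlier):

```java
// Sketch: derive the thread count from the hardware instead of hardcoding it.
// Note: the number of available cores can still change while the program runs.
int numThreads = Runtime.getRuntime().availableProcessors();
int total = sum(arr, numThreads); // the parameterized sum(...) defined above
```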
A Better Approach: Lots of Small Tasks
The counterintuitive solution to these problems is to use many more threads than the number of processors. This seems like it would be less efficient, but it actually helps us address the problems we’ve identified:
- Forward Portability: If we have many small tasks, we can easily adapt to different numbers of processors.
- Processor Availability: The system (operating system or runtime library) can dynamically schedule the small tasks onto the available processors.
- Load Imbalance: If some tasks take longer than others, the scheduler can redistribute the remaining tasks to keep all processors busy. If the tasks are small enough, the variation in execution time is less likely to cause significant imbalance.
However, this approach requires us to change our algorithm. We can’t just divide the array into a fixed number of large chunks. We need a way to create many small, independent tasks.
Divide and Conquer to the Rescue
This is where the divide and conquer paradigm comes in. It’s a powerful and natural way to express parallelism, especially for problems that can be broken down recursively.
The basic idea of divide and conquer is:
- Base Case: If the problem is small enough (cannot be divided further), solve it directly (the “unitary solution”).
- Divide: Divide the problem into two (or more) smaller subproblems.
- Conquer: Solve the subproblems recursively.
- Combine: Combine the solutions to the subproblems to obtain the solution to the original problem.
This is also called recursive splitting.
Sequential Recursive Sum
Let’s rewrite our array summation using a divide-and-conquer approach, but still sequentially:
public static int do_sum_rec(int[] xs, int l, int h) {
int size = h - l;
if (size == 1) { // Base case: single element
return xs[l];
}
int mid = size / 2; // Divide: split the array in half
int sum1 = do_sum_rec(xs, l, l + mid); // Conquer: solve the first half
int sum2 = do_sum_rec(xs, l + mid, h); // Conquer: solve the second half
return sum1 + sum2; // Combine: add the results
}
This code does the same thing as the iterative version, but it does it recursively. It might seem less efficient, but it sets the stage for parallelization.
Parallel Recursive Sum (Almost)
Now, let’s try to make this parallel. The key idea is to make the recursive calls concurrently.
Here’s a first attempt at the `SumThread` class:
public class SumThread extends Thread {
int[] xs;
int h, l;
int result;
public SumThread(int[] xs, int l, int h) {
super();
this.xs = xs;
this.h = h;
this.l = l;
}
public void run(){
/*Do computation and write to result*/
return;
}
}
And here’s the `run` method:
public void run() {
int size = h - l;
if (size == 1) {
result = xs[l];
return;
}
int mid = size / 2;
SumThread t1 = new SumThread(xs, l, l + mid);
SumThread t2 = new SumThread(xs, l + mid, h);
t1.start();
t2.start();
t1.join();
t2.join();
result = t1.result + t2.result;
return;
}
- Base Case: If the size of the array segment is 1, we just return the single element.
- Divide: We split the array segment in half.
- Conquer: We create two `SumThread` objects, `t1` and `t2`, to handle the two halves, and `start()` each thread, making them run concurrently.
- Combine: We `join()` both threads (wait for them to finish) and then add their results (`t1.result` and `t2.result`) to get the final result for this segment.
Note: the version above doesn’t compile, because `join()` can throw `InterruptedException`, so the calls need a `try`/`catch` block.
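One way to make it compile (a sketch of mine; wrapping the checked exception in an unchecked one is a common convention, not the lecture’s prescribed style):

```java
try {
    t1.join();
    t2.join();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // preserve the interrupt status
    throw new RuntimeException(e);      // surface the failure as unchecked
}
result = t1.result + t2.result;
```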
Problem

This code, while conceptually correct, is incredibly inefficient: it creates a huge number of threads, one for each recursive call. For a large array, this will quickly overwhelm the system.
We will likely encounter `java.lang.OutOfMemoryError: unable to create new native thread`.
Why is this so bad?
- Thread Overhead: Java threads are relatively “heavyweight.” Creating and managing a thread has significant overhead (memory allocation, context switching, etc.).
- OS Threads: In many Java implementations (including the Oracle JVM), Java threads are mapped to operating system threads. OS threads are even more heavyweight than Java threads. Creating too many OS threads can cripple the system.
(Figure: JVM threads versus OS threads.)

In general, using one thread per small task is highly inefficient. We can increase the cutoff (i.e., the problem size below which we treat the subproblem as a base case and solve it sequentially):
Making it Practical: Sequential Cutoff and Reducing Thread Creation
To make divide-and-conquer practical, we need to address the excessive thread creation. We do this with two key optimizations:
-
Sequential Cutoff: When the size of the subproblem (the array segment) is small enough, we switch to a sequential computation. We don’t create new threads for very small tasks. A typical cutoff value is around 500-1000 elements. This eliminates most of the thread creation overhead, as the vast majority of the recursive calls will be at the bottom levels of the recursion tree.
-
Reduce Thread Creation: Instead of creating two threads for each recursive step, we create one thread and do the other subproblem ourselves (in the current thread). This cuts the number of threads created in half.
Here’s the `run` method with the sequential cutoff:
public void run() {
int size = h - l;
if (size < SEQ_CUTOFF) { // Sequential cutoff
for (int i = l; i < h; i++) {
result += xs[i];
}
} else {
int mid = size / 2;
SumThread t1 = new SumThread(xs, l, l + mid);
SumThread t2 = new SumThread(xs, l + mid, h);
t1.start();
t2.start();
t1.join();
t2.join();
result = t1.result + t2.result;
}
}
The `SEQ_CUTOFF` constant determines the threshold below which we switch to sequential execution.
Now, let’s reduce the thread creation:
// Wasteful:
SumThread t1 = ...;
SumThread t2 = ...;
t1.start();
t2.start();
t1.join();
t2.join();
result = t1.result + t2.result;
// Better:
SumThread t1 = ...;
// NO T2
t1.start();
//t2.start(); // NO, we compute it on our current thread
t2.run();
t1.join();
//t2.join(); NO, we didn't start the thread
result = t1.result + t2.result;
This version is better than before: the second half is computed by an ordinary method call in the current thread rather than in a freshly created thread, which is much faster.
- We create only one thread (`t1`).
- We directly call the `sum` function (or `run` method, depending on the implementation) to handle the other half of the array in the current thread.
- We `join()` only the thread we created (`t1`).
This optimization cuts the number of threads created by a factor of two. The key point: fork a thread for one subproblem, and compute the other one yourself.
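Putting the two optimizations together, a complete `run()` might look like the following sketch (based on the `SumThread` fields defined earlier; the value of `SEQ_CUTOFF` and the exception-handling policy are my assumptions):

```java
public void run() {
    int size = h - l;
    if (size < SEQ_CUTOFF) {
        // Sequential cutoff: small segments are summed directly, no new threads.
        for (int i = l; i < h; i++) {
            result += xs[i];
        }
    } else {
        int mid = size / 2;
        SumThread t1 = new SumThread(xs, l, l + mid);
        SumThread t2 = new SumThread(xs, l + mid, h);
        t1.start();  // fork: first half runs in a new thread
        t2.run();    // second half runs right here, in the current thread
        try {
            t1.join(); // wait only for the thread we actually started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        result = t1.result + t2.result;
    }
}
```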
Divide and Conquer: It Really Works!
With these optimizations (sequential cutoff and reducing thread creation), divide-and-conquer becomes a very effective technique for parallel programming.
- Key Idea: Divide-and-conquer parallelizes the result-combining phase. The recursive calls happen in parallel, and the results are combined as the threads finish.
- Time Complexity: If you have enough processors, the total execution time is proportional to the height of the recursion tree, which is $O(\log n)$ for a balanced division. This is exponentially faster than the sequential $O(n)$ time.
- Associativity: This approach often relies on the operation being associative (like addition). Associativity means that the order in which we combine the results doesn’t matter: `(a + b) + c` is the same as `a + (b + c)`.
We will write many of our parallel algorithms in this divide-and-conquer style.
Important: While we’ve been using Java threads directly for illustration, in practice, we’ll use special libraries designed for fork/join parallelism. These libraries take care of many of the low-level details, such as scheduling tasks and managing threads efficiently.
Recap of Problems with Threads

We’ll quickly recap. Java threads are actually quite heavyweight:

- They are mapped to OS threads, which are heavyweight themselves.
- Creating one thread per small task is therefore very inefficient.
Thread Pools: A Better Way to Manage Threads
Instead of creating a new thread for every task, a more efficient approach is to use a thread pool.
- Thread Pool: A collection of pre-created worker threads that are ready to execute tasks.
- Tasks: Small units of work that can be executed independently.
- Scheduling: We submit tasks to the thread pool, and the pool’s scheduler assigns tasks to available threads.
This avoids the overhead of constantly creating and destroying threads.
Java’s Executor Service
Java provides the `ExecutorService` interface for managing asynchronous tasks using a thread pool.

- `ExecutorService`: An interface that represents an asynchronous execution mechanism.
- `ThreadPoolExecutor`: A concrete implementation of `ExecutorService` that uses a thread pool.
- Tasks: Submitted tasks are distributed across the pool’s threads.
- `submit(Callable<T> task)`: Submits a task that returns a result (of type `T`). Returns a `Future<T>` object.
- `submit(Runnable task)`: Submits a task that doesn’t return a result. Returns a `Future<?>` object.
- `Future`: Represents a submitted task and its eventual result. The user submits a task to the `ExecutorService` and gets a `Future` back.
We call this async work.
`Callable` vs. `Runnable`

- `Runnable`: Represents a task that doesn’t return a value. Its `run()` method has a `void` return type.
- `Callable<T>`: Represents a task that does return a value (of type `T`). Its `call()` method returns a value of type `T`.
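A hypothetical example (mine, not from the lecture) of submitting a `Callable` and retrieving its result through the returned `Future`:

```java
import java.util.concurrent.*;

class CallableDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exs = Executors.newFixedThreadPool(4);
        Future<Integer> sum = exs.submit(() -> {
            int s = 0;
            for (int i = 1; i <= 100; i++) s += i; // some computation with a result
            return s;                              // returning a value makes this a Callable<Integer>
        });
        System.out.println("Result: " + sum.get()); // get() blocks until the task finishes
        exs.shutdown();
    }
}
```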
Example: Hello World with ExecutorService
int ntasks = 1000;
ExecutorService exs = Executors.newFixedThreadPool(4); // Create a pool with 4 threads
for (int i = 0; i < ntasks; i++) {
HelloTask t = new HelloTask("Hello from task " + i);
exs.submit(t); // Submit the task to the pool
}
exs.shutdown(); // Initiate shutdown (doesn't wait for tasks to complete)
- `Executors`: A factory class used to create thread pools.
- `Executors.newFixedThreadPool(4)`: Creates a thread pool with a fixed number of threads (4 in this case).
- `exs.submit(t)`: Submits a `HelloTask` to the thread pool. The pool will assign the task to an available thread.
- `exs.shutdown()`: Initiates a graceful shutdown of the `ExecutorService`. It doesn’t wait for currently executing tasks to finish, but it prevents new tasks from being submitted (see the sketch after this example for how to also wait).
Here’s the `HelloTask` class:
static class HelloTask implements Runnable {
String msg;
public HelloTask(String msg) {
this.msg = msg;
}
public void run() {
long id = Thread.currentThread().getId(); // Get the ID of the current thread
System.out.println(msg + " from thread:" + id);
}
}
The output shows that the tasks are executed by different threads from the pool.
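Since `shutdown()` only initiates shutdown, a common follow-up (an addition of mine, using the standard `awaitTermination` method and `java.util.concurrent.TimeUnit`) is to block until the tasks have actually finished. The timeout is an arbitrary choice, and `awaitTermination` itself throws `InterruptedException`, which real code must handle:

```java
exs.shutdown();                                   // stop accepting new tasks
if (!exs.awaitTermination(1, TimeUnit.MINUTES)) { // block up to 1 minute (arbitrary)
    exs.shutdownNow();                            // timed out: interrupt remaining tasks
}
```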
Continue here: 09 DAG and ForkJoin Framework