Lecture from: 19.03.2024 | Video: Video ETHZ
ForkJoin Framework & Task Parallel Algorithms
Using the ForkJoin Framework: Recursive Sum Example
Continuation…
Now, let's define the SumForkJoin task:
```java
import java.util.concurrent.RecursiveTask;

class SumForkJoin extends RecursiveTask<Long> {
    int low;
    int high;
    int[] array;

    SumForkJoin(int[] arr, int lo, int hi) {
        array = arr;
        low = lo;
        high = hi;
    }

    protected Long compute() { /* ... */ } // We'll fill this in next
}
```
- `SumForkJoin extends RecursiveTask<Long>`: We extend `RecursiveTask<Long>` because our task returns a `Long` (the sum).
- Fields: We have fields to store the input data (the array and the range).
- `compute()` method: This is where the actual work will happen.
Here's the `compute()` method:
```java
protected Long compute() {
    if (high - low <= 1) { // Base case: a single element
        return (long) array[low];
    } else {
        int mid = low + (high - low) / 2;
        SumForkJoin left = new SumForkJoin(array, low, mid);
        SumForkJoin right = new SumForkJoin(array, mid, high);
        left.fork();                        // Start the left subtask asynchronously
        // right.fork();                    // WRONG: don't fork both halves
        long rightResult = right.compute(); // Compute the right subtask *in the current thread*
        long leftResult = left.join();      // Wait for the left subtask and get its result
        return leftResult + rightResult;
    }
}
```
- Base Case: If the range is small enough (in this case, a single element), we return the value.
- Divide: We split the range in half.
- Conquer:
  - We create a `SumForkJoin` task (`left`) for the left half.
  - We `fork()` the `left` task. This adds it to the `ForkJoinPool`'s work queue, making it eligible for execution by a worker thread.
  - We directly call `compute()` on the `right` task. This executes the right subtask within the current thread. This is the key optimization to reduce thread creation.
  - We `join()` the `left` task. This waits for the `left` task to complete and retrieves its result.
- Combine: We add the results from the left and right subtasks and return the sum.
Key Differences from the Thread Version:
- We extend `RecursiveTask<Long>` instead of `Thread`.
- We override `compute()` instead of `run()`. `compute()` returns a value (the sum).
- We use `fork()` to initiate asynchronous execution and `join()` to wait for and retrieve results.
- We recursively call `compute()` for one of the subtasks, reducing thread creation.
However, this initial version may perform quite badly! Here's a better approach, using a `SEQUENTIAL_THRESHOLD`:
```java
static final int SEQUENTIAL_THRESHOLD = 1000; // example value; tune experimentally

protected Long compute() {
    if (high - low <= SEQUENTIAL_THRESHOLD) {
        long sum = 0;
        for (int i = low; i < high; ++i)
            sum += array[i];
        return sum;
    } else {
        int mid = low + (high - low) / 2;
        SumForkJoin left = new SumForkJoin(array, low, mid);
        SumForkJoin right = new SumForkJoin(array, mid, high);
        left.fork();
        long rightAns = right.compute();
        long leftAns = left.join();
        return leftAns + rightAns;
    }
}
```
Key takeaways:
- Add a sequential threshold.
- The library needs to “warm up”, so you may have to run computations in a loop.
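Putting the pieces together, the root task is handed to a `ForkJoinPool`, whose `invoke()` blocks until the result is ready. A self-contained sketch (the threshold of 1000 and the demo data are illustrative choices, not prescribed by the framework):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumDemo {
    static class SumForkJoin extends RecursiveTask<Long> {
        static final int SEQUENTIAL_THRESHOLD = 1000; // tuning knob

        final int[] array;
        final int low, high;

        SumForkJoin(int[] arr, int lo, int hi) {
            array = arr;
            low = lo;
            high = hi;
        }

        @Override
        protected Long compute() {
            if (high - low <= SEQUENTIAL_THRESHOLD) {
                long sum = 0;
                for (int i = low; i < high; ++i)
                    sum += array[i];
                return sum;
            }
            int mid = low + (high - low) / 2;
            SumForkJoin left = new SumForkJoin(array, low, mid);
            SumForkJoin right = new SumForkJoin(array, mid, high);
            left.fork();                     // left half runs asynchronously
            long rightAns = right.compute(); // right half runs in this thread
            long leftAns = left.join();
            return leftAns + rightAns;
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);
        // invoke() submits the root task and waits for the final result
        long sum = ForkJoinPool.commonPool().invoke(new SumForkJoin(data, 0, data.length));
        System.out.println(sum); // prints 1000000
    }
}
```

Note that we use the JVM-wide common pool here rather than constructing a new `ForkJoinPool`, which avoids creating extra worker threads.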
Practical Considerations
While the ForkJoin Framework is powerful, there are some practical considerations:
- Sequential Threshold: As with our earlier examples, using a sequential cutoff is essential for good performance. The ForkJoin Framework documentation recommends tasks that perform roughly 100-5000 basic operations.
- Warm-up: The Java Virtual Machine (JVM) performs just-in-time (JIT) compilation and optimization. The first few executions of your parallel code might be slower as the JVM optimizes the code. Run your computations in a loop to see the “long-term” benefits.
- Overhead: Even with the ForkJoin Framework, there’s still some overhead associated with task creation and management. For very small problems, the overhead might outweigh the benefits of parallelism.
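One way to see the warm-up effect is to time the same computation in a loop; later iterations are typically faster once the JIT has compiled the hot paths. A sketch (the array size and iteration count are arbitrary; a parallel stream is used here purely as a convenient parallel summation, and it runs on the common `ForkJoinPool` under the hood):

```java
import java.util.stream.IntStream;

public class WarmupDemo {
    public static void main(String[] args) {
        int[] data = IntStream.range(0, 2_000_000).toArray();
        for (int run = 0; run < 5; run++) {
            long t0 = System.nanoTime();
            // widen to long before summing to avoid int overflow
            long sum = IntStream.of(data).parallel().asLongStream().sum();
            long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("run " + run + ": sum=" + sum + " took " + elapsedMs + " ms");
        }
    }
}
```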
Beyond Summation: Maps and Reductions
We’ve focused on array summation, but the divide-and-conquer, fork-join approach applies to a much broader class of problems.
The same structure applies to other similar operations.
Examples
- Maximum/Minimum: Find the maximum or minimum element in an array.
- Element Search: Check if an array contains an element satisfying a given property (e.g., is there a 17?).
- Leftmost Element: Find the leftmost element satisfying a given property.
- Counts: Count the number of elements satisfying a given property (e.g., the number of strings starting with a vowel).
These are all examples of reductions.
Reductions
- A reduction takes a collection of data and produces a single answer.
- The combining operation must be associative, e.g., `(a + b) + c = a + (b + c)`.
- Examples: sum, max, min, count, product, and, or.
- Non-examples: median, subtraction, exponentiation (these are not associative).
The results of reductions don’t have to be single numbers. They can be more complex objects, like arrays or custom data structures.
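As an illustration, a parallel maximum reuses the exact fork/compute/join skeleton of the sum; only the base case and the combine step change. A sketch (the class name and threshold value are my choices):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class MaxDemo {
    static class MaxForkJoin extends RecursiveTask<Integer> {
        static final int SEQUENTIAL_THRESHOLD = 1000; // tuning choice

        final int[] array;
        final int low, high;

        MaxForkJoin(int[] arr, int lo, int hi) {
            array = arr;
            low = lo;
            high = hi;
        }

        @Override
        protected Integer compute() {
            if (high - low <= SEQUENTIAL_THRESHOLD) {
                int max = Integer.MIN_VALUE;
                for (int i = low; i < high; i++)
                    max = Math.max(max, array[i]);
                return max;
            }
            int mid = low + (high - low) / 2;
            MaxForkJoin left = new MaxForkJoin(array, low, mid);
            MaxForkJoin right = new MaxForkJoin(array, mid, high);
            left.fork();
            int rightMax = right.compute();
            int leftMax = left.join();
            return Math.max(leftMax, rightMax); // combine with max instead of +
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 17, 3, 9, 11};
        System.out.println(ForkJoinPool.commonPool()
                .invoke(new MaxForkJoin(data, 0, data.length))); // prints 17
    }
}
```

`max` works as the combine step precisely because it is associative; the same skeleton with a non-associative combiner (say, subtraction) would give split-dependent answers.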
Maps
A map operation applies a function to each element of a collection independently, producing a new collection of the same size. There’s no combining of results. Maps are embarrassingly parallel.
Example:
```java
int[] vector_add(int[] arr1, int[] arr2) {
    assert arr1.length == arr2.length;
    int[] result = new int[arr1.length];
    // FORALL is pseudocode for "do all iterations in parallel" -- this is a map:
    // FORALL (i = 0; i < arr1.length; i++) {
    //     result[i] = arr1[i] + arr2[i];
    // }
    // It can be implemented with a simple fork/compute pair over index ranges.
    return result;
}
```
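The commented-out FORALL loop can be sketched with a `RecursiveAction` (a ForkJoin task that returns no value): each output index is written independently, so the two halves never conflict. The class name and cutoff below are illustrative:

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class VectorAddDemo {
    static class VectorAddTask extends RecursiveAction {
        static final int CUTOFF = 1000; // illustrative sequential cutoff

        final int[] arr1, arr2, result;
        final int low, high;

        VectorAddTask(int[] a, int[] b, int[] res, int lo, int hi) {
            arr1 = a; arr2 = b; result = res; low = lo; high = hi;
        }

        @Override
        protected void compute() {
            if (high - low <= CUTOFF) {
                for (int i = low; i < high; i++)
                    result[i] = arr1[i] + arr2[i];
            } else {
                int mid = low + (high - low) / 2;
                VectorAddTask left = new VectorAddTask(arr1, arr2, result, low, mid);
                left.fork();                                          // left half asynchronously
                new VectorAddTask(arr1, arr2, result, mid, high).compute(); // right half here
                left.join(); // nothing to retrieve; join() just waits for completion
            }
        }
    }

    public static void main(String[] args) {
        int[] a = new int[5000], b = new int[5000], res = new int[5000];
        Arrays.fill(a, 2);
        Arrays.fill(b, 3);
        ForkJoinPool.commonPool().invoke(new VectorAddTask(a, b, res, 0, res.length));
        System.out.println(res[0] + " " + res[4999]); // prints "5 5"
    }
}
```

Because a map produces no combined answer, `RecursiveAction` fits better than `RecursiveTask`: the "result" is the side effect of filling the output array.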
Examples of multi-dimensional maps on images (illustrated with images in the lecture):
- Blurring using a 2D mapping
- Edge detection using a 2D mapping
Maps and Reductions: The Workhorses of Parallel Programming
Maps and reductions are fundamental patterns in parallel programming. Many parallel algorithms can be expressed in terms of these two operations. Learning to recognize when a problem can be solved with maps and reductions is a key skill for parallel programming.
MapReduce (Digression)
You might have heard of Google’s MapReduce (or the open-source version, Hadoop). These systems are designed for large-scale data processing on clusters of machines. They use the same basic map and reduce concepts, but they handle the complexities of distributing data and managing fault tolerance across many machines.
Linked Lists vs. Arrays/Trees
Data structures matter! While maps and reductions can be applied to linked lists, they’re much less efficient than on arrays or balanced trees.
- Arrays and Balanced Trees: Allow efficient access to any element (constant time for arrays, logarithmic for balanced trees). This enables efficient divide-and-conquer.
- Linked Lists: Require linear time to reach an arbitrary element, making divide-and-conquer far less effective.

Note that even within sequential, non-parallelized work units, the work unit itself can be subdivided into parallel tasks, as in video games or physics renderings.
Continue here: 11 Parallel Algorithms (Part II)