Lecture from: 19.03.2024 | Video: Video ETHZ

ForkJoin Framework & Task Parallel Algorithms

Using the ForkJoin Framework: Recursive Sum Example

Continuation…

Now, let’s define the SumForkJoin task:

import java.util.concurrent.RecursiveTask;

class SumForkJoin extends RecursiveTask<Long> {
    int low;      // inclusive lower bound of the range
    int high;     // exclusive upper bound of the range
    int[] array;

    SumForkJoin(int[] arr, int lo, int hi) {
        array = arr;
        low = lo;
        high = hi;
    }

    protected Long compute() { /* ... */ } // We'll fill this in next
}
  • SumForkJoin extends RecursiveTask<Long>: We extend RecursiveTask<Long> because our task returns a Long (the sum).
  • Fields: We have fields to store the input data (the array and the range).
  • compute() method: This is where the actual work will happen.

Here’s the compute() method:

protected Long compute() {
    if (high - low <= 1) { // Base case: the range holds a single element
        return (long) array[low];
    } else {
        int mid = low + (high - low) / 2;
        SumForkJoin left = new SumForkJoin(array, low, mid);
        SumForkJoin right = new SumForkJoin(array, mid, high);

        left.fork();  // Start the left subtask asynchronously
        // right.fork(); // WRONG: don't fork both halves
        long rightResult = right.compute(); // Compute the right subtask *in the current thread*
        long leftResult = left.join();   // Wait for the left subtask and get its result

        return leftResult + rightResult;
    }
}
  • Base Case: If the range is small enough (in this case, a single element), we return the value.
  • Divide: We split the range in half.
  • Conquer:
    • We create a SumForkJoin task (left) for the left half.
    • We fork() the left task. This adds it to the ForkJoinPool’s work queue, making it eligible for execution by a worker thread.
    • We directly call compute() on the right task. This executes the right subtask within the current thread, avoiding the overhead of forking (and later joining) a second task.
    • We join() the left task. This waits for the left task to complete and retrieves its result.
  • Combine: We add the results from the left and right subtasks and return the sum.

Key Differences from the Thread Version:

  • We extend RecursiveTask<Long> instead of Thread.
  • We override compute() instead of run().
  • compute() returns a value (the sum).
  • We use fork() to initiate asynchronous execution and join() to wait for and retrieve results.
  • We recursively call compute() for one of the subtasks, halving the number of tasks that must be forked and joined.
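
To actually run the computation, we hand the root task to a ForkJoinPool. A minimal sketch (the array contents are arbitrary, and the common pool is just one possible choice):

import java.util.concurrent.ForkJoinPool;

public class SumDemo {
    public static void main(String[] args) {
        int[] array = new int[1_000_000];
        java.util.Arrays.fill(array, 1);

        // invoke() runs the root task and blocks until it completes.
        long sum = ForkJoinPool.commonPool()
                               .invoke(new SumForkJoin(array, 0, array.length));
        System.out.println("sum = " + sum); // prints 1000000
    }
}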

However, this initial version may perform quite badly: with a base case of a single element, we create one task per array element, and the task overhead swamps the useful work.

Here’s a better approach, using a sequential cutoff SEQUENTIAL_THRESHOLD:

static final int SEQUENTIAL_THRESHOLD = 5000; // field of SumForkJoin; tune empirically

protected Long compute() {
    if (high - low <= SEQUENTIAL_THRESHOLD) { // Base case: sum the range sequentially
        long sum = 0;
        for (int i = low; i < high; ++i)
            sum += array[i];
        return sum;
    } else {
        int mid = low + (high - low) / 2;
        SumForkJoin left = new SumForkJoin(array, low, mid);
        SumForkJoin right = new SumForkJoin(array, mid, high);
        left.fork();
        long rightAns = right.compute();
        long leftAns = left.join();
        return leftAns + rightAns;
    }
}

Key takeaways:

  • Add a sequential threshold.
  • The library needs to “warm up”, so you may have to run computations in a loop.

Practical Considerations

While the ForkJoin Framework is powerful, there are some practical considerations:

  • Sequential Threshold: As with our earlier examples, using a sequential cutoff is essential for good performance. The ForkJoin Framework documentation recommends tasks that perform roughly 100-5000 basic operations.
  • Warm-up: The Java Virtual Machine (JVM) performs just-in-time (JIT) compilation and optimization, so the first few executions of your parallel code might be slower while the JVM optimizes it. Run your computations in a loop to see the “long-term” performance (a sketch follows this list).
  • Overhead: Even with the ForkJoin Framework, there’s still some overhead associated with task creation and management. For very small problems, the overhead might outweigh the benefits of parallelism.
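
As a rough sketch of how one might observe warm-up, assuming the SumForkJoin task from above (the iteration count and array size are arbitrary):

int[] array = new int[10_000_000];
java.util.Arrays.fill(array, 1);

// Early runs are typically slower while the JIT compiles and optimizes;
// later runs show the steady-state performance.
for (int run = 0; run < 10; run++) {
    long start = System.nanoTime();
    long sum = java.util.concurrent.ForkJoinPool.commonPool()
                   .invoke(new SumForkJoin(array, 0, array.length));
    long ms = (System.nanoTime() - start) / 1_000_000;
    System.out.println("run " + run + ": sum=" + sum + " (" + ms + " ms)");
}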

Beyond Summation: Maps and Reductions

We’ve focused on array summation, but the same divide-and-conquer, fork-join structure applies to a much broader class of problems.

Examples

  • Maximum/Minimum: Find the maximum or minimum element in an array.
  • Element Search: Check if an array contains an element satisfying a given property (e.g., is there a 17?).
  • Leftmost Element: Find the leftmost element satisfying a given property.
  • Counts: Count the number of elements satisfying a given property (e.g., the number of strings starting with a vowel).

These are all examples of reductions.
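
For instance, finding the maximum reuses the sum task’s structure exactly; only the sequential base case and the combining step change. A minimal sketch (class name and threshold are our choices):

import java.util.concurrent.RecursiveTask;

class MaxForkJoin extends RecursiveTask<Integer> {
    static final int SEQUENTIAL_THRESHOLD = 5000;
    final int[] array;
    final int low, high; // range [low, high)

    MaxForkJoin(int[] arr, int lo, int hi) { array = arr; low = lo; high = hi; }

    protected Integer compute() {
        if (high - low <= SEQUENTIAL_THRESHOLD) {
            int max = Integer.MIN_VALUE;
            for (int i = low; i < high; i++)
                max = Math.max(max, array[i]);
            return max;
        }
        int mid = low + (high - low) / 2;
        MaxForkJoin left = new MaxForkJoin(array, low, mid);
        MaxForkJoin right = new MaxForkJoin(array, mid, high);
        left.fork();
        int rightMax = right.compute();
        int leftMax = left.join();
        return Math.max(leftMax, rightMax); // combine with max instead of +
    }
}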

Reductions

  • A reduction takes a collection of data and produces a single answer.
  • The combining operation must be associative. (e.g., (a + b) + c = a + (b + c))
  • Examples: sum, max, min, count, product, and, or.
  • Non-examples: median, subtraction, exponentiation (these are not associative; e.g., (5 - 3) - 2 = 0 but 5 - (3 - 2) = 4).

The results of reductions don’t have to be single numbers. They can be more complex objects, like arrays or custom data structures.
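
As a small illustration, a single reduction can return the minimum and maximum together by combining pairs; a sketch using a tiny helper class of our own (assumes a non-empty range):

import java.util.concurrent.RecursiveTask;

class MinMax {
    final int min, max;
    MinMax(int min, int max) { this.min = min; this.max = max; }
}

class MinMaxForkJoin extends RecursiveTask<MinMax> {
    static final int SEQUENTIAL_THRESHOLD = 5000;
    final int[] array;
    final int low, high;

    MinMaxForkJoin(int[] arr, int lo, int hi) { array = arr; low = lo; high = hi; }

    protected MinMax compute() {
        if (high - low <= SEQUENTIAL_THRESHOLD) {
            int min = array[low], max = array[low];
            for (int i = low + 1; i < high; i++) {
                min = Math.min(min, array[i]);
                max = Math.max(max, array[i]);
            }
            return new MinMax(min, max);
        }
        int mid = low + (high - low) / 2;
        MinMaxForkJoin left = new MinMaxForkJoin(array, low, mid);
        MinMaxForkJoin right = new MinMaxForkJoin(array, mid, high);
        left.fork();
        MinMax r = right.compute();
        MinMax l = left.join();
        // Combine: the result is a pair, not a single number.
        return new MinMax(Math.min(l.min, r.min), Math.max(l.max, r.max));
    }
}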

Maps

A map operation applies a function to each element of a collection independently, producing a new collection of the same size. There’s no combining of results. Maps are embarrassingly parallel.

Example:

int[] vector_add(int[] arr1, int[] arr2) {
    assert (arr1.length == arr2.length);
    int[] result = new int[arr1.length];
    // Each iteration is independent of all the others -- a map operation.
    // Sequentially:
    for (int i = 0; i < arr1.length; i++) {
        result[i] = arr1[i] + arr2[i];
    }
    // Because the iterations are independent, the loop can also run as a
    // FORALL, i.e., in parallel with a simple compute()/fork() task
    // (see the sketch below).
    return result;
}
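
A minimal ForkJoin sketch of this map (class name and threshold are our choices): since a map produces no combined result, we extend RecursiveAction, the void counterpart of RecursiveTask.

import java.util.concurrent.RecursiveAction;

class VectorAddAction extends RecursiveAction {
    static final int SEQUENTIAL_THRESHOLD = 5000;
    final int[] arr1, arr2, result;
    final int low, high;

    VectorAddAction(int[] a, int[] b, int[] res, int lo, int hi) {
        arr1 = a; arr2 = b; result = res; low = lo; high = hi;
    }

    protected void compute() {
        if (high - low <= SEQUENTIAL_THRESHOLD) {
            for (int i = low; i < high; i++)
                result[i] = arr1[i] + arr2[i]; // each task writes its own slice
        } else {
            int mid = low + (high - low) / 2;
            VectorAddAction left = new VectorAddAction(arr1, arr2, result, low, mid);
            VectorAddAction right = new VectorAddAction(arr1, arr2, result, mid, high);
            left.fork();
            right.compute();
            left.join(); // no value to retrieve; join() just waits
        }
    }
}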

The slides illustrate multi-dimensional maps on images: blurring and edge detection can both be expressed as 2D mappings over the pixels.

Maps and Reductions: The Workhorses of Parallel Programming

Maps and reductions are fundamental patterns in parallel programming. Many parallel algorithms can be expressed in terms of these two operations. Learning to recognize when a problem can be solved with maps and reductions is a key skill for parallel programming.

MapReduce (Digression)

You might have heard of Google’s MapReduce (or the open-source version, Hadoop). These systems are designed for large-scale data processing on clusters of machines. They use the same basic map and reduce concepts, but they handle the complexities of distributing data and managing fault tolerance across many machines.

Linked Lists vs. Arrays/Trees

Data structures matter! While maps and reductions can be applied to linked lists, they’re much less efficient than on arrays or balanced trees.

  • Arrays and Balanced Trees: Allow constant-time (arrays) or logarithmic-time (balanced trees) access to any element. This enables efficient divide-and-conquer: we can split the data in half without walking through it.
  • Linked Lists: Require linear time to reach an arbitrary element, making divide-and-conquer far less effective.

Note that even when work units must run one after another (as with successive frames in a video game or steps of a physics simulation), each work unit can itself be subdivided into parallel tasks…

Continue here: 11 Parallel Algorithms (Part II)