Lecture from: 18.03.2024 | Video: Video ETHZ
Digital Faces
This topic explores the fascinating field of creating and animating digital faces, covering three key aspects:
- Geometric Modeling: Representing faces and people geometrically in 3D.
- Animation: Moving these 3D models realistically and expressively, focusing on facial animations.
- Photorealistic Rendering: Representing the models in a way that appears photorealistic.
Early Work: Geometric Animation
The pioneering work in geometric face animation dates back to 1972, with Ed Catmull and Fred Parke’s groundbreaking efforts.
This early work laid the foundation for representing and animating faces using computer graphics.
The Rise of Face Scanners
Around 15 years ago, face scanners emerged, enabling the capture of detailed 3D facial geometry. One example is the scanner built by ETH Zurich in collaboration with Disney Research around 2010. These scanners typically use multiple cameras to reconstruct the 3D shape of a face.
High-Fidelity Animation from Scans
Using the detailed geometry captured by these scanners, researchers and artists could create highly detailed facial animations. The blue image below (circa 2010, Disney Research) shows an example of such detailed geometry, allowing for very nuanced and expressive animations.
Motion Retargeting
Once facial movements can be scanned and modeled, an exciting possibility arises: retargeting that motion onto different faces. This means transferring the expressions and movements of one person onto the 3D model of another.
Real-World Applications: Visual Effects
This technology has found significant applications in the visual effects industry. A prominent example is the creation of Thanos in the Avengers movies, where the actor’s facial performance was captured and applied to the digital character.
Beyond Faces: Detailed Scanning of Features
Research has also focused on capturing the intricate details of specific facial features, such as the eyes. This Disney Research example from 2014 demonstrates the level of detail achievable in scanning the eyes.
Generating Realistic Images and Models
The culmination of these advancements is the ability to create not just realistic animations, but also entirely synthetic, yet photorealistic, images and 3D models of people. This involves both accurate geometric modeling and sophisticated rendering techniques.
Divide and Conquer, Cilk-style Bounds
Recursive Sum with ExecutorService (Attempt)
[include image from slide 53]
Let’s try to implement our recursive array summation using an `ExecutorService`:
```java
public Integer call() throws Exception {
    int size = h - l;
    if (size == 1) {
        return xs[l];
    }
    int mid = size / 2;
    SumRecCall c1 = new SumRecCall(ex, xs, l, l + mid);
    SumRecCall c2 = new SumRecCall(ex, xs, l + mid, h);
    Future<Integer> f1 = ex.submit(c1); // Submit the first subtask
    Future<Integer> f2 = ex.submit(c2); // Submit the second subtask
    return f1.get() + f2.get();         // Get the results and combine them
}
```
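Only `call()` is shown above. For a self-contained picture, here is a minimal sketch of the surrounding class; the field and constructor layout is an assumption inferred from the names used in `call()`:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Sketch only: fields and constructor inferred from the call() body above.
class SumRecCall implements Callable<Integer> {
    private final ExecutorService ex; // pool that subtasks are submitted to
    private final int[] xs;           // array to sum
    private final int l, h;           // half-open index range [l, h)

    SumRecCall(ExecutorService ex, int[] xs, int l, int h) {
        this.ex = ex;
        this.xs = xs;
        this.l = l;
        this.h = h;
    }

    // public Integer call() { ... } (as shown above)
}
```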
- `call()` method: We’re now using `Callable<Integer>` because we need to return a result (the partial sum).
- `ex.submit(c1)` and `ex.submit(c2)`: We submit the two subtasks (`c1` and `c2`) to the `ExecutorService`. The `submit()` method returns `Future` objects.
- `f1.get()` and `f2.get()`: The `get()` method of a `Future` object blocks until the result of the task is available. This is how we retrieve the results of the subtasks.
Problem: This code, while seemingly correct, deadlocks. It never finishes.
Why does this happen?
Each task splits the problem in two and submits two new tasks to the `ExecutorService`. If the thread pool has a fixed number of threads (e.g., 4), the initial tasks quickly fill up the pool. These tasks then block in `get()`, waiting for their subtasks to complete, but the subtasks can never run because all the threads in the pool are occupied by their blocked parents. The parents hold exactly the resources their children need, so the program deadlocks.
Decoupling Work Partitioning and Task Execution
One way to address the deadlock problem is to decouple the work partitioning (dividing the array into chunks) from the task execution.
- Divide the array into chunks: We can divide the array into a larger number of chunks (e.g., one chunk for every 1000 elements).
- Create a task for each chunk: Each task is responsible for summing the elements within its assigned chunk.
- Submit tasks to the `ExecutorService`: The `ExecutorService` will manage the execution of these tasks using its thread pool.
- Combine the results: Once all tasks are finished, we combine their partial sums.
This approach avoids the deadlock because tasks don’t create other tasks. However, it can be tricky to do the initial partitioning and the final combining in parallel.
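To make this concrete, here is a minimal sketch of the chunked approach, assuming a caller-supplied `ExecutorService` and an illustrative chunk size of 1000 (the class and method names are hypothetical, not from the lecture):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class ChunkedSum {
    static long sum(int[] xs, ExecutorService ex) throws Exception {
        final int CHUNK = 1000; // illustrative chunk size
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int start = 0; start < xs.length; start += CHUNK) {
            final int l = start, h = Math.min(start + CHUNK, xs.length);
            tasks.add(() -> {   // a flat task: sums its chunk, never spawns tasks
                long s = 0;
                for (int i = l; i < h; i++)
                    s += xs[i];
                return s;
            });
        }
        long total = 0;
        for (Future<Long> f : ex.invokeAll(tasks)) // blocks until all tasks finish
            total += f.get();                      // sequential combine
        return total;
    }
}
```

Because no task ever submits another task, a fixed-size pool cannot deadlock here; the price is that the partitioning and the final combine run sequentially.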
Cilk-style Task Parallelism: A Better Solution
A more elegant and efficient solution is to use a task-parallel programming model, such as the one pioneered by the Cilk language. This approach provides a higher-level abstraction that simplifies parallel programming and avoids many of the pitfalls we’ve encountered.
Key Concepts:
- Tasks: Units of work that can be executed independently.
- `spawn`: A keyword (or function) that creates a new child task. The parent task can continue executing concurrently with the child task.
- `sync` (or `wait`): A keyword (or function) that makes the current task wait for all its child tasks to complete.
Task Graph:
The execution of a Cilk-style program can be represented as a directed acyclic graph (DAG).
- Nodes: Represent tasks.
- Edges: Represent dependencies between tasks. An edge from task A to task B means that task B was spawned by task A (or that task B depends on the result of task A).
Let’s see how we can use this to compute Fibonacci numbers.
- Sequential Version:

```java
public class Fibonacci {
    public static long fib(int n) {
        if (n < 2)
            return n;
        long x1 = fib(n - 1);
        long x2 = fib(n - 2);
        return x1 + x2;
    }
}
```
- Parallel Version (pseudocode; `spawn` and `wait` are not Java keywords):

```java
public class Fibonacci {
    public static long fib(int n) {
        if (n < 2)
            return n;
        spawn task for fib(n-1);    // child task may run concurrently
        spawn task for fib(n-2);
        wait for tasks to complete; // sync: block until both children finish
        return addition of task results;
    }
}
```
The lecture showed a visual representation of the task graph for `fib(4)`: each call spawns subtasks, which in turn spawn further subtasks.
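Java has no `spawn`/`sync` keywords. As a preview of the ForkJoin Framework introduced below, here is a sketch of how the same structure can be expressed with `RecursiveTask` (the class name `FibTask` is mine, not from the lecture):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: Cilk-style spawn/sync expressed with Java's ForkJoin Framework.
class FibTask extends RecursiveTask<Long> {
    private final int n;

    FibTask(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n < 2)
            return (long) n;
        FibTask f1 = new FibTask(n - 1);
        FibTask f2 = new FibTask(n - 2);
        f1.fork();              // "spawn": schedule f1 asynchronously
        long x2 = f2.compute(); // run f2 in the current thread
        return f1.join() + x2;  // "sync": wait for f1, then combine
    }
}
// Usage: long result = new ForkJoinPool().invoke(new FibTask(30));
```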
Key Properties:
- Tasks can execute in parallel, but they don’t have to. The scheduler decides which tasks to run on which processors.
- The task graph is dynamic. It unfolds as the execution proceeds and tasks spawn other tasks.
- A wider task graph generally indicates more potential parallelism.
Task Parallelism: Performance Model
Let’s formalize a performance model for task-parallel programs:
- Task Graph: As the computation progresses, tasks become available for execution. We can represent this as a DAG.
- Processors: We assume we have $P$ processors available.
- Scheduler: The scheduler is responsible for assigning tasks to processors.
- $T_P$: The execution time of the program on $P$ processors.
- $T_1$ (Work): The total amount of work in the program. This is the sum of the execution times of all the tasks, as if they were executed sequentially (on a single processor).
- Speedup: $T_1 / T_P$.
- $T_\infty$ (Span/Critical Path): The execution time on an infinite number of processors. This is the length of the longest path (the critical path) in the task DAG.
- Parallelism: $T_1 / T_\infty$ is the degree of parallelism; the “wider” the task graph, the better.
Lower Bounds on Execution Time:
- $T_P \ge T_1 / P$: We can’t do better than dividing the total work evenly among the $P$ processors. This is the “work law”.
- $T_P \ge T_\infty$: We can’t do better than the critical path length, even with infinitely many processors. This is the “span law”.
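Combining the two laws gives $T_P \ge \max(T_1/P,\, T_\infty)$. As a worked example (numbers chosen here for illustration, not from the lecture): with $T_1 = 100$, $T_\infty = 10$, and $P = 4$, no schedule can finish in fewer than $\max(100/4,\, 10) = 25$ time units.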
Scheduling Task Graphs
The scheduler is a crucial component of a task-parallel system. It’s an algorithm that assigns tasks to processors.
- $T_P$ (the execution time) depends on the scheduler. A good scheduler will minimize $T_P$.
- $T_1$ and $T_\infty$ are fixed for a given program and input. They represent inherent properties of the task graph.
Work-Stealing Scheduler
A particularly effective scheduling algorithm is the work-stealing scheduler.
- Theorem (Graham, 1968): For any task graph, a greedy scheduler (one that never leaves a processor idle if there’s work to do) achieves an execution time of $T_P \le T_1/P + T_\infty$. This is a strong theoretical bound.
- Cilk: The work-stealing scheduler was first used in the Cilk system developed at MIT.
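Continuing the numbers from above: a greedy scheduler achieves $T_4 \le 100/4 + 10 = 35$, within a factor of 2 of the lower bound of 25. This is no accident: since $T_1/P + T_\infty \le 2\max(T_1/P,\, T_\infty)$, a greedy schedule is always within a factor of 2 of optimal.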
ForkJoin Framework & Task Parallel Algorithms
This lecture introduces a powerful tool for writing parallel programs in Java: the ForkJoin Framework. This framework is specifically designed to support the divide-and-conquer style of parallelism that we’ve been discussing.
Why a New Framework?
We’ve seen that while basic Java threads can be used for parallelism, they have significant limitations, especially when dealing with recursive, divide-and-conquer algorithms:
- Thread Overhead: Creating and managing a large number of Java threads (which often map to OS threads) is expensive.
- Deadlock Potential: Naive use of `submit` in a recursive setting with a fixed-size thread pool can easily lead to deadlock. The tasks exhaust the threads and block each other.
The ForkJoin Framework addresses these issues by providing a more efficient and manageable way to express task parallelism. The key idea is that when a task is waiting (e.g., for its subtasks to complete), it’s suspended, allowing other tasks to run on the available threads.
Motivation: Divide and Conquer, Revisited
Recall our motivating example: summing the elements of an array recursively. We saw that this approach naturally lends itself to parallelism because the subproblems (summing the two halves of the array) can be solved independently.
The ForkJoin Framework: A Specialized Tool
The ForkJoin Framework, introduced in Java 7 (and significantly improved in Java 8), is specifically designed for divide-and-conquer, fork-join parallelism. It’s part of the `java.util.concurrent` package.
- Purpose: To efficiently execute tasks that can be broken down into smaller, independent subtasks.
- Key Features:
- Work-Stealing Scheduler: Provides efficient task scheduling and load balancing.
- Lightweight Tasks: Uses a more lightweight task representation than traditional Java threads.
- Optimized for Recursion: Handles recursive task creation and synchronization effectively.
This lecture will focus on the practical aspects of using the ForkJoin Framework. Similar libraries exist in other languages:
- C/C++: Cilk (the original inspiration), Intel’s Thread Building Blocks (TBB)
- C#: Task Parallel Library (TPL)
The underlying implementation of these libraries is quite sophisticated, involving techniques like work-stealing, but we’ll focus on how to use the framework effectively.
(Figure: how the ForkJoinPool helps distribute and manage threads. Note that ForkJoin uses lightweight tasks multiplexed onto a small pool of worker threads, rather than one OS thread per task.)
Key Concepts and Terminology
The ForkJoin Framework introduces some new terminology, but the underlying concepts are similar to what we’ve seen with basic Java threads:
| Traditional Threads | ForkJoin Framework | Description |
| --- | --- | --- |
| `Thread` | `RecursiveTask<V>` | Represents a task. `RecursiveTask<V>` is used for tasks that return a result of type `V`; `RecursiveAction` is used for tasks that don’t return a result. |
| `run()` | `compute()` | The main method of a task. This is where you put the code that performs the task’s work. `compute()` in `RecursiveTask<V>` returns a value of type `V`. |
| `ans` field (for results) | Return value from `compute()` | Instead of storing results in fields, `RecursiveTask<V>` uses the return value of the `compute()` method. |
| `start()` | `fork()` | Initiates the asynchronous execution of a task. `fork()` adds the task to the work queue of the `ForkJoinPool`, making it eligible to be executed by a worker thread. It doesn’t immediately start the task in a new thread like `Thread.start()`. |
| `join()` | `join()` | Waits for a task to complete and returns its result (if it’s a `RecursiveTask`). This is different from `Thread.join()`, which only waits and doesn’t return a value. |
| `run()` (for hand-tuning) | `compute()` (for hand-tuning) | Instead of calling `run()` directly (which you should almost never do with threads), you can call `compute()` directly to execute the task within the current thread. This is used for optimization (reducing task-creation and scheduling overhead). |
| Create a `Thread` and call `run()` | Create a `ForkJoinPool` and call `invoke()` | The entry point for using the ForkJoin Framework. You create a `ForkJoinPool` and then submit the initial task to it using `invoke()`. `invoke()` blocks until the task (and all its subtasks) are complete, and it returns the final result. |
(Slides: more detail on the individual Fork/Join classes and on the fork/join task procedures.)
Using the ForkJoin Framework: Recursive Sum Example
Let’s rewrite our recursive array summation using the ForkJoin Framework:
```java
import java.util.concurrent.ForkJoinPool;

class Globals {
    static ForkJoinPool fjPool = new ForkJoinPool(); // Create a ForkJoinPool
}

static long sumArray(int[] array) {
    // Create the initial task and submit it to the pool
    return Globals.fjPool.invoke(new SumForkJoin(array, 0, array.length));
}
```
- `ForkJoinPool fjPool = new ForkJoinPool();`: We create a `ForkJoinPool`. The default constructor creates a pool with a number of threads equal to the number of available processors.
- `fjPool.invoke(...)`: We create an instance of our `SumForkJoin` task (which we’ll define next) and submit it to the pool using `invoke()`. `invoke()` blocks until the task is complete and returns the final result.
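These notes break off before `SumForkJoin` is defined. As a stopgap, here is a minimal sketch of what it might look like, assuming a `RecursiveTask<Long>` over a half-open range; the field names and the sequential cutoff of 5000 are illustrative:

```java
import java.util.concurrent.RecursiveTask;

// Sketch of the SumForkJoin task (field names and cutoff are assumptions).
class SumForkJoin extends RecursiveTask<Long> {
    static final int SEQUENTIAL_CUTOFF = 5000; // below this, just loop

    private final int[] arr;
    private final int lo, hi; // half-open index range [lo, hi)

    SumForkJoin(int[] arr, int lo, int hi) {
        this.arr = arr;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= SEQUENTIAL_CUTOFF) { // small range: sum sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++)
                sum += arr[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        SumForkJoin left = new SumForkJoin(arr, lo, mid);
        SumForkJoin right = new SumForkJoin(arr, mid, hi);
        left.fork();                     // schedule the left half asynchronously
        long rightSum = right.compute(); // compute the right half in this thread
        return left.join() + rightSum;   // wait for the left half, then combine
    }
}
```

Note the hand-tuning idiom from the table above: we `fork()` only one subtask and `compute()` the other directly, halving the number of tasks pushed onto the work queue. Unlike the `ExecutorService` version, a task waiting in `join()` does not pin an idle worker thread (waiting tasks are suspended, as described earlier), so the recursive structure cannot deadlock the pool.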
Continue here: 10 Parallel Algorithms (Part I)