Lecture from: 09.04.2024 | Video: Video ETHZ

Welcome to the exam preparation session for Parallel Programming. Julianne Orel and Finn Heckman will guide you through key topics and example problems to help you prepare for the upcoming exam.

Outline

Here’s a rundown of the topics we’ll cover today:

  • Speedup, Amdahl’s Law, Gustafson’s Law
  • Pipelining Concepts
  • Task Graph Analysis
  • (Break)
  • Threads & Wait/Notify Synchronization
  • Fork/Join Framework

Let’s dive into the first topic.

Speedup, Amdahl, Gustafson

This section focuses on calculating and interpreting performance improvements in parallel systems using fundamental laws.

Problem 1: Gustafson’s Law

Question: Given a program where 20% of the work is non-parallelizable. What is the speedup according to Gustafson’s law on a computer with 11 processors? Show all calculation steps.

Understanding the Concepts:

  • Gustafson’s Law: Focuses on scaled speedup. It assumes that the total execution time is kept constant, and the problem size is allowed to increase with the number of processors. The parallel part of the work scales with the number of processors.
  • f: Fraction of time spent on the non-parallelizable (serial) part in the parallel execution.
  • p: Number of processors.

Formulas & Variables:

  • Speedup Formula (General): $S_p = \frac{T_1}{T_p}$
  • Amdahl’s Law: $S_p = \frac{1}{f + \frac{1-f}{p}}$ (where f is the serial fraction of the original sequential work)
  • Gustafson’s Law: $S_p = f + p \cdot (1 - f)$ (where f is the serial fraction of the parallel execution time); a small code sketch applying both laws follows this list
  • Variables:
    • f: fraction of non-parallelizable work (in the context of the specific law).
    • p: number of processors.
    • T_p: execution time on p processors.
    • T_1: execution time on 1 processor.
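
To make the two laws concrete, here is a small illustrative Java sketch (our own, not from the session); the class and method names are assumptions, and the numbers in main correspond to Problems 1 and 2 below.

```java
// Illustrative helper (not from the lecture): evaluates both speedup laws.
// f = serial fraction (its meaning differs between the two laws!), p = processors.
public class SpeedupLaws {

    // Amdahl: fixed total work; f is the serial fraction of the sequential work.
    static double amdahlSpeedup(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    // Gustafson: fixed execution time, scaled work; f is the serial fraction
    // of the parallel execution time.
    static double gustafsonSpeedup(double f, int p) {
        return f + p * (1.0 - f);
    }

    public static void main(String[] args) {
        System.out.println(gustafsonSpeedup(0.2, 11)); // Problem 1: prints 9.0
        System.out.println(amdahlSpeedup(0.25, 3));    // Problem 2: prints 2.0 for p = 3
    }
}
```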

Solution Steps:

  1. Identify variables:
    • Fraction of non-parallelizable work, $f = 0.2$.
    • Number of processors, $p = 11$.
  2. Apply Gustafson’s Law formula:
    • $S_{11} = f + p \cdot (1 - f) = 0.2 + 11 \cdot 0.8 = 0.2 + 8.8 = 9$

Answer: According to Gustafson’s law, the speedup with 11 processors is 9.

Problem 2: Amdahl’s Law

Question: You are running a program where 75% of the work can be parallelized on an unknown machine. When parallelism is disabled (i.e., running the program sequentially), the runtime is 4 hours. When parallelization is enabled, the runtime is only 2 hours. According to Amdahl’s law, at least how many processors does the machine have? Show all calculation steps.

Understanding the Concepts:

  • Amdahl’s Law: Focuses on fixed problem size. It calculates the theoretical maximum speedup achievable by parallelizing a fixed amount of work, given a certain fraction that must remain serial.
  • f: Fraction of the total sequential work that is non-parallelizable.

Solution Steps:

  1. Identify variables:
    • Parallelizable work = 75%, so non-parallelizable work fraction, $f = 0.25$.
    • Sequential runtime, $T_1 = 4$ hours.
    • Parallel runtime, $T_p = 2$ hours.
    • Number of processors, $p$ (unknown, to be determined).
  2. Calculate actual speedup:
    • $S_p = \frac{T_1}{T_p} = \frac{4}{2} = 2$.
  3. Apply Amdahl’s Law formula:
    • We know the actual speedup is 2, and this must be less than or equal to the theoretical maximum given by Amdahl’s law: $2 \le \frac{1}{f + \frac{1-f}{p}} = \frac{1}{0.25 + \frac{0.75}{p}}$.
    • Multiply both sides by the denominator: $2 \cdot \left(0.25 + \frac{0.75}{p}\right) \le 1 \;\Rightarrow\; 0.5 + \frac{1.5}{p} \le 1 \;\Rightarrow\; \frac{1.5}{p} \le 0.5$.
    • Multiply by $p$ (assuming $p > 0$): $1.5 \le 0.5\,p \;\Rightarrow\; p \ge 3$.

Answer: According to Amdahl’s law, the machine must have at least 3 processors.

Problem 3: Interpreting Speedup Graphs (Amdahl)

Question: Consider the plots in Figure 1a-c. The solid line represents an ideal maximum speedup according to Amdahl’s law, while the dashed line represents the actual speedups measured in experiments. Not all plots are correct.

Sub-Question: Which of the three plots (a), (b), or (c) corresponds to a case where the fraction of non-parallelizable work is $f = \frac{1}{8}$? Justify your answer.

Understanding the Concepts:

  • Amdahl’s law predicts that speedup is limited by the serial fraction ($f$). As the number of processors ($p$) approaches infinity, the maximum speedup approaches $\frac{1}{f}$.
  • The ideal Amdahl curve should approach a horizontal asymptote at $S = \frac{1}{f}$.
  • Experimental results (dashed line) typically fall below the ideal curve due to overheads but should generally follow the same trend and also be limited by the asymptote.

Solution Steps:

  1. Calculate the theoretical limit: For $f = \frac{1}{8}$, the maximum speedup is $\frac{1}{f} = 8$.
  2. Analyze the graphs:
    • Graph (a): Shows speedup potentially exceeding 8 and continuing to increase significantly, suggesting $f$ is very small or zero. The experimental curve diverges upwards, which is generally unrealistic (super-linear speedup is rare).
    • Graph (b): Shows speedup approaching an asymptote well above 8, possibly indicating a smaller $f$.
    • Graph (c): Shows the ideal speedup curve (solid blue) clearly leveling off towards a horizontal asymptote around a value of 8. The experimental curve (dashed red) follows this trend, staying below the ideal curve and also leveling off near 8.
  3. Apply Amdahl’s formula limit:
    • The horizontal asymptote is $\frac{1}{f}$.
    • For $f = \frac{1}{8}$, the asymptote is $\frac{1}{1/8} = 8$.

Answer: Graph (c) corresponds to the case where $f = \frac{1}{8}$, because the ideal Amdahl curve (solid blue line) has a horizontal asymptote at a speedup value of $\frac{1}{f} = 8$.

Sub-Question: Analogously to the previous question, which of the three plots (a), (b), or (c) corresponds to a case where $f = 0$? Justify your answer.

Solution Steps:

  1. Apply Amdahl’s formula for f=0:
    • $S_p = \frac{1}{0 + \frac{1 - 0}{p}} = p$.
  2. Interpret the result: When $f = 0$, the ideal speedup according to Amdahl’s law is $S_p = p$. This represents perfect linear speedup. The graph should be a straight line with a slope of 1, passing through the origin (or (1,1) since we start with 1 processor).
  3. Analyze the graphs:
    • Graph (a): Shows a curve steeper than linear, unrealistic.
    • Graph (b): The ideal curve (solid blue) is a straight line representing linear speedup ($S_p = p$). The experimental curve falls below, which is expected.
    • Graph (c): Shows speedup leveling off, representing $f > 0$.

Answer: Graph (b) corresponds to the case where $f = 0$, because the ideal Amdahl curve (solid blue line) shows perfect linear speedup ($S_p = p$).

Problem 4: Plotting Gustafson’s Law

Question: Plot the ideal speedup according to Gustafson’s law (in dependence of the number of processors) for $f = \frac{1}{2}$ in the following figure.

Understanding the Concepts:

  • Gustafson’s Law: $S_p = f + p \cdot (1 - f)$.
  • This describes a linear relationship between $S_p$ and $p$.

Solution Steps:

  1. Substitute $f = \frac{1}{2}$ into the formula:
    • $S_p = \frac{1}{2} + p \cdot \frac{1}{2} = \frac{1 + p}{2}$
  2. Calculate points:
    • If $p = 1$, $S_1 = 1$.
    • If $p = 2$, $S_2 = 1.5$.
    • If $p = 4$, $S_4 = 2.5$.
    • If $p = 8$, $S_8 = 4.5$.
  3. Plot the line: Draw a straight line passing through points like (1, 1), (2, 1.5), (4, 2.5), (8, 4.5).

Problem 5: Comparing Machines (Amdahl + Frequency)

Question: Assume we have a program consisting of 50% non-parallelizable work, and the total amount of work is fixed. We have a given fixed budget to buy one of the following two machines:

  • A: 2 processors, frequency 3 GHz.
  • B: 5 processors, frequency 2 GHz. Assuming the current machine has 1 processor running at frequency 1 GHz.
  • compute the speedup of executing the program on machine A.
  • compute the speedup of executing the program on machine B.
  • Which machine will run the program faster? Justify your answer and show all calculation steps.

Understanding the Concepts:

  • Amdahl’s Law: Applies because the work is fixed.
  • Frequency: Affects the absolute execution speed. A higher frequency generally means faster execution for the same amount of work per processor.
  • Combined Effect: We need to consider both the parallel speedup ($S_p$) and the clock frequency to determine the overall performance improvement relative to the baseline machine. The overall speedup factor is $S_p \times \frac{\text{new frequency}}{\text{baseline frequency}}$.

Solution Approach:

  1. Use Amdahl’s Law: The problem states the work is fixed, implying Amdahl’s law is appropriate for calculating the parallel speedup component.
  2. Calculate Parallel Speedups ($S_p$) for A and B:
    • Serial fraction $f = 0.5$.
    • Machine A: $p = 2$.
      • $S_A = \frac{1}{0.5 + \frac{0.5}{2}} = \frac{1}{0.75} = \frac{4}{3}$.
    • Machine B: $p = 5$.
      • $S_B = \frac{1}{0.5 + \frac{0.5}{5}} = \frac{1}{0.6} = \frac{5}{3}$.
    • (Don’t forget to simplify fractions!)
  3. Compare Overall Performance: Multiply the parallel speedup by the frequency ratio relative to the baseline (1 GHz).
    • Machine A Performance Factor: $\frac{4}{3} \times \frac{3\,\text{GHz}}{1\,\text{GHz}} = 4$.
    • Machine B Performance Factor: $\frac{5}{3} \times \frac{2\,\text{GHz}}{1\,\text{GHz}} = \frac{10}{3} \approx 3.33$.
  4. Compare: $4 > \frac{10}{3}$, so Machine A delivers the greater overall improvement.

Answer:

  • The speedup on machine A is 4/3.
  • The speedup on machine B is 5/3.
  • Machine A will run the program faster. Although Machine B has more processors and thus a slightly higher parallel speedup ($\frac{5}{3}$ vs. $\frac{4}{3}$), Machine A’s higher clock frequency outweighs this advantage, leading to a greater overall performance improvement (Overall Factor A = 4 vs. Overall Factor B = $\frac{10}{3} \approx 3.33$).

Speedup, Amdahl, Gustafson: Tips and Tricks

  • Memorize Formulas: Know the formulas for the general speedup ($S_p = \frac{T_1}{T_p}$), Amdahl, and Gustafson.
  • Identify the Law: Determine which law applies based on the problem statement:
    • Keywords: “fixed work”, “fixed program” → Amdahl.
    • Keywords: “fixed time”, “scaled problem” → Gustafson.
  • Identify Variables: Clearly extract $p$, $f$, $T_1$, $T_p$ from the problem description. Remember what f represents in each law.
  • Calculations: The rest is usually basic algebra. Watch out for fraction simplification.
  • Frequency: If frequency is mentioned, remember to factor it in when comparing absolute performance.

Pipelining

Pipelining breaks a task into sequential stages, allowing multiple tasks to be processed concurrently, with different tasks being in different stages at the same time.

Pipelining: Key Concepts

  • Assumption: Typically assume no duplicated stages unless stated otherwise.
  • Latency: The time it takes for a single input to go through all stages from start to finish.
    • If the pipeline is balanced (all stages take the same time), Latency = Sum of all stage times.
    • If unbalanced, the latency of an individual item can vary depending on when it enters (later items may stall behind the slowest stage). The time from the start until the first item completes is still the sum of the stage times.
  • Throughput: The rate at which inputs complete the pipeline.
    • At Full Utilization: Throughput is limited by the longest stage: $\text{Throughput} = \frac{1}{\text{time of the longest stage}}$. This assumes the pipeline runs long enough to become full.
    • Including Lead-In/Lead-Out: Calculated as $\frac{\text{number of completed items}}{\text{total elapsed time}}$ (see the worked example after this list).
  • Balanced Pipeline: A pipeline where all stages take the same amount of time. (Sometimes defined as latency being constant, which implies the first stage is not faster than subsequent stages, preventing pile-ups).
  • Unbalanced Pipeline: Stages take different amounts of time.
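
As a quick worked illustration (the numbers here are our own, not from an exam task): a 3-stage pipeline with stage times 1 s, 3 s, and 2 s has a latency of 1 + 3 + 2 = 6 s for the first item, and its full-utilization throughput is 1/3 items per second, limited by the 3 s stage. If 10 items are processed, the last one finishes at 6 + 9 · 3 = 33 s, so the throughput including lead-in/lead-out is 10/33 ≈ 0.30 items per second.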

Pipelining: Tips and Tricks

  • Understand Latency vs. Throughput: Latency is about one item’s journey; throughput is about the completion rate.
  • Identify Bottlenecks: The longest stage determines throughput at full utilization.
  • Balanced Definition: Be aware of the definition used (equal stage times vs. constant latency). If unsure, state your assumed definition.
  • Draw Diagrams: Use Gantt charts to visualize the pipeline’s execution over time, especially for calculating total time or throughput including lead-in/out.

Task Graph

Task graphs (specifically Directed Acyclic Graphs - DAGs) model parallel computations where nodes represent tasks (with execution times) and edges represent dependencies (a task cannot start until its predecessors finish).

Problem 1: Critical Path

Question: Mark the critical path of the task graph shown. What is the execution time of the critical path?

Understanding the Concepts:

  • Critical Path: The longest path (in terms of total execution time) from a starting node (no incoming edges) to an ending node (no outgoing edges) in the DAG.
  • Execution Time ($T_\infty$, Span): The length of the critical path represents the minimum time required to execute the entire graph, even with infinite processors, because tasks along this path must be executed sequentially due to dependencies.

Solution Steps:

  1. Identify all paths: Trace all possible paths from the top node (50) to the bottom node (10).
  2. Calculate path lengths: Sum the execution times of the nodes along each path.
    • Path 1: 50 → 40 → 30 → 40 → 10 = 170
    • Path 2: 50 → 40 → 30 → 30 → 10 = 160
    • Path 3: 50 → 40 → 20 → 30 → 10 = 150
    • Path 4: 50 → 20 → 20 → 30 → 10 = 130
    • Path 5: 50 → 20 → 100 → 10 = 180
  3. Identify the longest path: The longest path is 50 → 20 → 100 → 10.

Answer: The critical path is 50 → 20 → 100 → 10 (marked in red). Its execution time ($T_\infty$) is 50 + 20 + 100 + 10 = 180.

Problem 2: Minimum Processors for Fastest Execution

Question: What is the minimum number of processors that are necessary to execute the task graph in Figure 3 as quickly as possible?

Understanding the Concepts:

  • To execute “as quickly as possible” means finishing in time $T_\infty = 180$.
  • The minimum number of processors needed is determined by the maximum width of the DAG – the maximum number of tasks that must run concurrently at any point in time to achieve the $T_\infty$ execution time.

Solution Steps:

  1. Simulate execution: Step through the execution time, keeping track of which tasks are running and which become ready. Assume enough processors are available initially.
    • Time 0: Task 50 starts. (1 processor needed)
    • Time 50: Task 50 finishes. Tasks 40 and 20 become ready. Start both. (2 processors needed)
    • Time 70 (50+20): Task 20 finishes. Tasks 20 (middle) and 100 become ready. Task 40 is still running. Start 20 and 100. (3 processors needed: 40, 20, 100)
    • Time 90 (50+40): Task 40 finishes. Task 30 becomes ready. Tasks 20 and 100 are still running. Start 30. (3 processors needed: 30, 20, 100)
    • Time 90 (70+20): Middle Task 20 finishes. Task 30 (bottom) becomes ready. Tasks 30 (left) and 100 are still running. Start 30 (bottom). (3 processors needed: 30, 100, 30)
    • Time 120 (90+30): Left Task 30 finishes. Task 40 becomes ready. Tasks 100 and 30 (bottom) are still running. Start 40. (3 processors needed: 40, 100, 30)
    • … Continue until the critical path finishes at Time 180. The maximum number of processors simultaneously required at any step determines the minimum needed for fastest execution.
  2. Identify peak concurrency: The maximum number of tasks running concurrently during the simulation (or that need to run concurrently along the critical path and any parallel paths) was 3.

Note: write down the timestamp at which each task finishes while simulating; it makes it much easier to keep track of when new tasks become ready.

Answer: 3 processors are necessary to execute the graph as quickly as possible (in time $T_\infty = 180$).

Problem 3: Maximum Speedup

Question: What is the maximum overall speedup that can be achieved by parallelism when the algorithm is run once, compared to sequential execution?

Understanding the Concepts:

  • Maximum Speedup (Parallelism): The theoretical maximum speedup achievable with infinite processors, calculated as $\frac{T_1}{T_\infty}$.
  • $T_1$ (Work): The time to execute sequentially, which is the sum of the execution times of all nodes in the graph.
  • $T_\infty$ (Span): The length of the critical path (calculated in Problem 1).

Solution Steps:

  1. Calculate $T_1$ (Work): Sum all node weights: $T_1 = 360$.
  2. Recall $T_\infty$ (Span): From Problem 1, $T_\infty = 180$.
  3. Calculate Maximum Speedup: $\frac{T_1}{T_\infty} = \frac{360}{180} = 2$.

Answer: The maximum overall speedup is 2.

Task Graph: Tips and Tricks

  • $T_1$ (Work, total sequential execution time): Sum of all node weights.
  • Span ($T_\infty$): Length of the longest path from start to end. Use brute force (list all paths) if the graph is small, or dynamic programming/critical path algorithms for larger graphs (see the sketch below).
  • Max Speedup (infinite processors): $\frac{T_1}{T_\infty}$.
  • Minimum Processors for $T_\infty$: Find the maximum number of tasks that need to run concurrently at any time slice to achieve the $T_\infty$ execution time. Visualizing with colors or time slices helps.
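
Since brute force quickly becomes tedious, here is a minimal Java sketch (our own, not from the session) of the dynamic-programming approach: it computes $T_1$ and $T_\infty$ for a DAG given as adjacency lists. The node numbering, field names, and the example graph in main are assumptions for illustration only.

```java
import java.util.*;

// Illustrative sketch: computes work (T1) and span (T-infinity) of a task DAG.
// Nodes are 0..n-1 with execution times weight[i]; children.get(i) lists the
// successors of node i. All names and the example graph are our own.
public class TaskGraphAnalysis {

    // T1: total work = sum of all node weights.
    static long work(int[] weight) {
        long sum = 0;
        for (int w : weight) sum += w;
        return sum;
    }

    // T-infinity: weight of the heaviest path, via memoized DFS (valid for DAGs).
    static long span(int[] weight, List<List<Integer>> children) {
        long[] memo = new long[weight.length];
        Arrays.fill(memo, -1);
        long best = 0;
        for (int v = 0; v < weight.length; v++) {
            best = Math.max(best, longestFrom(v, weight, children, memo));
        }
        return best;
    }

    // Heaviest path starting at node v (including v's own weight).
    private static long longestFrom(int v, int[] weight, List<List<Integer>> children, long[] memo) {
        if (memo[v] != -1) return memo[v];
        long rest = 0;
        for (int c : children.get(v)) {
            rest = Math.max(rest, longestFrom(c, weight, children, memo));
        }
        return memo[v] = weight[v] + rest;
    }

    public static void main(String[] args) {
        // Tiny hypothetical diamond graph: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}.
        int[] weight = {5, 2, 10, 1};
        List<List<Integer>> children = List.of(List.of(1, 2), List.of(3), List.of(3), List.of());
        System.out.println("T1   = " + work(weight));           // 18
        System.out.println("Tinf = " + span(weight, children)); // 5 + 10 + 1 = 16
    }
}
```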

Threads & Wait/Notify

This section covers synchronization using Java’s intrinsic locks (synchronized) and the wait(), notify(), notifyAll() mechanism for thread coordination.

General Exam Tips for Wait/Notify

  • Read Carefully: Understand the exact requirements, especially which methods need to be thread-safe and what constitutes “unnecessary synchronization.”
  • Exceptions: Usually, exam questions simplify things by stating you don’t need to handle InterruptedException from wait().
  • Over-Synchronization: Avoid synchronizing methods unnecessarily. If a method doesn’t access shared mutable state or doesn’t need to be atomic relative to other methods, synchronizing it might lose points.
  • Syntax: Focus on the logic. Minor syntax errors (like capitalization) are often forgiven if the intent is clear.

Problem 1: ParallelBlockingQueue

Question: Implement a thread-safe version of a given non-thread-safe Queue class, creating a ParallelBlockingQueue. Fill in the skeleton, ensuring methods block appropriately (e.g., dequeue blocks if empty, enqueue might block if bounded - though this example is unbounded). Use synchronized, wait, notify/notifyAll. Do not introduce unnecessary synchronization or add new fields/methods.

Completed Code (Conceptual):
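
The original skeleton is not reproduced in these notes, so the following is only a minimal sketch of the intended solution under assumed field names (a simple linked list as the underlying queue). Since the queue is unbounded, only dequeue needs to block; enqueue just adds an element and notifies.

```java
// Minimal sketch (the exam skeleton is not reproduced here; field names are
// assumptions). An unbounded blocking queue: enqueue never blocks,
// dequeue blocks while the queue is empty.
public class ParallelBlockingQueue<T> {

    // Simple singly linked list holding the elements (the shared mutable state).
    private static class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private Node<T> head; // oldest element
    private Node<T> tail; // newest element

    public synchronized void enqueue(T value) {
        Node<T> node = new Node<>(value);
        if (tail == null) {
            head = tail = node;
        } else {
            tail.next = node;
            tail = node;
        }
        // The queue is no longer empty: wake up threads blocked in dequeue().
        notifyAll();
    }

    public synchronized T dequeue() throws InterruptedException {
        // Wait in a while loop: re-check the condition after every wakeup.
        while (head == null) {
            wait();
        }
        T value = head.value;
        head = head.next;
        if (head == null) tail = null;
        return value;
    }
}
```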

Problem 2: Post Office Simulation

Question: Model a simplified post office. The PostOffice thread serves customers one by one, displaying ticket numbers. Customer threads wait for their specific ticket number to be displayed before proceeding. Complete the skeletons using synchronized, wait, notifyAll. Do not add new fields/methods or handle exceptions.
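
Again, the exam skeleton is not reproduced here; the sketch below only illustrates the coordination idea, with assumed names such as currentTicket, serveNext, and waitForTurn.

```java
// Minimal sketch (names are assumptions, not the exam skeleton). The PostOffice
// thread calls serveNext() to display the next ticket; each Customer thread
// calls waitForTurn(myTicket) and blocks until its number is displayed.
public class PostOffice {

    private int currentTicket = 0; // ticket number currently displayed

    // Called by the PostOffice thread: display the next ticket and wake all
    // waiting customers so they can re-check whether it is their turn.
    public synchronized void serveNext() {
        currentTicket++;
        notifyAll();
    }

    // Called by a Customer thread: block until the displayed number has
    // reached this customer's ticket.
    public synchronized void waitForTurn(int myTicket) throws InterruptedException {
        while (currentTicket < myTicket) {
            wait();
        }
        // It is now this customer's turn; proceed to be served.
    }
}
```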

Threads & Wait/Notify: Checklist

  • What to synchronize on? Usually, the object containing the shared state that multiple threads need to coordinate access to (e.g., the ParallelBlockingQueue instance, the PostOffice instance). Don’t synchronize on primitives or wrapper types like Integer.
  • wait/notify/notifyAll Location: Must be called inside a block synchronized on the same object you are waiting on or notifying about.
  • notify() vs notifyAll(): Use notifyAll() unless you are certain that only one specific waiting thread needs to be woken and can proceed. notifyAll() is generally safer.
  • Notify Promptly: Call notify/notifyAll as soon as the condition a waiting thread might be waiting for could have changed (e.g., after adding to a queue, after incrementing the counter).
  • Wait in a while Loop: Always check the condition in a while loop before and after waiting (while (!condition) { wait(); }). Never use if. This handles “spurious wakeups” and ensures the condition is truly met when the thread proceeds.
  • Trace Execution: If unsure, manually trace the execution step-by-step, considering different possible interleavings.

Fork/Join Framework

This section revisits the Fork/Join framework, focusing on exam-style problems.

Problem: Character Count

Question: You are given a large string and your task is to count the number of occurrences of a given character using the Fork/Join framework. Complete the skeleton implementation.

Understanding the Concepts:

  • Use RecursiveTask<Integer> since we need to return the count (an integer).
  • Apply divide-and-conquer: split the string range, recursively count in sub-ranges, combine results by adding.
  • Use a sequential cutoff for efficiency.
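
The exam skeleton itself is not reproduced in these notes; the following is a minimal sketch of the approach just described, with an assumed class name (CountTask) and an assumed cutoff value.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal sketch (class name and cutoff are assumptions): counts occurrences
// of `target` in text[lo, hi) using the Fork/Join framework.
public class CountTask extends RecursiveTask<Integer> {

    private static final int CUTOFF = 10_000; // sequential threshold (assumed)

    private final String text;
    private final char target;
    private final int lo, hi;

    public CountTask(String text, char target, int lo, int hi) {
        this.text = text;
        this.target = target;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Integer compute() {
        if (hi - lo <= CUTOFF) {
            // Base case: count sequentially below the cutoff.
            int count = 0;
            for (int i = lo; i < hi; i++) {
                if (text.charAt(i) == target) count++;
            }
            return count;
        }
        // Divide: split the range in half.
        int mid = lo + (hi - lo) / 2;
        CountTask left = new CountTask(text, target, lo, mid);
        CountTask right = new CountTask(text, target, mid, hi);
        left.fork();                       // run the left half asynchronously
        int rightResult = right.compute(); // compute the right half in this thread
        int leftResult = left.join();      // wait for the left half
        // Combine the partial counts.
        return leftResult + rightResult;
    }

    public static void main(String[] args) {
        String s = "abracadabra".repeat(100_000);
        int count = ForkJoinPool.commonPool().invoke(new CountTask(s, 'a', 0, s.length()));
        System.out.println(count); // 500000
    }
}
```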

General Remarks

  • Grading: You often don’t need a perfect score for a top grade. Focus on demonstrating understanding of the core concepts.
  • Read Descriptions Carefully: Task descriptions contain crucial details and constraints.
  • Know Formulas: Memorize key formulas (Speedup, Amdahl, Gustafson, Work/Span bounds).
  • Practice: Solve exercises and old exams to solidify understanding and identify common patterns.

Continue here: 15 Memory Reordering, Java Memory Model, Data Races