Lecture from: 19.12.2024 | Video: Videos ETHZ

Finding the Median

The central problem addressed in this section is the efficient computation of the median. The median of a dataset is defined as the middle element when the data is sorted. For a dataset with an even number of elements, the median is typically calculated as the average of the two central elements. It is crucial to distinguish the median from the mean (average); they are distinct statistical measures.
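To make the definition concrete, here is a minimal Python sketch (the function name is ours) that computes the median by fully sorting the input; the sort already costs $O(n \log n)$, which motivates the search for something faster below:

```python
def median_by_sorting(a):
    """Median via a full sort: O(n log n) time."""
    s = sorted(a)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                    # middle element for odd n
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of the two central elements

print(median_by_sorting([7, 1, 3, 9, 5]))  # 5
print(median_by_sorting([7, 1, 3, 9]))     # (3 + 7) / 2 = 5.0
```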

Relevance to Quicksort

The median holds significant importance in statistics, primarily because it provides a robust measure of central tendency. Unlike the mean, which can be significantly skewed by outliers, the median remains relatively stable.

Furthermore, the median plays a crucial role in the efficiency of the Quicksort algorithm, which was previously discussed (See Quicksort). Quicksort relies on selecting a pivot element, ideally the median, to partition the data. When the median is chosen as the pivot, Quicksort achieves its optimal runtime complexity of $O(n \log n)$.

Recall that Quicksort’s worst-case time complexity is $O(n^2)$. This occurs when the pivot selection consistently results in highly unbalanced partitions (e.g., repeatedly choosing the smallest or largest element). However, if the pivot is chosen randomly, the average runtime complexity is $O(n \log n)$.

The best-case scenario for Quicksort arises when the exact median is selected as the pivot, leading to perfectly balanced partitions and a runtime of $O(n \log n)$. Further analysis, to be covered in a subsequent course (AnW), will demonstrate that selecting a pivot from a range of values near the median can still maintain an average-case time complexity of $O(n \log n)$.

The Need for a Fast Median Algorithm

To guarantee that Quicksort operates with its optimal time complexity, we require an efficient method for finding the median. Ideally, this method should have linear time complexity, $O(n)$. This requirement stems from the recurrence relation analysis of Quicksort, which indicates that even a median subroutine with complexity $O(n \log n)$ would be insufficient to achieve the desired overall performance.

Let’s examine several potential approaches and their corresponding time complexities:

Idea 1: Pairwise Comparisons

A naive approach might involve comparing each element to all others, counting how many elements are larger or smaller. This method inherently has a time complexity of $O(n^2)$, since each element must be compared to every other element.
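A sketch of this idea (our illustration, assuming distinct elements): for each candidate, count how many elements are smaller, and return the candidate of rank $k$, here the median.

```python
def select_by_comparisons(a, k):
    """Return the k-th smallest element (1-indexed) by pairwise comparisons.

    For each candidate x we count the elements smaller than x, so the
    total work is O(n^2) comparisons. Assumes distinct elements.
    """
    for x in a:
        if sum(1 for y in a if y < x) == k - 1:
            return x

print(select_by_comparisons([7, 1, 3, 9, 5], 3))  # 5 (the median)
```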

Idea 2: Balanced Search Tree

Another possibility is to construct a balanced search tree, such as an AVL tree. Once the tree is built, the median can be retrieved easily (e.g., by descending from the root using stored subtree sizes). However, the construction of a balanced search tree itself requires $O(n \log n)$ time, due to the necessary balancing operations. This approach does not satisfy the linear time requirement.

Idea 3: Sorting

A straightforward approach would be to sort the entire array, after which the median can be directly accessed. While conceptually simple, this method defeats the purpose, as comparison-based sorting requires $O(n \log n)$ time, exceeding our desired linear complexity.

Idea 4: Pseudopolynomial Approach (Bucketsort Variation)

A more promising direction involves adapting a pseudopolynomial algorithm, similar to Bucketsort. Instead of sorting based on keys, elements would be distributed into buckets based on their values. The efficiency of this approach would be heavily dependent on the range of values present in the array. While potentially offering near-linear performance under specific conditions, this method would inherently be pseudopolynomial, meaning its runtime depends on the magnitude of the values, not just the number of elements.
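A sketch of the idea for small non-negative integer keys (assumptions ours): tally how often each value occurs, then scan the tallies until rank $k$ is reached. The runtime $O(n + U)$ depends on the value range $U$, not just on $n$, which is exactly what “pseudopolynomial” means here.

```python
def select_by_buckets(a, k):
    """Return the k-th smallest (1-indexed) of non-negative ints via value buckets.

    Runs in O(n + U) time and space for U = max(a) + 1: pseudopolynomial,
    i.e., fast only when the values themselves are small.
    """
    counts = [0] * (max(a) + 1)
    for x in a:
        counts[x] += 1
    seen = 0
    for value, c in enumerate(counts):
        seen += c
        if seen >= k:
            return value

print(select_by_buckets([7, 1, 3, 9, 5], 3))  # 5
```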

Quickselect

We generalize the median problem to finding the $k$-th smallest element in an unsorted array, for any given $k$. This is the selection problem.

Formal Problem Definition

Given:

  • An unsorted array $A$ of $n$ elements.
  • An integer $k$ ($1 \le k \le n$).

Find:

  • The $k$-th smallest element in $A$ (the element at index $k$ in a sorted copy of $A$).

Note: The median is a special case. For odd $n$, $k = \frac{n+1}{2}$. For even $n$, it involves averaging the elements at $k = \frac{n}{2}$ and $k = \frac{n}{2} + 1$.

Quickselect Algorithm

Quickselect, inspired by Quicksort, solves the selection problem efficiently using recursion and partitioning:

  1. Pivot Selection: Choose a pivot element $p$ from $A$. (Pivot selection strategies will be discussed later).

  2. Partitioning: Rearrange $A$ around $p$ such that elements less than $p$ are placed before it, and elements greater than $p$ are after it. Let $r$ be the (1-based) index of $p$ after partitioning. Formally:

    • $A[i] \le p$ for all $i < r$
    • $A[i] \ge p$ for all $i > r$

    The partition subroutine thus determines the correct sorted position, $r$, of $p$.

  3. Recursive Calls: Based on $r$ and $k$:

    • Case 1 ($r = k$): Return $p$ (it’s the $k$-th smallest).
    • Case 2 ($r < k$): Recursively call $\text{Quickselect}(A[r+1..n],\, k - r)$ (search the right subarray, adjusting the rank).
    • Case 3 ($r > k$): Recursively call $\text{Quickselect}(A[1..r-1],\, k)$ (search the left subarray).

This “divide and conquer” approach reduces the problem size at each step. Partitioning eliminates a portion of the array, focusing only on the subarray containing the $k$-th smallest element. Recursion continues until $r = k$, at which point the element is found.
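A minimal Python sketch of Quickselect (our illustration; it simply takes the last element as pivot, so it has the $O(n^2)$ worst case discussed next):

```python
def quickselect(a, k):
    """Return the k-th smallest element (1-indexed) of a.

    Partitions around the last element as pivot, then recurses into the
    side that contains rank k, adjusting the rank for the right side.
    """
    pivot = a[-1]
    smaller = [x for x in a[:-1] if x < pivot]
    larger = [x for x in a[:-1] if x >= pivot]
    r = len(smaller) + 1                    # rank of the pivot within a
    if k == r:
        return pivot
    if k > r:
        return quickselect(larger, k - r)   # Case 2: search right, adjust rank
    return quickselect(smaller, k)          # Case 3: search left

print(quickselect([7, 1, 3, 9, 5], 3))  # 5
```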

Pivot Selection

The choice of pivot significantly impacts Quickselect’s performance. A consistently poor pivot choice (e.g., always the smallest or largest element) leads to a worst-case runtime of $O(n^2)$, similar to Quicksort’s worst case. However, as with Quicksort, a “good enough” pivot, even if not the exact median, can still ensure efficiency.

Formalization

To formalize this, consider an array of size $n$. Let $\varepsilon \in (0, \frac{1}{2})$ be a fraction representing the proportion of elements at either end of the sorted array that we want to avoid selecting as the pivot. A “good” pivot, then, is one whose rank falls within the central portion of the sorted array, i.e., in the range $[\varepsilon n, (1 - \varepsilon) n]$. The worst-case size of a subproblem produced by a pivot in this “good” range is then at most $(1 - \varepsilon) n$. Let $\alpha := 1 - \varepsilon < 1$.

We can express the recurrence relation for the runtime, $T(n)$, when consistently choosing a “good” pivot as:

$$T(n) \le T(\alpha n) + c \cdot n$$

where $c > 0$ is a constant representing the time taken for partitioning and other operations. Telescoping this recurrence (repeatedly substituting):

$$T(n) \le c n + c \alpha n + c \alpha^2 n + \dots = c n \sum_{i=0}^{\infty} \alpha^i$$

This is a geometric series. Since $\alpha < 1$, the series converges:

$$\sum_{i=0}^{\infty} \alpha^i = \frac{1}{1 - \alpha}$$

Since $c$ and $\alpha$ are constants (independent of $n$), we have:

$$T(n) \le \frac{c}{1 - \alpha} \cdot n = O(n)$$
Therefore, even with a pivot that isn’t precisely the median, but consistently falls within a defined central range of the sorted array, Quickselect achieves a linear runtime. This is because the problem size is reduced by a constant factor in each recursive step.

Choosing the Pivot (Median of Medians)

The remaining challenge is to efficiently find a “good” pivot deterministically. The algorithm described here is due to Blum, Floyd, Pratt, Rivest, and Tarjan. It modifies the pivot selection step (step 1) of the standard Quickselect algorithm.

We modify the algorithm with a new Step 1 (Pivot Selection):

  1. Median of Medians Pivot Selection:
    • a. Group into Fives: Divide the input array $A$ into $\lceil n/5 \rceil$ groups of five elements each. If the number of elements, $n$, is not divisible by 5, the last group may have fewer than five elements. This is acceptable.
    • b. Find Group Medians: For each group of five (or fewer) elements, find its median. This can be done efficiently using a constant number of comparisons since each group has a fixed, small size (at most 5). Store these medians in a new array, $M$.
    • c. Recursive Call: Recursively call Quickselect on the array of medians $M$ to find the median of $M$. Specifically, call $\text{Quickselect}(M, \lceil |M|/2 \rceil)$. The result of this recursive call is the pivot, $p$, that will be used in the main Quickselect algorithm’s partitioning step.

The rest of the Quickselect algorithm (partitioning and recursive calls based on the pivot’s position) remains unchanged. This median-of-medians approach provides a deterministic way to select a pivot that guarantees a good split, leading to a worst-case linear time complexity for the overall selection problem.
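Putting the pieces together, here is a compact Python sketch of the deterministic algorithm (our illustration; it uses a three-way partition to handle duplicates):

```python
def mom_select(a, k):
    """Return the k-th smallest element (1-indexed) via median of medians."""
    if len(a) <= 5:
        return sorted(a)[k - 1]
    # Steps 1a/1b: split into groups of five and take each group's median.
    groups = [a[i:i + 5] for i in range(0, len(a), 5)]
    medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]
    # Step 1c: recursively find the median of the medians; this is the pivot.
    p = mom_select(medians, (len(medians) + 1) // 2)
    # Steps 2-3: partition around p and recurse into the relevant part.
    smaller = [x for x in a if x < p]
    equal = [x for x in a if x == p]
    larger = [x for x in a if x > p]
    if k <= len(smaller):
        return mom_select(smaller, k)
    if k <= len(smaller) + len(equal):
        return p
    return mom_select(larger, k - len(smaller) - len(equal))

print(mom_select(list(range(100, 0, -1)), 50))  # 50
```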

How good is $p$?

Let $p$ be the pivot chosen by the median-of-medians algorithm. We conjecture that $p$ is a “good” pivot in the sense defined earlier, meaning it’s neither too close to the smallest nor to the largest element in the sorted array.

Let $M$ be the array of medians found in step 1b. Since $p$ is the median of $M$, at least half of the elements in $M$ are less than or equal to $p$. $|M| = \lceil n/5 \rceil$, so at least $\lceil n/10 \rceil$ elements in $M$ are less than or equal to $p$.

Each element in $M$ that is less than or equal to $p$ is itself the median of a group of five elements in the original array $A$. Therefore, for each such median, there are at least three elements in its group (the median itself and the two elements smaller than it) that are less than or equal to $p$. The exception is the possible last group with fewer than 5 elements; however, its size is bounded by a constant and does not significantly affect the overall analysis as $n$ grows.

Thus, there are at least $3 \cdot \lceil n/10 \rceil \ge \frac{3n}{10}$ elements in $A$ that are less than or equal to $p$. By a similar argument, there are at least $\frac{3n}{10}$ elements in $A$ that are greater than or equal to $p$. This implies that the largest subproblem size after partitioning is at most $\frac{7n}{10}$. So, we have $\varepsilon = \frac{3}{10}$ and $\alpha = \frac{7}{10}$.

Runtime

Let’s derive the runtime complexity, $T(n)$, of Quickselect with the median-of-medians pivot selection.

The recurrence relation is:

$$T(n) \le T\!\left(\frac{n}{5}\right) + T\!\left(\frac{7n}{10}\right) + c \cdot n$$

This is because:

  • $T\left(\frac{n}{5}\right)$: The recursive call to find the median of medians (finding $p$) operates on an array of size at most $\lceil n/5 \rceil$ (the array $M$).
  • $T\left(\frac{7n}{10}\right)$: The recursive call on the larger partition after partitioning around $p$ operates on an array of size at most $\frac{7n}{10}$ (as shown above).
  • $c \cdot n$: The time to divide $A$ into groups of five, find the median of each group, and partition around $p$ is linear in $n$.

We conjecture that $T(n) \le C \cdot n$ for some constant $C > 0$, i.e., $T(n) = O(n)$. We will prove this using strong induction.

Base Case: For small $n$ (say $n \le 5$), $T(n)$ is bounded by a constant, so $T(n) \le C \cdot n$ holds for $C$ chosen large enough. The base case holds.

Inductive Step: Assume that $T(m) \le C \cdot m$ for all $m < n$. We want to show that $T(n) \le C \cdot n$.

Using the recurrence relation and the inductive hypothesis:

$$T(n) \le T\!\left(\frac{n}{5}\right) + T\!\left(\frac{7n}{10}\right) + c n \le C \cdot \frac{n}{5} + C \cdot \frac{7n}{10} + c n = \left(\frac{9}{10} C + c\right) n$$

Choosing $C \ge 10c$ gives $\left(\frac{9}{10} C + c\right) n \le C \cdot n$. Thus, $T(n) \le C \cdot n$, completing the inductive step.

Therefore, by strong induction, $T(n) \le C \cdot n$ for all $n$, which means $T(n) = O(n)$. The median-of-medians Quickselect algorithm has a worst-case linear time complexity.

Why Groups of Five?

A natural question is why we chose groups of five in the median-of-medians algorithm. The choice is crucial for achieving the linear time complexity. The key lies in the coefficients of the recursive terms in the recurrence relation, specifically that their sum must be less than 1.

In the analysis with groups of five, we had:

$$T(n) \le T\!\left(\frac{n}{5}\right) + T\!\left(\frac{7n}{10}\right) + c \cdot n$$

The crucial part is that $\frac{1}{5} + \frac{7}{10} = \frac{9}{10} < 1$. This sum being less than 1 is essential for the overall runtime to be $O(n)$. If the sum were equal to or greater than 1, telescoping would yield an extra logarithmic factor or worse.

Let’s examine what happens if we use groups of three instead:

  1. Groups of Three: We divide the array into groups of three.

  2. Medians Array: The array of medians $M$ will have size $\lceil n/3 \rceil$.

  3. Elements Less Than/Greater Than $p$: The median of medians, $p$, will have at least $\lceil n/6 \rceil$ medians in $M$ less than or equal to it. Each of these medians accounts for two elements in $A$ (itself and one smaller element) that are less than or equal to $p$. Thus, there are at least $2 \cdot \lceil n/6 \rceil \ge \frac{n}{3}$ elements less than or equal to $p$. Similarly, there are at least $\frac{n}{3}$ elements greater than or equal to $p$.

  4. Largest Subproblem: The size of the largest partition passed into the recursion is at most $\frac{2n}{3}$. So $\alpha = \frac{2}{3}$.

  5. Recurrence: This leads to the recurrence relation:

$$T(n) \le T\!\left(\frac{n}{3}\right) + T\!\left(\frac{2n}{3}\right) + c \cdot n$$

Here, $\frac{1}{3} + \frac{2}{3} = 1$. When we telescope this recurrence, the per-level costs do not decrease geometrically. Instead, we get a sum of the form $cn + cn + \dots + cn$ over approximately $\log n$ levels of recursion. This results in a runtime complexity of $\Theta(n \log n)$, not $O(n)$.
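To see the difference concretely, here is a small numerical sketch (ours, with $c = 1$) that evaluates both recurrences and prints $T(n)/n$: the ratio stays bounded for groups of five but keeps growing, roughly like $\log n$, for groups of three.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def t5(n):
    """Groups of five: T(n) = T(n/5) + T(7n/10) + n."""
    if n <= 5:
        return n
    return t5(n // 5) + t5(7 * n // 10) + n

@lru_cache(maxsize=None)
def t3(n):
    """Groups of three: T(n) = T(n/3) + T(2n/3) + n."""
    if n <= 3:
        return n
    return t3(n // 3) + t3(2 * n // 3) + n

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(t5(n) / n, 2), round(t3(n) / n, 2))
# t5(n)/n stays below the constant 10 = 1/(1 - 9/10); t3(n)/n keeps growing.
```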

Using groups of seven (or any larger odd constant) would also work: the sum of the coefficients in the recurrence remains less than 1. For example, with groups of seven you’d get a recurrence like $T(n) \le T\left(\frac{n}{7}\right) + T\left(\frac{5n}{7}\right) + c \cdot n$, and $\frac{1}{7} + \frac{5}{7} = \frac{6}{7} < 1$.

However, using a group size of $n$ (the entire array) would be equivalent to directly sorting the array to find the median, which defeats the purpose and leads to an $O(n \log n)$ algorithm. The efficiency of the median-of-medians approach comes from finding an approximate median quickly (in linear time) and using that to partition the array. Finding the exact median of the entire array upfront is too expensive. The constant number of operations for finding the median of a constant-size group is key to maintaining linear overall performance. The choice of 5 is the smallest odd group size that guarantees the desired linear time complexity.

This was the last lecture…