Authors comment: As most of you probably realized, NumCS is very chaotic and (imo) badly explained. I’m doing my best to distill the lecture in this note, but it’s very much possible that there are errors. If you find any (or even wanna improve the note), write in the comments or to me :D
An Overview of Numerical Topics
Numerical analysis is the study of algorithms for solving the problems of continuous mathematics. It is not a single, monolithic theory but rather a collection of powerful techniques designed to tackle distinct classes of problems that are often intractable to solve analytically. The expertise of a numerical analyst lies in diagnosing a problem’s structure to choose the most appropriate and efficient method from this toolkit.
Let’s begin with a survey of the landscape.
- Quadrature (Numerical Integration): The fundamental task is to compute the value of a definite integral, $\int_a^b f(x)\,dx$. In practical applications, the domain of integration is rarely a simple one-dimensional interval. It could be a high-dimensional space ($\mathbb{R}^d$ with large $d$), an infinite domain, or a geometrically complex shape, such as those found in computer graphics. The challenge is to devise methods that are both accurate and efficient, as computational cost can grow exponentially with the dimension of the problem.
- Interpolation: Given a set of discrete data points, $(x_i, y_i)$ for $i = 0, \dots, n$, the objective is to construct a function $p(x)$ that passes exactly through these points, satisfying the condition $p(x_i) = y_i$ for all $i$. This is a fundamental tool for creating continuous models from discrete measurements.
- Numerical Linear Algebra: While you have likely studied methods for solving linear systems like $Ax = b$, the numerical perspective focuses on efficiency and stability. For an $n \times n$ matrix, the standard Gaussian elimination algorithm has a computational complexity of $O(n^3)$ operations. For large-scale problems where $n$ can be in the millions, this is computationally infeasible. We must therefore explore more efficient algorithms, whose suitability critically depends on the specific properties (e.g., sparsity, symmetry) of the matrix $A$. This field also covers essential tools like matrix factorizations ($LU$, $QR$, $SVD$) and methods for solving eigenvalue problems ($Ax = \lambda x$).
- Linear and Nonlinear Least Squares: Often, a system $Ax = b$ has no exact solution. The “least squares” approach seeks the best approximate solution by finding the vector $x$ that minimizes the squared Euclidean norm of the residual error:
$$\min_{x} \|Ax - b\|_2^2$$
This concept extends to the nonlinear world, which forms the bedrock of modern machine learning. The goal is to find a set of model parameters $\theta$ that minimizes the difference between a model’s prediction $f(x_i; \theta)$ and observed data $y_i$:
$$\min_{\theta} \sum_{i} \left( f(x_i; \theta) - y_i \right)^2$$
This is fundamentally an optimization problem and a vast, active area of research.
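For the linear case above, here is a minimal sketch of how the minimization looks in NumPy. The matrix and data are made up purely for illustration, and `np.linalg.lstsq` is just one of several ways to solve such a problem:

```python
import numpy as np

# A small overdetermined system: 3 equations, 2 unknowns,
# so Ax = b generally has no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# np.linalg.lstsq finds the x that minimizes ||Ax - b||_2^2.
x, residual, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
print(x)         # best-fit parameters
print(residual)  # squared norm of the leftover error
```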
The Core Concerns of Numerics
Across these diverse topics, a set of fundamental principles and concerns consistently guides our work.
- Convergent Sequences: We almost always approximate a continuous problem with a discrete one. It is essential that, as we increase the “resolution” of our discretization, the sequence of approximate solutions converges to the true solution.
- Computational Cost: An algorithm’s practical utility is determined by its efficiency. We must analyze the runtime (Rechenzeit) and memory usage (Speicherverbrauch), typically as a function of the problem size $n$.
- Generality: We aim to develop algorithms that apply to broad classes of problems, not just a single, specific function.
- Quantifiability: Our analysis must be rigorous. We need to derive precise mathematical bounds on the error of our approximations and the complexity of our algorithms.
The Reality of Computer Arithmetic
Before exploring advanced algorithms, we must confront a foundational and practical issue: computers do not perform exact arithmetic. Understanding the limitations of computer arithmetic is essential to writing robust numerical code.
Absolute and Relative Error
Given an exact value $x$ and a computed approximation $\tilde{x}$, we define two primary measures of error:
- Absolute Error: $\epsilon_{\text{abs}} = |\tilde{x} - x|$
- Relative Error: $\epsilon_{\text{rel}} = \dfrac{|\tilde{x} - x|}{|x|}$, for $x \neq 0$.
The absolute error tells you the magnitude of the mistake, while the relative error puts that mistake in context by comparing it to the magnitude of the true value.
Example: The Problem with Relative Error Near Zero
Imagine we are measuring a quantity whose true value is $x = 10^{-10}$. An algorithm produces an approximation $\tilde{x} = 2 \cdot 10^{-10}$.
- The absolute error is tiny: $\epsilon_{\text{abs}} = |2 \cdot 10^{-10} - 10^{-10}| = 10^{-10}$.
- The relative error is enormous: $\epsilon_{\text{rel}} = \dfrac{10^{-10}}{10^{-10}} = 1$, which corresponds to a 100% error! This happens because dividing by a very small number amplifies the error measure. It highlights that for a computer, zero is a special number, and values close to it require careful handling.
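A quick check of these definitions in Python; the values mirror the example above:

```python
x_true = 1e-10            # true value
x_approx = 2e-10          # computed approximation

abs_err = abs(x_approx - x_true)      # 1e-10 -- looks harmless
rel_err = abs_err / abs(x_true)       # 1.0 -- a 100% relative error
print(abs_err, rel_err)
```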
Floating-Point Representation
The source of these issues is that computers cannot represent the infinite set of real numbers. They use a finite subset called floating-point numbers.
A floating-point number is stored in a format analogous to scientific notation:
$$x = \pm\, d \cdot B^{E}$$
- Base (B): The number system’s base. For most modern computers, this is binary ($B = 2$). For human-readable examples, we often use decimal ($B = 10$).
- Mantissa (d): A number with a fixed number of digits, $m$, representing the significant digits of the value. It is typically normalized to be in a standard range, like $B^{-1} \le |d| < 1$. For example, in base 10, the number 5280 would be normalized to $0.528 \cdot 10^4$.
- Exponent (E): An integer that scales the mantissa, representing the number’s order of magnitude. It is restricted to a finite range, $E_{\min} \le E \le E_{\max}$.
Example of Floating-Point Representation
Let’s represent the decimal number 6.25 in a simplified binary floating-point format.
- Convert to Binary: $6 = 110_2$ and $0.25 = 0.01_2$. So, $6.25 = 110.01_2$.
- Normalize: We move the binary point so that it’s to the left of the first non-zero digit, adjusting the exponent accordingly. $110.01_2 = 0.11001_2 \cdot 2^3$.
- Store: The computer would store the mantissa $d = 0.11001_2$ and the exponent $E = 3$.
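You can peek at this decomposition directly. As a small sketch (not part of the lecture code), Python’s `math.frexp` returns a mantissa normalized to $[0.5, 1)$ and an integer exponent, which matches the convention used above:

```python
import math

d, E = math.frexp(6.25)
print(d, E)            # 0.78125 3  ->  6.25 = 0.78125 * 2**3
print(d * 2**E)        # 6.25; and 0.78125 is 0.11001 in binary
```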
This finite representation has two critical consequences:
- Rounding Error: Any number that cannot be represented exactly in this format must be rounded to the nearest available floating-point number.
- Non-uniform Spacing: The gap between representable numbers is smaller near zero and grows larger as the magnitude of the numbers increases.
The First Rule of Numerical Computing
Because of rounding errors, you must never test for exact equality with floating-point numbers. A check like
if x == 0.0:
is unreliable and almost always a bug. The correct approach is to check if the number’s magnitude is smaller than a chosen tolerance:

if abs(x) < tol:
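A minimal demonstration of why the exact comparison fails and what the tolerance check looks like; the tolerance value here is just an illustrative choice:

```python
x = 0.1 + 0.2 - 0.3   # mathematically zero, but...
print(x)              # 5.551115123125783e-17, not 0.0
print(x == 0.0)       # False -- the exact test misfires

tol = 1e-12           # an application-dependent choice
print(abs(x) < tol)   # True -- the robust way to ask "is x numerically zero?"
```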
Catastrophic Cancellation (Auslöschung)
In the world of numerical computation, not all errors are created equal. The most insidious and dramatic source of error isn’t a bug in your code or a flaw in the hardware; it’s a mathematical phenomenon called catastrophic cancellation.
At its heart, the problem is this: subtracting two numbers that are nearly equal to each other.
Think about it like this. Imagine you have two very long wooden planks, and you want to find the tiny difference in their lengths. You measure each one with a tape measure that’s accurate to about a millimeter.
- Plank A: `5.000` meters (but it could be `5.000 +/- 0.001`)
- Plank B: `4.998` meters (but it could be `4.998 +/- 0.001`)

The calculated difference is `0.002` meters, or 2 millimeters. But what’s the uncertainty? It’s now roughly `0.002` meters as well! Your result is the same size as your potential error. The “signal” (the true difference) has been drowned out by the “noise” (the measurement uncertainty).
In floating-point arithmetic, the “uncertainty” comes from the limited precision of the mantissa (see above). When you subtract two very close numbers, the leading, most significant digits, the ones we trust, cancel each other out. The result is then constructed from the trailing, least significant digits, the “noise” where rounding errors live. The computer then normalizes this result, promoting the noisy digits to the front, thereby catastrophically amplifying the relative error.
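You can watch this digit loss happen in a couple of lines; this is a toy illustration, not taken from the lecture:

```python
a = 1.0 + 1e-13      # two numbers that agree in their leading digits
b = 1.0

exact = 1e-13                          # what the difference should be
computed = a - b                       # the leading digits cancel
print(computed)                        # 9.992007221626409e-14
print(abs(computed - exact) / exact)   # relative error ~ 8e-4 after a single subtraction
```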
A Classic Case: The Quadratic Formula
Let’s see this in action with an example that every high school student knows: finding the roots of a quadratic equation.
Consider the polynomial $p(x) = x^2 - \left(\gamma + \frac{1}{\gamma}\right)x + 1$, which depends on a parameter $\gamma$. If you solve this algebraically, you’ll find the exact roots are beautifully simple: $x_1 = \gamma$ and $x_2 = \frac{1}{\gamma}$. This gives us a perfect “ground truth” to test our numerical methods against.
The standard recipe for finding roots is the quadratic formula:
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$
For our polynomial, we have $a = 1$, $b = -\left(\gamma + \frac{1}{\gamma}\right)$, and $c = 1$. (We reserve $a$, $b$, $c$ for the coefficients in the quadratic formula and use $\gamma$ for our parameter to avoid confusion.)
Now, let’s consider what happens when our parameter $\gamma$ is very large, say $\gamma = 10^8$.
- The term $b = -\left(\gamma + \frac{1}{\gamma}\right)$ becomes a large negative number.
- The term under the square root, the discriminant, is $b^2 - 4ac = \left(\gamma + \frac{1}{\gamma}\right)^2 - 4$. For a large $\gamma$, $b^2$ is huge, so $\sqrt{b^2 - 4ac}$ is a number extremely close to $|b|$.
This is where the trap is set. To find the two roots, we compute:
- $x_1 = \dfrac{-b + \sqrt{b^2 - 4ac}}{2a}$: Here, we are adding two large, positive numbers ($-b$ and the square root). This is numerically stable and works perfectly fine.
- $x_2 = \dfrac{-b - \sqrt{b^2 - 4ac}}{2a}$: Here, we are subtracting two large, nearly identical numbers. This is our catastrophic cancellation.
An Example in Action
Let’s take $\gamma = 10^8$. The true roots are $x_1 = 10^8$ and $x_2 = 10^{-8}$.
Our term $-b = \gamma + \frac{1}{\gamma} = 100000000.00000001$. The term $\sqrt{b^2 - 4}$ will be calculated by the computer as a number incredibly close to $100000000.00000001$.
When we calculate $x_2$, we perform `(a huge number) - (a slightly different huge number)`. In floating-point precision, this might look like:

100000000.00000001 - 100000000.00000000

The result depends entirely on the last, least reliable digits. We lose almost all significant figures of accuracy, and the computed result for $x_2$ will be wildly incorrect.
The plot below shows this failure in practice. For large $\gamma$, the error in the smaller root (computed with subtraction) explodes, while the larger root (computed with addition) remains accurate.
The Fix: A Stable Algorithm via Reformulation
We can’t change the laws of floating-point arithmetic, but we can change our algorithm to avoid the trap. The key is to sidestep the dangerous subtraction.
Remember Vieta’s formulas from algebra? For a quadratic $ax^2 + bx + c$, the product of the roots is $x_1 \cdot x_2 = \frac{c}{a}$. In our case, $a = 1$ and $c = 1$, so we have the simple and powerful relationship:
$$x_1 \cdot x_2 = 1$$
This gives us a brilliant way out.
The Stable Strategy:
- First, calculate the “safe” root using the addition, which we know is stable: $x_1 = \dfrac{-b + \sqrt{b^2 - 4ac}}{2a}$
- Then, use Vieta’s formula to find the second root without any subtraction: $x_2 = \dfrac{c}{a \cdot x_1} = \dfrac{1}{x_1}$
This two-step method completely avoids catastrophic cancellation. Division by a large, stable number is a perfectly safe operation.
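A sketch of both variants side by side; the function names are mine, and the code assumes the polynomial $x^2 - (\gamma + \frac{1}{\gamma})x + 1$ from above:

```python
import numpy as np

def roots_naive(gamma):
    # p(x) = x^2 - (gamma + 1/gamma) x + 1, i.e. a = c = 1
    b = -(gamma + 1.0 / gamma)
    disc = np.sqrt(b * b - 4.0)
    x1 = (-b + disc) / 2.0    # addition of two large positives: stable
    x2 = (-b - disc) / 2.0    # subtraction of nearly equal numbers: cancellation
    return x1, x2

def roots_stable(gamma):
    b = -(gamma + 1.0 / gamma)
    x1 = (-b + np.sqrt(b * b - 4.0)) / 2.0   # the "safe" root
    x2 = 1.0 / x1                            # Vieta: x1 * x2 = c/a = 1
    return x1, x2

gamma = 1e8                    # true roots: 1e8 and 1e-8
print(roots_naive(gamma))      # the small root has lost most of its digits
print(roots_stable(gamma))     # both roots accurate
```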
As the plot below confirms, this reformulated algorithm computes both roots with high accuracy across the entire range of .
The lesson here is: The underlying mathematics is identical, but the computational recipe matters immensely. A simple algebraic reformulation can turn an unstable method that produces garbage into a robust one that delivers the right answer.
Numerical Differentiation: The Finite and the Infinite
How can a computer, a machine that only knows about discrete numbers and finite steps, find the derivative of a function, a concept built on the infinitely small? This is the core challenge of numerical differentiation.
The Obvious Way (And Why It Fails)
Let’s start with the definition of the derivative we all learned in calculus:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
The most straightforward way to turn this into an algorithm is to simply drop the limit and pick a very, very small number for $h$. This gives us the forward difference formula:
$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$
How good is this approximation? We can use Taylor’s theorem, which tells us how to approximate a function around a point:
$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(\xi), \qquad \xi \in (x, x+h)$$
If we rearrange this to solve for our formula, we can see the error we’re making:
$$\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{h}{2} f''(\xi)$$
This is our truncation error. It’s the piece of the infinite Taylor series we “truncated” or cut off. Since it’s proportional to $h$, we say the error is of order $h$, written as $O(h)$. In theory, this is great news! To get a more accurate answer, just make $h$ smaller.
But reality is not so simple. Let’s try it for a simple test function, say $f(x) = e^x$ at $x = 0$ (where we know the exact answer is $f'(0) = 1$).
Look at that table. It’s a disaster!
- For a while, everything works as expected. As we decrease $h$ from $10^{-1}$ down to about $10^{-8}$, the relative error gets smaller and smaller.
- But then, around $h \approx 10^{-8}$, the error suddenly gets worse.
- For any $h$ smaller than that, the error explodes, until at $h = 10^{-16}$, we have a 100% error and our result is pure garbage.
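A short script that reproduces this experiment for $f(x) = e^x$ at $x = 0$; the exact error values depend on the machine, but the pattern is always the same:

```python
import numpy as np

f, x, exact = np.exp, 0.0, 1.0   # f'(0) = 1 for f(x) = e^x

for k in range(1, 17):
    h = 10.0 ** (-k)
    approx = (f(x + h) - f(x)) / h             # forward difference
    rel_err = abs(approx - exact) / abs(exact)
    print(f"h = 1e-{k:02d}   relative error = {rel_err:.2e}")
```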
What went wrong? We’ve run headfirst into our old enemy: catastrophic cancellation. The numerator, $f(x+h) - f(x)$, is the subtraction of two numbers that become nearly identical as $h \to 0$. This wipes out the significant digits, leaving us with rounding noise.
This reveals a fundamental tension in numerical differentiation:
- Truncation Error: The mathematical error from our approximation. It wants a small $h$.
- Rounding Error: The computational error from finite precision. It gets amplified by small $h$ and wants a large $h$.
The total error is a combination of these two, creating a characteristic “V” shape on a log-log plot (see the plot below). The bottom of the “V” is the best we can do, but the method is fundamentally unstable. We need a better idea.
Idea 1: A Clever Escape to the Complex Plane
What if we could get the derivative without subtraction? It sounds impossible, but a beautiful trick exists if we’re allowed to step into the complex numbers. This is the complex step derivative.
Assume our function $f$ is analytic (infinitely differentiable (aka smooth) and representable by its Taylor series, like `exp`, `sin`, `cos`, etc.). Let’s see what happens when we evaluate it at $x + ih$, where $i = \sqrt{-1}$ and $h$ is a small real number.
The Taylor series is:
$$f(x + ih) = f(x) + ih\, f'(x) + \frac{(ih)^2}{2!} f''(x) + \frac{(ih)^3}{3!} f'''(x) + \dots$$
Now let’s simplify the powers of $i$: $i^2 = -1$, $i^3 = -i$, $i^4 = 1$, etc.
Let’s group the terms into a real part (no $i$) and an imaginary part (with $i$):
$$f(x + ih) = \left[ f(x) - \frac{h^2}{2} f''(x) + \dots \right] + i \left[ h f'(x) - \frac{h^3}{6} f'''(x) + \dots \right]$$
Look closely at the imaginary part. It contains the term $h f'(x)$ that we want! We can isolate it by taking the imaginary part of the whole expression:
$$\operatorname{Im}\!\left[ f(x + ih) \right] = h f'(x) - \frac{h^3}{6} f'''(x) + \dots$$
Now, just divide by $h$:
$$\frac{\operatorname{Im}\!\left[ f(x + ih) \right]}{h} = f'(x) - \frac{h^2}{6} f'''(x) + \dots$$
This gives us our formula:
$$f'(x) = \frac{\operatorname{Im}\!\left[ f(x + ih) \right]}{h} + O(h^2)$$
Why the Complex Step is Brilliant
- No Cancellation: The formula is `Im[f(x+ih)] / h`. There is no subtraction of nearly equal numbers. We have completely sidestepped the cause of our previous failure.
- High Accuracy: The truncation error is now $O(h^2)$. This means if you halve $h$, the error doesn’t just get 2x smaller, it gets 4x smaller. It converges much more quickly.
The plots below show the proof. The forward difference (`diffd1`) shows the V-shaped error profile. The complex step derivative (`diffih`) is a straight line of decreasing error until it hits the limit of machine precision. It is vastly more stable and accurate.
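A minimal complex-step sketch in NumPy; the function name and the default step size are my choices, and any analytic function that accepts complex input works:

```python
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    # No subtraction anywhere, so even an absurdly small h is safe.
    return np.imag(f(x + 1j * h)) / h

# NumPy's exp, sin, cos, ... all accept complex arguments.
print(complex_step_derivative(np.exp, 0.0))   # exact answer: 1.0
print(complex_step_derivative(np.sin, 1.0))   # exact answer: cos(1) = 0.5403...
```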
Idea 2: Getting More from Reality with Richardson Extrapolation
The complex step is amazing, but what if our function or programming language doesn’t support complex numbers (or setting this up is simply too much of a pain…)? There’s another, profoundly powerful idea called Richardson Extrapolation that works entirely in the real domain.
The trick is to start with a better, more symmetric formula: the central difference formula. We get this by combining two Taylor expansions:
$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \dots$$
$$f(x-h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + \dots$$
If we subtract the second equation from the first, the even-powered terms ($f(x)$, $\frac{h^2}{2} f''(x)$, etc.) cancel out perfectly:
$$f(x+h) - f(x-h) = 2h f'(x) + \frac{h^3}{3} f'''(x) + \dots$$
Solving for $f'(x)$ gives us the central difference formula:
$$f'(x) = \frac{f(x+h) - f(x-h)}{2h} - c_1 h^2 - c_2 h^4 - c_3 h^6 - \dots$$
The coefficients $c_k$ in the error expansion are given by the general formula (but just think of them as some constants):
$$c_k = \frac{f^{(2k+1)}(x)}{(2k+1)!}$$
This is already better than the forward difference because its truncation error is $O(h^2)$. But the real magic is that the error expansion contains only even powers of h. This is the key we can exploit.
Let’s call our approximation $D(h) := \frac{f(x+h) - f(x-h)}{2h}$. We have:
$$D(h) = f'(x) + c_1 h^2 + c_2 h^4 + \dots$$
What if we compute the same approximation but with half the step size, $\frac{h}{2}$?
$$D\!\left(\tfrac{h}{2}\right) = f'(x) + c_1 \frac{h^2}{4} + c_2 \frac{h^4}{16} + \dots$$
Now we have a system of two equations. We can play a simple algebraic game to eliminate the biggest error term, $c_1 h^2$. Multiply the second equation by 4 and subtract the first:
$$4 D\!\left(\tfrac{h}{2}\right) - D(h) = 3 f'(x) - \frac{3}{4} c_2 h^4 + \dots$$
Rearranging gives us a brand new, spectacular approximation:
$$D_1(h) := \frac{4 D\!\left(\tfrac{h}{2}\right) - D(h)}{3} = f'(x) - \frac{1}{4} c_2 h^4 + \dots$$
The error of this new formula is $O(h^4)$! We’ve combined two $O(h^2)$ results to get an $O(h^4)$ result. This process is called Richardson Extrapolation. We can repeat it again and again, combining $O(h^4)$ results to get an $O(h^6)$ result, and so on. This is typically organized in a table called the Richardson Schema.
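One way such a schema can be coded is the triangular table below: column 0 holds the central differences $D(h), D(h/2), D(h/4), \dots$, and every further column cancels the next even power of $h$. This is a sketch under the conventions of this section, not the official lecture code:

```python
import numpy as np

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)

def richardson(f, x, h, levels=4):
    # T[i, k] is the value in row i, column k of the Richardson schema.
    T = np.zeros((levels, levels))
    for i in range(levels):
        T[i, 0] = central_diff(f, x, h / 2**i)
        for k in range(1, i + 1):
            # Eliminate the h^(2k) error term, exactly as in the 4*D(h/2) - D(h) step.
            T[i, k] = (4**k * T[i, k - 1] - T[i - 1, k - 1]) / (4**k - 1)
    return T[-1, -1]

print(central_diff(np.exp, 1.0, 0.1))   # O(h^2): a few correct digits
print(richardson(np.exp, 1.0, 0.1))     # several more correct digits
print(np.exp(1.0))                      # exact derivative of e^x at x = 1
```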
The Power of Extrapolation
Richardson Extrapolation doesn’t eliminate catastrophic cancellation, but it’s so powerful that it lets us achieve very high accuracy using a relatively large value of $h$, where cancellation is not yet a problem. It’s a method for “accelerating convergence”, getting a great answer from mediocre ones.
The plots below show how the central difference (`diffd2`) is already better than the forward difference, but the result after one step of Richardson extrapolation (`diffRichardsonV`) is orders of magnitude more accurate.
Computational Cost
To compare algorithms, we analyze their asymptotic complexity using Big O notation.
Definition: Big O Notation
We say a function $f(n)$ is $O(g(n))$ if there exists a constant $C > 0$ and a value $n_0$ such that for all $n \ge n_0$, we have $|f(n)| \le C \cdot g(n)$. This describes the growth rate of the computational cost for large problem sizes $n$.
Appendix: Computing with Matrices
The building blocks of many numerical algorithms are matrix and vector operations.
- Inner Product (Dot Product): For vectors $x, y \in \mathbb{C}^n$, the inner product is $x^H y = \sum_{i=1}^{n} \bar{x}_i\, y_i$. Note the complex conjugate on the first vector.
- Outer Product: The outer product $x y^H$ of $x \in \mathbb{C}^m$ and $y \in \mathbb{C}^n$ produces an $m \times n$ matrix.
- Matrix Multiplication: The entry $(AB)_{ij}$ is the inner product of the $i$-th row of $A$ with the $j$-th column of $B$.
In modern numerical environments like Python with NumPy, these operations are highly optimized. It is crucial to use these built-in, vectorized functions rather than writing explicit loops.
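A few of the corresponding NumPy calls, purely for illustration (sizes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x, y = rng.standard_normal(n), rng.standard_normal(n)
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

inner = np.vdot(x, y)     # inner product; conjugates the first argument
outer = np.outer(x, y)    # n x n outer product
C = A @ B                 # matrix-matrix product: O(n^3), but heavily optimized

# The same inner product as an explicit Python loop is far slower --
# always prefer the vectorized routines.
inner_loop = sum(x[i] * y[i] for i in range(n))
```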