Lecture from 20.12.2024 | Video: Videos ETHZ

Singular Value Decomposition

Definition: Singular Value Decomposition (SVD)

Recall from last lecture…

Let $A \in \mathbb{R}^{m \times n}$ be any matrix. A singular value decomposition of $A$ is a factorization of the form:

$$A = U \Sigma V^\top,$$

where:

  • $U \in \mathbb{R}^{m \times m}$ is an orthogonal matrix whose columns are called the left singular vectors of $A$.
  • $V \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns are called the right singular vectors of $A$.
  • $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with non-negative entries $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ on the diagonal, called the singular values of $A$.
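
Not from the lecture: a minimal NumPy sketch that checks this definition numerically for a random matrix (the sizes and the use of `np.linalg.svd`, which returns $V^\top$ rather than $V$, are my own choices).

```python
import numpy as np

# Hypothetical example: verify the SVD definition numerically for a random matrix.
m, n = 5, 3
A = np.random.randn(m, n)

# np.linalg.svd returns U, the singular values, and V^T (not V).
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: m x m, s: min(m, n), Vt: n x n

# Assemble the m x n "diagonal" matrix Sigma from the singular values.
Sigma = np.zeros((m, n))
Sigma[:min(m, n), :min(m, n)] = np.diag(s)

print(np.allclose(U @ U.T, np.eye(m)))             # U is orthogonal
print(np.allclose(Vt.T @ Vt, np.eye(n)))           # V is orthogonal
print(np.allclose(U @ Sigma @ Vt, A))              # A = U Sigma V^T
print(np.all(s >= 0) and np.all(s[:-1] >= s[1:]))  # non-negative, decreasing singular values
```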

Here is a complete proof of the Singular Value Decomposition (SVD) theorem, first for the general (full) SVD and then for the compact SVD.

Note that the lecturer only looked at a special case; below is the generalized proof for the complete SVD and the compact SVD…

SVD Theorem

Every matrix $A \in \mathbb{R}^{m \times n}$ has a singular value decomposition of the form:

$$A = U \Sigma V^\top,$$

where:

  • $U \in \mathbb{R}^{m \times m}$ is an orthogonal matrix whose columns are the left singular vectors of $A$.
  • $V \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns are the right singular vectors of $A$.
  • $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with non-negative entries on the diagonal, called the singular values of $A$.

Proof of the Full SVD

  1. Spectral Decomposition of $A^\top A$: Consider the symmetric matrix $A^\top A \in \mathbb{R}^{n \times n}$. By the spectral theorem, it has a complete set of orthonormal eigenvectors and can be decomposed as:

    $$A^\top A = V \Lambda V^\top,$$

    where $V \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns $v_1, \dots, v_n$ are the eigenvectors of $A^\top A$, and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ is a diagonal matrix containing the eigenvalues of $A^\top A$.

  2. Ordering Eigenvalues: Arrange the eigenvalues in $\Lambda$ in decreasing order: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$. Note that the eigenvalues are non-negative because $A^\top A$ is positive semidefinite.

  3. Rank and Non-zero Eigenvalues: Let $r := \operatorname{rank}(A)$. Then, $A^\top A$ also has rank $r$, and exactly $r$ eigenvalues are non-zero: $\lambda_1 \ge \dots \ge \lambda_r > 0$ and $\lambda_{r+1} = \dots = \lambda_n = 0$.

  4. Singular Values: Define the singular values $\sigma_i := \sqrt{\lambda_i}$ for $i = 1, \dots, n$. Note that only the first $r$ singular values are non-zero.

  5. Constructing $\Sigma$: Let $\Sigma \in \mathbb{R}^{m \times n}$ be a diagonal matrix with the singular values $\sigma_1, \sigma_2, \dots$ on the diagonal, arranged in decreasing order.

  6. Defining $U_r$: Let $V_r \in \mathbb{R}^{n \times r}$ be the matrix containing the first $r$ columns of $V$ (eigenvectors of $A^\top A$ corresponding to the non-zero eigenvalues). Define:

    $$U_r := A V_r \Sigma_r^{-1}, \qquad \text{i.e. } u_i = \tfrac{1}{\sigma_i} A v_i \text{ for } i = 1, \dots, r,$$

    where $\Sigma_r = \operatorname{diag}(\sigma_1, \dots, \sigma_r)$ is the diagonal matrix containing only the first $r$ (non-zero) singular values.

  7. Orthogonality of $U_r$: We need to show that $U_r^\top U_r = I_r$:

    $$U_r^\top U_r = \Sigma_r^{-1} V_r^\top A^\top A V_r \Sigma_r^{-1} = \Sigma_r^{-1} V_r^\top (V \Lambda V^\top) V_r \Sigma_r^{-1} = \Sigma_r^{-1} \Lambda_r \Sigma_r^{-1} = I_r,$$

    since $V^\top V_r$ picks out the first $r$ columns of $V$ and $\Lambda_r = \Sigma_r^2$.

  8. Extending to an Orthogonal Matrix $U$: Since $U_r$ has orthonormal columns, we can extend it to a full orthogonal matrix $U \in \mathbb{R}^{m \times m}$ by adding $m - r$ orthonormal columns that are also orthogonal to the columns of $U_r$. We can use the Gram-Schmidt process for this.

  9. Showing $U\Sigma = AV$: We’ll show that $U\Sigma = AV$ by demonstrating that their corresponding columns are equal.

    • For $i \le r$: the $i$-th column of $U\Sigma$ is $\sigma_i u_i = \sigma_i \cdot \tfrac{1}{\sigma_i} A v_i = A v_i$, which is exactly the $i$-th column of $AV$.
    • For $i > r$: the $i$-th column of $U\Sigma$ is zero (since $\sigma_i = 0$), so we need to show that $A v_i = 0$. We know $v_i$ is orthogonal to the columns of $V_r$, which are linear combinations of the columns of $A^\top$ (each $v_j = \tfrac{1}{\sigma_j} A^\top u_j$) and in fact span the row space of $A$. This means that $v_i$ is orthogonal to every vector in the column space of $A^\top$, which is equivalent to $v_i$ being in the null space of $A$. Thus, $A v_i = 0$.

    Combining these, we see that $U\Sigma = AV$.

  10. Final Step: Since $V$ is orthogonal, $V V^\top = I_n$. Multiplying both sides of $U\Sigma = AV$ by $V^\top$ on the right, we get:

    $$U \Sigma V^\top = A V V^\top = A.$$

This completes the proof of the full SVD.
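
To make the construction concrete, here is a small NumPy sketch (my own illustration, not from the lecture) that follows the proof step by step: eigendecomposition of $A^\top A$, singular values $\sigma_i = \sqrt{\lambda_i}$, $U_r = A V_r \Sigma_r^{-1}$, and an extension of $U_r$ to a full orthogonal $U$. The example matrix and the numerical tolerance are arbitrary choices.

```python
import numpy as np

# Illustrative sketch of the proof's construction (names follow the proof steps).
m, n = 6, 4
A = np.random.randn(m, n) @ np.diag([3.0, 1.0, 0.5, 0.0])  # rank 3 by construction

# Steps 1-2: spectral decomposition of A^T A, eigenvalues sorted in decreasing order.
lam, V = np.linalg.eigh(A.T @ A)          # eigh returns eigenvalues in ascending order
idx = np.argsort(lam)[::-1]
lam, V = lam[idx], V[:, idx]

# Steps 3-5: rank, singular values sigma_i = sqrt(lambda_i), and the m x n matrix Sigma.
r = int(np.sum(lam > 1e-10))
sigma = np.sqrt(np.clip(lam, 0.0, None))
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(sigma)

# Step 6: U_r = A V_r Sigma_r^{-1}, i.e. u_i = A v_i / sigma_i for i <= r.
V_r = V[:, :r]
U_r = A @ V_r / sigma[:r]

# Step 8: extend U_r to a full orthogonal U. Project random vectors onto the
# orthogonal complement of span(U_r) and orthonormalize them (QR plays the role
# of Gram-Schmidt here).
X = np.random.randn(m, m - r)
X = X - U_r @ (U_r.T @ X)
Q, _ = np.linalg.qr(X)
U = np.hstack([U_r, Q])

print(np.allclose(U_r.T @ U_r, np.eye(r)))   # step 7: U_r has orthonormal columns
print(np.allclose(U.T @ U, np.eye(m)))       # U is orthogonal
print(np.allclose(U @ Sigma @ V.T, A))       # step 10: A = U Sigma V^T
```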

Compact SVD Theorem

Every matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$ has a compact singular value decomposition of the form:

$$A = U_r \Sigma_r V_r^\top,$$

where:

  • $U_r \in \mathbb{R}^{m \times r}$ contains the first $r$ left singular vectors of $A$.
  • $V_r \in \mathbb{R}^{n \times r}$ contains the first $r$ right singular vectors of $A$.
  • $\Sigma_r \in \mathbb{R}^{r \times r}$ is a diagonal matrix containing the first $r$ (non-zero) singular values of $A$.

Proof of the Compact SVD

The proof of the compact SVD follows directly from the proof of the full SVD. We have already constructed $U_r$, $\Sigma_r$, and $V_r$ in steps 1-7 above. We also showed that $U_r^\top U_r = I_r$.

In step 9, we showed that $A v_i = \sigma_i u_i$ for $i = 1, \dots, r$. This can be written in matrix form as:

$$A V_r = U_r \Sigma_r.$$

Multiplying both sides on the right by $V_r^\top$, and using the fact that $A V_r V_r^\top = A$ (since $V_r V_r^\top$ is the orthogonal projection onto the row space of $A$, and $A$ vanishes on its orthogonal complement, the null space of $A$), we get:

$$A = A V_r V_r^\top = U_r \Sigma_r V_r^\top.$$

From the full SVD proof, we know that $A = U \Sigma V^\top$. By considering only the first $r$ columns of $U$ and $V$, and the non-zero entries in $\Sigma$, and using the fact that the remaining columns of $V$ are in the nullspace of $A$, we can write $A = U_r \Sigma_r V_r^\top$.

This completes the proof of the compact SVD.
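
As a quick numerical sanity check (my own sketch, not part of the notes), one can take the full SVD of a rank-deficient matrix, keep only the first $r$ columns of $U$ and $V$ and the leading $r \times r$ block of $\Sigma$, and verify that the product still reproduces $A$:

```python
import numpy as np

# Hypothetical example: compact SVD of a rank-deficient matrix.
m, n, r = 7, 5, 2
A = np.random.randn(m, r) @ np.random.randn(r, n)   # rank r by construction

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Keep only the parts that belong to the r non-zero singular values.
U_r = U[:, :r]              # m x r
Sigma_r = np.diag(s[:r])    # r x r
V_r = Vt[:r, :].T           # n x r

print(np.allclose(U_r @ Sigma_r @ V_r.T, A))   # the compact SVD already reproduces A
print(s)                                       # only the first r singular values are (numerically) non-zero
```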

Consequence of the SVD: Sum of Rank-1 Matrices

Not part of the lecture, but part of the slides…

Theorem: A rank-$r$ matrix $A \in \mathbb{R}^{m \times n}$ can be expressed as a sum of $r$ rank-1 matrices.

Proof: This follows directly from the compact SVD:

$$A = U_r \Sigma_r V_r^\top = \sum_{i=1}^{r} \sigma_i\, u_i v_i^\top.$$

Each outer product $u_i v_i^\top$ is a rank-1 matrix, and the SVD expresses $A$ as a weighted sum of these rank-1 matrices, where the weights are the singular values $\sigma_i$.

This decomposition is fundamental in many applications, including low-rank matrix approximation, dimensionality reduction, and data compression. The SVD provides a powerful and insightful way to understand the structure of any matrix.

Further Remarks on the SVD

Abstraction and Change of Basis: Understanding the Motivation for SVD (Remark 1)

Let’s delve into the core idea behind the Singular Value Decomposition (SVD) by exploring how a linear transformation can be represented differently under a change of basis. This provides a powerful way to understand the motivation and significance of the SVD.

Representing Linear Transformations with Respect to Different Bases

Consider a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ represented by the matrix $A \in \mathbb{R}^{m \times n}$. This means that for any vector $x \in \mathbb{R}^n$, the transformation is computed as $y = Ax$, where $y \in \mathbb{R}^m$.

  • Canonical Basis: Typically, we represent vectors using the standard (canonical) basis. In $\mathbb{R}^n$, the canonical basis is $e_1, \dots, e_n$, where $e_i$ is a vector with a 1 in the $i$-th position and 0s elsewhere. Similarly, the canonical basis in $\mathbb{R}^m$ is $e_1, \dots, e_m$. When we write $y = Ax$, it’s implicitly understood that $x$ is represented in the canonical basis of $\mathbb{R}^n$, and the resulting vector $y$ is represented in the canonical basis of $\mathbb{R}^m$.

  • Expanding in the Canonical Basis: We can express the result of the transformation, $Ax$, as a linear combination of the basis vectors of the codomain, $e_1, \dots, e_m$. Let $(Ax)_i$ denote the $i$-th component of $Ax$. Then, we can write:

    $$Ax = \sum_{i=1}^{m} (Ax)_i\, e_i.$$

  • The Role of Abstraction: This formula highlights the role of abstraction. We are expressing the action of the linear transformation in terms of its effect on the components of the input vector, and then using the basis vectors of the codomain to construct the output vector.

Changing the Basis

Now, suppose we want to represent our vectors using different bases. Let:

  • $V \in \mathbb{R}^{n \times n}$ be an invertible matrix whose columns form a basis for $\mathbb{R}^n$.
  • $W \in \mathbb{R}^{m \times m}$ be an invertible matrix whose columns form a basis for $\mathbb{R}^m$.

Any vector $x \in \mathbb{R}^n$ can be expressed as a linear combination of the basis vectors in $V$:

$$x = V \tilde{x},$$

where $\tilde{x} \in \mathbb{R}^n$ is the vector of coordinates of $x$ with respect to the basis $V$.

Similarly, any vector $y \in \mathbb{R}^m$ can be expressed as a linear combination of the basis vectors in $W$:

$$y = W \tilde{y},$$

where $\tilde{y} \in \mathbb{R}^m$ is the vector of coordinates of $y$ with respect to the basis $W$.

Finding a New Representation for the Linear Transformation

Our goal is to find a matrix $B$ that represents the same linear transformation as $A$, but with respect to the new bases $V$ and $W$. In other words, we want a matrix $B$ such that if $x = V\tilde{x}$ and $y = W\tilde{y}$ with $y = Ax$, then:

$$\tilde{y} = B \tilde{x}.$$

This means that $B$ takes the coordinates $\tilde{x}$ of $x$ in the basis $V$ and produces the coordinates $\tilde{y}$ of $y$ in the basis $W$.

We have $y = Ax$, which can be written as $W\tilde{y} = A V \tilde{x}$. Multiplying both sides by $W^{-1}$ on the left gives $\tilde{y} = W^{-1} A V \tilde{x}$. Since we want $\tilde{y} = B\tilde{x}$, we can define $B := W^{-1} A V$. Multiplying both sides on the left by $W$ and on the right by $V^{-1}$, we also get the equivalent relationship $A = W B V^{-1}$.
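
The following NumPy sketch (my own, not from the lecture) illustrates this change-of-basis relation with arbitrary invertible basis matrices; the names $A$, $B$, $V$, $W$ follow the notation above.

```python
import numpy as np

# Illustration: representing the same linear map with respect to bases V (domain) and W (codomain).
m, n = 3, 2
A = np.random.randn(m, n)
V = np.random.randn(n, n)      # columns: a basis of R^n (invertible with probability 1)
W = np.random.randn(m, m)      # columns: a basis of R^m

B = np.linalg.inv(W) @ A @ V   # B = W^{-1} A V

# Check: if x = V x_tilde, then y = A x has coordinates y_tilde = B x_tilde in the basis W.
x_tilde = np.random.randn(n)
x = V @ x_tilde
y = A @ x
y_tilde = B @ x_tilde
print(np.allclose(W @ y_tilde, y))   # same vector, expressed in the two bases
```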

The Essence of SVD: The SVD seeks a special form of this relationship where:

  1. $W$ and $V$ are orthogonal matrices (their columns are orthonormal bases). This makes inverting them easy, as $W^{-1} = W^\top$ and $V^{-1} = V^\top$.
  2. $B$ is a diagonal matrix (ideally with non-negative entries in decreasing order). This makes the transformation easy to understand, as it simply scales the components along the new basis vectors. Writing $W = U$ and $B = \Sigma$, this is exactly the SVD $A = U \Sigma V^\top$.

In essence, the SVD aims to find the “nicest” possible bases for the domain and codomain of a linear transformation, such that the transformation’s action is reduced to a simple scaling along the new coordinate axes. This is what makes the SVD such a powerful tool for understanding and manipulating linear transformations and matrices.

SVD and its Connection to the Spectral Theorem (Remark 2)

Let’s explore the connection between the Singular Value Decomposition (SVD) of an arbitrary matrix $A$ and the spectral theorem applied to the symmetric matrices $A^\top A$ and $A A^\top$. This connection provides a deeper understanding of the SVD and its relationship to eigenvalues and eigenvectors.

Starting with the SVD

Let $A \in \mathbb{R}^{m \times n}$ be an arbitrary matrix, and let its SVD be given by:

$$A = U \Sigma V^\top,$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with the singular values of $A$ on its diagonal.

Transpose of A

Taking the transpose of $A$, we get:

$$A^\top = V \Sigma^\top U^\top.$$

Forming $A^\top A$

Now, let’s consider the product $A^\top A$:

$$A^\top A = (V \Sigma^\top U^\top)(U \Sigma V^\top) = V \Sigma^\top (U^\top U) \Sigma V^\top.$$

Since $U$ is orthogonal, $U^\top U = I_m$. Thus:

$$A^\top A = V (\Sigma^\top \Sigma) V^\top.$$

Observations
  1. Eigenvectors of $A^\top A$: The equation $A^\top A = V (\Sigma^\top \Sigma) V^\top$ has the form of an eigendecomposition. The columns of $V$ are the eigenvectors of $A^\top A$.
  2. Eigenvalues of $A^\top A$: The matrix $\Sigma^\top \Sigma \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Its diagonal entries are the squares of the singular values of $A$ ($\sigma_i^2$), which are also the eigenvalues of $A^\top A$. If $r = \operatorname{rank}(A) < n$, then the last $n - r$ diagonal entries of $\Sigma^\top \Sigma$ are zero.
  3. Spectral Theorem: This decomposition of $A^\top A$ is precisely the spectral theorem applied to the symmetric matrix $A^\top A$. It confirms that $A^\top A$ has a complete set of orthonormal eigenvectors (the columns of $V$) and real, non-negative eigenvalues (the diagonal entries of $\Sigma^\top \Sigma$).

Forming $A A^\top$

Similarly, let’s consider the product $A A^\top$:

$$A A^\top = (U \Sigma V^\top)(V \Sigma^\top U^\top) = U \Sigma (V^\top V) \Sigma^\top U^\top.$$

Since $V$ is orthogonal, $V^\top V = I_n$. Thus:

$$A A^\top = U (\Sigma \Sigma^\top) U^\top.$$

Observations
  1. Eigenvectors of $A A^\top$: The equation $A A^\top = U (\Sigma \Sigma^\top) U^\top$ is an eigendecomposition of $A A^\top$. The columns of $U$ are the eigenvectors of $A A^\top$.
  2. Eigenvalues of $A A^\top$: The matrix $\Sigma \Sigma^\top \in \mathbb{R}^{m \times m}$ is a diagonal matrix. Its diagonal entries are also the squares of the singular values of $A$ ($\sigma_i^2$), which are the eigenvalues of $A A^\top$. If $r = \operatorname{rank}(A) < m$, then the last $m - r$ diagonal entries of $\Sigma \Sigma^\top$ are zero.
  3. Spectral Theorem: This decomposition of $A A^\top$ is the spectral theorem applied to the symmetric matrix $A A^\top$. It confirms that $A A^\top$ has a complete set of orthonormal eigenvectors (the columns of $U$) and real, non-negative eigenvalues (the diagonal entries of $\Sigma \Sigma^\top$).

A Generalized Notion of Eigenvectors

The SVD, through its connection to $A^\top A$ and $A A^\top$, provides a generalization of the concept of eigenvectors. While eigenvectors are traditionally defined for square matrices, the singular vectors of a general matrix $A \in \mathbb{R}^{m \times n}$ (which can be rectangular) can be thought of as analogous to eigenvectors in the following sense:

  • The left singular vectors (columns of $U$) are eigenvectors of $A A^\top$.
  • The right singular vectors (columns of $V$) are eigenvectors of $A^\top A$.
  • The singular values of $A$ are the square roots of the non-zero eigenvalues of both $A^\top A$ and $A A^\top$.

In other words, the SVD identifies two sets of orthonormal vectors (left and right singular vectors) that are related through the action of $A$ and $A^\top$, and it provides a set of scaling factors (singular values) that quantify the “stretching” or “shrinking” effect of the linear transformation along these directions. This provides a more general framework for understanding how a linear transformation acts on vectors, even when the matrix is not square or symmetric.
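
A short numerical check of this remark (my own sketch, not from the lecture): the eigenvalues of $A^\top A$ and $A A^\top$ coincide with the squared singular values of $A$. The matrix sizes are arbitrary.

```python
import numpy as np

# Compare eigenvalues of A^T A and A A^T with the squared singular values of A.
m, n = 5, 3
A = np.random.randn(m, n)

U, s, Vt = np.linalg.svd(A)

eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # n eigenvalues, descending
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]   # m eigenvalues, descending

print(np.allclose(eig_AtA, s**2))        # eigenvalues of A^T A are sigma_i^2
print(np.allclose(eig_AAt[:n], s**2))    # the non-zero eigenvalues of A A^T agree
print(np.allclose(eig_AAt[n:], 0))       # the remaining m - n eigenvalues are zero
```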

Here are some additional remarks on the properties and applications of the Singular Value Decomposition (SVD), expanding on the concepts presented in the lecture.

SVD and the (Pseudo)inverse (Remark 3)

Invertible Matrices

If $A \in \mathbb{R}^{n \times n}$ is invertible and has an SVD $A = U \Sigma V^\top$, then its inverse can be easily computed using the SVD components:

$$A^{-1} = V \Sigma^{-1} U^\top.$$

Explanation

Since $A$ is invertible, all its singular values are non-zero, and $\Sigma$ is a square, diagonal matrix with non-zero entries. Thus, $\Sigma^{-1}$ is simply the diagonal matrix with the reciprocals $1/\sigma_i$ of the singular values on the diagonal. We can verify that this is the inverse:

$$A A^{-1} = U \Sigma V^\top V \Sigma^{-1} U^\top = U \Sigma \Sigma^{-1} U^\top = U U^\top = I.$$

Pseudoinverse

The concept of an inverse can be generalized to non-square or singular matrices using the pseudoinverse. The SVD provides a way to define the Moore-Penrose pseudoinverse, denoted by $A^+$. If $A = U \Sigma V^\top$, then:

$$A^+ = V \Sigma^+ U^\top,$$

where $\Sigma^+ \in \mathbb{R}^{n \times m}$ is obtained from $\Sigma$ by taking the reciprocal of each non-zero singular value, leaving the zeros in place, and transposing the matrix.

Example

If $\Sigma = \begin{pmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \end{pmatrix}$ with $\sigma_1, \sigma_2 > 0$, then $\Sigma^+ = \begin{pmatrix} 1/\sigma_1 & 0 \\ 0 & 1/\sigma_2 \\ 0 & 0 \end{pmatrix}$. If $\Sigma = \begin{pmatrix} \sigma_1 & 0 \\ 0 & 0 \end{pmatrix}$ with $\sigma_1 > 0$, then $\Sigma^+ = \begin{pmatrix} 1/\sigma_1 & 0 \\ 0 & 0 \end{pmatrix}$.

Significance: The pseudoinverse is a powerful tool for solving least-squares problems, finding minimum-norm solutions to underdetermined systems, and analyzing matrix properties.
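
A small sketch (my own, not from the lecture) that builds $A^+ = V \Sigma^+ U^\top$ from the SVD and compares it against NumPy's built-in pseudoinverse and least-squares solver; the tolerance used to decide which singular values count as zero is an arbitrary choice.

```python
import numpy as np

# Moore-Penrose pseudoinverse via the SVD: A^+ = V Sigma^+ U^T.
m, n = 6, 4
A = np.random.randn(m, n)

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Sigma^+: reciprocal of each non-zero singular value, zeros kept, then transposed (n x m).
tol = 1e-10
Sigma_plus = np.zeros((n, m))
Sigma_plus[:len(s), :len(s)] = np.diag([1.0 / x if x > tol else 0.0 for x in s])

A_plus = Vt.T @ Sigma_plus @ U.T

print(np.allclose(A_plus, np.linalg.pinv(A)))      # matches NumPy's pseudoinverse

# A^+ b is the least-squares solution of the overdetermined system A x = b.
b = np.random.randn(m)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(A_plus @ b, x_ls))
```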

Compact SVD (Remark 4)

If $A \in \mathbb{R}^{m \times n}$ has rank $r$, its SVD can be represented in a compact form:

$$A = U_r \Sigma_r V_r^\top,$$

where:

  • $U_r \in \mathbb{R}^{m \times r}$ contains the first $r$ left singular vectors of $A$ (corresponding to the non-zero singular values).
  • $\Sigma_r \in \mathbb{R}^{r \times r}$ is a diagonal matrix containing the non-zero singular values of $A$.
  • $V_r \in \mathbb{R}^{n \times r}$ contains the first $r$ right singular vectors of $A$ (corresponding to the non-zero singular values).

Example

Let $A$ be an $m \times n$ matrix with rank 3. Then the compact SVD of $A$ would have the form:

$$A = U_3 \Sigma_3 V_3^\top,$$

where $U_3$ is $m \times 3$, $\Sigma_3$ is $3 \times 3$, and $V_3$ is $n \times 3$.

SVD as a Sum of Rank-1 Matrices (Remark 5)

The SVD provides a way to express any matrix as a sum of rank-1 matrices.

Theorem

Let $A \in \mathbb{R}^{m \times n}$ have rank $r$, with non-zero singular values $\sigma_1 \ge \dots \ge \sigma_r > 0$, left singular vectors $u_1, \dots, u_r$, and right singular vectors $v_1, \dots, v_r$. Then $A$ can be expressed as:

$$A = \sum_{i=1}^{r} \sigma_i\, u_i v_i^\top.$$

Explanation

This follows directly from the compact SVD:

$$A = U_r \Sigma_r V_r^\top.$$

Expanding the matrix multiplication, we get:

$$A = \sum_{i=1}^{r} \sigma_i\, u_i v_i^\top.$$

Each term $\sigma_i\, u_i v_i^\top$ is a rank-1 matrix because it is the outer product of two vectors, $u_i$ and $v_i$, scaled by the singular value $\sigma_i$.
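
A brief NumPy sketch (my own, not from the lecture) that rebuilds a matrix from its rank-1 terms $\sigma_i u_i v_i^\top$; the matrix size is arbitrary.

```python
import numpy as np

# Rebuild A as a weighted sum of rank-1 outer products sigma_i * u_i v_i^T.
m, n = 5, 4
A = np.random.randn(m, n)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_sum = np.zeros((m, n))
for i in range(len(s)):
    A_sum += s[i] * np.outer(U[:, i], Vt[i, :])   # each term is a rank-1 matrix

print(np.allclose(A_sum, A))   # the sum of the rank-1 terms reproduces A
```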

Significance

This decomposition shows that any matrix can be built up from a linear combination of simple rank-1 matrices. This is fundamental to many applications, including:

  • Low-rank approximation: Truncating the sum after $k$ terms provides the best rank-$k$ approximation of $A$ in terms of the Frobenius norm and the spectral norm (see the sketch after this list).
  • Data compression: Storing only the first $k$ singular values and the corresponding singular vectors allows for a compressed representation of $A$.
  • Dimensionality reduction: The SVD can be used to project data onto a lower-dimensional subspace while preserving as much variance as possible (Principal Component Analysis).
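
Here is a small NumPy sketch (my own, not from the lecture) of the low-rank approximation mentioned above: truncating the sum after $k$ terms and checking that the spectral-norm error equals the first discarded singular value $\sigma_{k+1}$. The sizes and $k$ are arbitrary choices.

```python
import numpy as np

# Truncated SVD as a low-rank approximation.
m, n, k = 50, 40, 5
A = np.random.randn(m, n)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep the k largest singular values and their singular vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The spectral-norm error of the truncation equals the first discarded singular value.
err = np.linalg.norm(A - A_k, ord=2)
print(np.isclose(err, s[k]))

# Storage: k * (m + n + 1) numbers instead of m * n.
print(k * (m + n + 1), "vs", m * n)
```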

The rest of the lecture didn’t seem useful enough to note down as it was mostly a bit of repetition…

This was the last lecture…