The singular value decomposition (SVD) is a powerful matrix decomposition. It is used all the time in statistics and numerical linear algebra. The SVD is at the heart of the principal component analysis, it demonstrates what’s going on in ridge regression and it is one way to construct the Moore-Penrose inverse of a matrix. For more SVD love, see the tweets below.

In this post I’ll define the SVD and prove that it always exists. At the end we’ll look at some pictures to better understand what’s going on.

### Definition

Let be a matrix. We will define the singular value decomposition first in the case . The SVD consists of three matrix and such that . The matrix is required to be diagonal with non-negative diagonal entries . These numbers are called the singular values of . The matrices and are required to orthogonal matrices so that , the identity matrix. Note that since is square we also have however we won’t have unless .

In the case when , we can define the SVD of in terms of the SVD of . Let and be the SVD of so that . The SVD of is then given by transposing both sides of this equation giving and .

### Construction

The SVD of a matrix can be found by iteratively solving an optimisation problem. We will first describe an iterative procedure that produces matrices and . We will then verify that and satisfy the defining properties of the SVD.

We will construct the matrices and one column at a time and we will construct the diagonal matrix one entry at a time. To construct the first columns and entries, recall that the matrix is really a linear function from to given by . We can thus define the operator norm of via

where represents the Euclidean norm of and is the Euclidean norm of . The set of vectors is a compact set and the function is continuous. Thus, the supremum used to define is achieved at some vector . Define . If , then define . If , then define to be an arbitrary vector in with . To summarise we have

We have now started to fill in our SVD. The number is the first singular value of and the vectors and will be the first columns of the matrices and respectively.

Now suppose that we have found the first singular values and the first columns of and . If , then we are done. Otherwise we repeat a similar process.

Let and be the first columns of and . The vectors split into two subspaces. These subspaces are and , the orthogonal compliment of . By restricting to we get a new linear map . Like before, the operator norm of is defined to be

.

Since we must have

The set is a compact set and thus there exists a vector such that . As before define and if . If , then define to be any vector in that is orthogonal to .

This process repeats until eventually and we have produced matrices and . In the next section, we will argue that these three matrices satisfy the properties of the SVD.

### Correctness

The defining properties of the SVD were given at the start of this post. We will see that most of the properties follow immediately from the construction but one of them requires a bit more analysis. Let , and be the output from the above construction.

First note that by construction are orthogonal since we always had . It follows that the matrix is orthogonal and so .

The matrix is diagonal by construction. Furthermore, we have that for every . This is because both and were defined as maximum value of over different subsets of . The subset for contained the subset for and thus .

We’ll next verify that . Since is orthogonal, the vectors form an orthonormal basis for . It thus suffices to check that for . Again by the orthogonality of we have that , the standard basis vector. Thus,

Above, we used that was a diagonal matrix and that is the column of . If , then by definition. If , then and so also. Thus, in either case, and so .

The last property we need to verify is that is orthogonal. Note that this isn’t obvious. At each stage of the process, we made sure that . However, in the case that , we simply defined . It is not clear why this would imply that is orthogonal to .

It turns out that a geometric argument is needed to show this. The idea is that if was not orthogonal to for some , then couldn’t have been the value that maximises .

Let and be two columns of with and . We wish to show that . To show this we will use the fact that and are orthonormal and perform “polar-interpolation“. That is, for , define

Since and are orthogonal, we have that

Furthermore is orthogonal to . Thus, by definition of ,

By the linearity of and the definitions of ,

.

Since and , we have

Rearranging and dividing by gives,

for all

Taking gives . Performing the same polar interpolation with shows that and hence .

We have thus proved that is orthogonal. This proof is pretty “slick” but it isn’t very illuminating. To better demonstrate the concept, I made an interactive Desmos graph that you can access here.

This graph shows example vectors . The vector is fixed at and a quarter circle of radius is drawn. Any vectors that are outside this circle have .

The vector can be moved around inside this quarter circle. This can be done either cby licking and dragging on the point or changing that values of and on the left. The red curve is the path of

.

As goes from to , the path travels from to .

Note that there is a portion of the red curve near that is outside the black circle. This corresponds to a small value of that results in contradicting the definition of . By moving the point around in the plot you can see that this always happens unless lies exactly on the y-axis. That is, unless is orthogonal to .