The singular value decomposition (SVD) is a powerful matrix decomposition that is used all the time in statistics and numerical linear algebra. The SVD is at the heart of principal component analysis, it demonstrates what’s going on in ridge regression, and it is one way to construct the Moore-Penrose inverse of a matrix. For more SVD love, see the tweets below.
In this post I’ll define the SVD and prove that it always exists. At the end we’ll look at some pictures to better understand what’s going on.
Definition
Let $A$ be an $n \times m$ matrix. We will define the singular value decomposition first in the case $n \ge m$. The SVD consists of three matrices $U$, $\Sigma$ and $V$ such that $A = U \Sigma V^T$. The matrix $\Sigma$ is required to be diagonal with non-negative diagonal entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_m \ge 0$. These numbers are called the singular values of $A$. The matrices $U$ and $V$ are required to be orthogonal in the sense that $U^T U = V^T V = I_m$, the $m \times m$ identity matrix. Note that since $V$ is square we also have $V V^T = I_m$, however we won’t have $U U^T = I_n$ unless $n = m$.
In the case when $n < m$, we can define the SVD of $A$ in terms of the SVD of $A^T$. Let $U'$, $\Sigma'$ and $V'$ be the SVD of $A^T$ so that $A^T = U' \Sigma' (V')^T$. The SVD of $A$ is then given by transposing both sides of this equation, giving $A = V' (\Sigma')^T (U')^T = V' \Sigma' (U')^T$ and so $U = V'$, $\Sigma = \Sigma'$ and $V = U'$.
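Before getting into the construction, here is a small numpy sketch that checks these defining properties on a random example (it uses `numpy.linalg.svd`, which computes the same decomposition; the particular matrix and its dimensions are arbitrary choices).

```python
import numpy as np

# A random n x m example with n >= m, as in the definition above.
rng = np.random.default_rng(0)
n, m = 5, 3
A = rng.standard_normal((n, m))

# numpy's "thin" SVD returns U (n x m), the singular values in decreasing
# order, and V^T (m x m).
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T
Sigma = np.diag(sigma)

assert np.allclose(A, U @ Sigma @ V.T)       # A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(m))       # U^T U = I_m
assert np.allclose(V.T @ V, np.eye(m))       # V^T V = I_m
assert np.allclose(V @ V.T, np.eye(m))       # V is square, so V V^T = I_m too
assert not np.allclose(U @ U.T, np.eye(n))   # but U U^T != I_n since n > m
assert np.all(sigma >= 0) and np.all(np.diff(sigma) <= 0)
```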
Construction
The SVD of a matrix can be found by iteratively solving an optimisation problem. We will first describe an iterative procedure that produces matrices $U$, $\Sigma$ and $V$. We will then verify that $U$, $\Sigma$ and $V$ satisfy the defining properties of the SVD.
We will construct the matrices $U$ and $V$ one column at a time and we will construct the diagonal matrix $\Sigma$ one entry at a time. To construct the first columns and entries, recall that the matrix $A$ is really a linear function from $\mathbb{R}^m$ to $\mathbb{R}^n$ given by $x \mapsto Ax$. We can thus define the operator norm of $A$ via

$$\|A\|_{op} = \sup\{\|Ax\| : x \in \mathbb{R}^m, \|x\| = 1\},$$

where $\|x\|$ represents the Euclidean norm of $x \in \mathbb{R}^m$ and $\|Ax\|$ is the Euclidean norm of $Ax \in \mathbb{R}^n$. The set of vectors $\{x \in \mathbb{R}^m : \|x\| = 1\}$ is a compact set and the function $x \mapsto \|Ax\|$ is continuous. Thus, the supremum used to define $\|A\|_{op}$ is achieved at some vector $v_1 \in \mathbb{R}^m$. Define $\sigma_1 = \|Av_1\| = \|A\|_{op}$. If $\sigma_1 \neq 0$, then define $u_1 = Av_1/\sigma_1$. If $\sigma_1 = 0$, then define $u_1$ to be an arbitrary vector in $\mathbb{R}^n$ with $\|u_1\| = 1$. To summarise we have

$$\sigma_1 = \|A\|_{op} = \|Av_1\|, \quad Av_1 = \sigma_1 u_1, \quad \|v_1\| = \|u_1\| = 1.$$

We have now started to fill in our SVD. The number $\sigma_1$ is the first singular value of $A$ and the vectors $v_1$ and $u_1$ will be the first columns of the matrices $V$ and $U$ respectively.
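This first step is easy to mimic numerically. The sketch below approximates the maximiser $v_1$ by power iteration on $A^TA$ (a standard numerical substitute for the compactness argument above, not something the proof relies on) and then reads off $\sigma_1$ and $u_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

# Approximate the unit vector maximising ||Ax|| by power iteration on A^T A.
x = rng.standard_normal(A.shape[1])
x /= np.linalg.norm(x)
for _ in range(1000):
    x = A.T @ (A @ x)
    x /= np.linalg.norm(x)

v1 = x
sigma1 = np.linalg.norm(A @ v1)      # sigma_1 = ||A v_1|| = ||A||_op
u1 = A @ v1 / sigma1                 # first column of U (sigma_1 > 0 here)

# sigma_1 agrees with the largest singular value reported by numpy.
assert np.isclose(sigma1, np.linalg.svd(A, compute_uv=False)[0])
```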
Now suppose that we have found the first $k$ singular values $\sigma_1 \ge \dots \ge \sigma_k$ and the first $k$ columns of $U$ and $V$. If $k = m$, then we are done. Otherwise we repeat a similar process.

Let $v_1, \dots, v_k$ and $u_1, \dots, u_k$ be the first $k$ columns of $V$ and $U$. The vectors $v_1, \dots, v_k$ split $\mathbb{R}^m$ into two subspaces. These subspaces are $V_k = \operatorname{span}(v_1, \dots, v_k)$ and $V_k^\perp$, the orthogonal complement of $V_k$. By restricting $A$ to $V_k^\perp$ we get a new linear map $A|_{V_k^\perp} : V_k^\perp \to \mathbb{R}^n$. Like before, the operator norm of $A|_{V_k^\perp}$ is defined to be $\|A|_{V_k^\perp}\|_{op} = \sup\{\|Ax\| : x \in V_k^\perp, \|x\| = 1\}$.
Since $V_k^\perp \subseteq \mathbb{R}^m$, we must have

$$\|A|_{V_k^\perp}\|_{op} \le \|A\|_{op}.$$

The set $\{x \in V_k^\perp : \|x\| = 1\}$ is a compact set and thus there exists a vector $v_{k+1} \in V_k^\perp$ with $\|v_{k+1}\| = 1$ such that $\|Av_{k+1}\| = \|A|_{V_k^\perp}\|_{op}$. As before define $\sigma_{k+1} = \|Av_{k+1}\|$ and $u_{k+1} = Av_{k+1}/\sigma_{k+1}$ if $\sigma_{k+1} \neq 0$. If $\sigma_{k+1} = 0$, then define $u_{k+1}$ to be any unit vector in $\mathbb{R}^n$ that is orthogonal to $u_1, \dots, u_k$.

This process repeats until eventually $k = m$ and we have produced matrices $U$, $\Sigma$ and $V$. In the next section, we will argue that these three matrices satisfy the properties of the SVD.
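The whole procedure also fits in a few lines of numpy. In the sketch below each maximisation is again done by power iteration, this time projected onto the orthogonal complement of the $v$'s found so far; the random example has full column rank, so the $\sigma_{k+1} = 0$ branch of the construction is never needed.

```python
import numpy as np

def greedy_svd(A, iters=2000, seed=0):
    """Build U, Sigma and V one column at a time, as in the construction above:
    v_{k+1} (approximately) maximises ||Ax|| over unit vectors orthogonal to
    v_1, ..., v_k.  Assumes A has full column rank, so sigma_k > 0 throughout."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    U_cols, sigmas, V_cols = [], [], []
    for k in range(m):
        if V_cols:
            Vk = np.column_stack(V_cols)
            P = np.eye(m) - Vk @ Vk.T        # projector onto span(v_1,...,v_k)^perp
        else:
            P = np.eye(m)
        x = P @ rng.standard_normal(m)
        x /= np.linalg.norm(x)
        for _ in range(iters):               # projected power iteration on A^T A
            x = P @ (A.T @ (A @ x))
            x /= np.linalg.norm(x)
        sigma = np.linalg.norm(A @ x)
        V_cols.append(x)
        sigmas.append(sigma)
        U_cols.append(A @ x / sigma)         # u_{k+1} = A v_{k+1} / sigma_{k+1}
    return np.column_stack(U_cols), np.array(sigmas), np.column_stack(V_cols)

A = np.random.default_rng(2).standard_normal((6, 4))
U, sigma, V = greedy_svd(A)
assert np.allclose(A, U @ np.diag(sigma) @ V.T, atol=1e-5)   # A = U Sigma V^T
assert np.allclose(V.T @ V, np.eye(4), atol=1e-5)            # V orthogonal
assert np.allclose(U.T @ U, np.eye(4), atol=1e-5)            # the non-obvious property
```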
Correctness
The defining properties of the SVD were given at the start of this post. We will see that most of the properties follow immediately from the construction but one of them requires a bit more analysis. Let $U$, $\Sigma$ and $V$ be the output from the above construction.

First note that by construction the vectors $v_1, \dots, v_m$ are orthonormal since we always had $v_{k+1} \in V_k^\perp = \operatorname{span}(v_1, \dots, v_k)^\perp$ and $\|v_{k+1}\| = 1$. It follows that the matrix $V$ is orthogonal and so $V^TV = VV^T = I_m$.
The matrix $\Sigma$ is diagonal by construction. Furthermore, we have that $\sigma_{k+1} \le \sigma_k$ for every $k$. This is because both $\sigma_k$ and $\sigma_{k+1}$ were defined as the maximum value of $\|Ax\|$ over different subsets of $\mathbb{R}^m$. The subset for $\sigma_{k+1}$ was contained in the subset for $\sigma_k$ and thus $\sigma_{k+1} \le \sigma_k$.
We’ll next verify that $A = U\Sigma V^T$. Since $V$ is orthogonal, the vectors $v_1, \dots, v_m$ form an orthonormal basis for $\mathbb{R}^m$. It thus suffices to check that $Av_k = U\Sigma V^Tv_k$ for $k = 1, \dots, m$. Again by the orthogonality of $V$ we have that $V^Tv_k = e_k$, the $k$-th standard basis vector. Thus,

$$U\Sigma V^Tv_k = U\Sigma e_k = \sigma_k U e_k = \sigma_k u_k.$$

Above, we used that $\Sigma$ was a diagonal matrix and that $u_k$ is the $k$-th column of $U$. If $\sigma_k \neq 0$, then $Av_k = \sigma_k u_k$ by definition. If $\sigma_k = 0$, then $\|Av_k\| = \sigma_k = 0$ and so $Av_k = 0 = \sigma_k u_k$ also. Thus, in either case, $Av_k = \sigma_k u_k = U\Sigma V^Tv_k$ and so $A = U\Sigma V^T$.
The last property we need to verify is that $U$ is orthogonal. Note that this isn’t obvious. At each stage of the process, we made sure that $v_{k+1}$ was orthogonal to $v_1, \dots, v_k$. However, in the case that $\sigma_{k+1} \neq 0$, we simply defined $u_{k+1} = Av_{k+1}/\sigma_{k+1}$. It is not clear why this would imply that $u_{k+1}$ is orthogonal to $u_1, \dots, u_k$.

It turns out that a geometric argument is needed to show this. The idea is that if $Av_j$ was not orthogonal to $u_i$ for some $i < j$, then $v_i$ couldn’t have been the vector that maximises $\|Ax\|$.
Let $u_i$ and $u_j$ be two columns of $U$ with $i < j$ and $\sigma_i, \sigma_j \neq 0$ (if $\sigma_j = 0$, then $u_j$ is orthogonal to $u_i$ by construction). We wish to show that $\langle u_i, u_j \rangle = 0$. To show this we will use the fact that $v_i$ and $v_j$ are orthonormal and perform “polar-interpolation”. That is, for $\lambda \in [0,1]$, define

$$w(\lambda) = \sqrt{1-\lambda}\,v_i + \sqrt{\lambda}\,v_j.$$

Since $v_i$ and $v_j$ are orthogonal, we have that

$$\|w(\lambda)\|^2 = (1-\lambda)\|v_i\|^2 + \lambda\|v_j\|^2 = 1.$$

Furthermore, $w(\lambda)$ is orthogonal to $v_1, \dots, v_{i-1}$. Thus, by definition of $\sigma_i$,

$$\|Aw(\lambda)\| \le \sigma_i.$$

By the linearity of $A$ and the definitions of $u_i$ and $u_j$, $Aw(\lambda) = \sqrt{1-\lambda}\,\sigma_i u_i + \sqrt{\lambda}\,\sigma_j u_j$.

Since $\|u_i\| = \|u_j\| = 1$ and $\|Aw(\lambda)\| \le \sigma_i$, we have

$$\sigma_i^2 \ge \|Aw(\lambda)\|^2 = (1-\lambda)\sigma_i^2 + 2\sqrt{\lambda(1-\lambda)}\,\sigma_i\sigma_j\langle u_i, u_j \rangle + \lambda\sigma_j^2.$$

Rearranging and dividing by $\lambda$ gives,

$$2\sqrt{\tfrac{1-\lambda}{\lambda}}\,\sigma_i\sigma_j\langle u_i, u_j \rangle \le \sigma_i^2 - \sigma_j^2 \quad \text{for all } \lambda \in (0,1].$$

Taking $\lambda \to 0^+$ gives $\langle u_i, u_j \rangle \le 0$. Performing the same polar interpolation with $-v_j$ in place of $v_j$ shows that $\langle u_i, u_j \rangle \ge 0$ and hence $\langle u_i, u_j \rangle = 0$.
We have thus proved that $U$ is orthogonal. This proof is pretty “slick” but it isn’t very illuminating. To better demonstrate the concept, I made an interactive Desmos graph that you can access here.
This graph shows example vectors $Av_i = \sigma_i u_i$ and $Av_j = \sigma_j u_j$. The vector $Av_i$ is fixed at $(\sigma_i, 0)$ and a quarter circle of radius $\sigma_i$ is drawn. Any vectors $x$ that are outside this circle have $\|x\| > \sigma_i$.

The vector $Av_j$ can be moved around inside this quarter circle. This can be done either by clicking and dragging on the point or by changing its coordinates on the left. The red curve is the path of $Aw(\lambda) = \sqrt{1-\lambda}\,Av_i + \sqrt{\lambda}\,Av_j$. As $\lambda$ goes from $0$ to $1$, the path travels from $Av_i$ to $Av_j$.

Note that there is a portion of the red curve near $Av_i$ that is outside the black circle. This corresponds to a small value of $\lambda$ that results in $\|Aw(\lambda)\| > \sigma_i$, contradicting the definition of $\sigma_i$. By moving the point $Av_j$ around in the plot you can see that this always happens unless $Av_j$ lies exactly on the y-axis. That is, unless $u_j$ is orthogonal to $u_i$.