How to Bake Pi, Sherman-Morrison and log-sum-exp

A few months ago, I had the pleasure of reading Eugenia Cheng‘s book How to Bake Pi. Each chapter starts with a recipe which Cheng links to the mathematical concepts contained in the chapter. The book is full of interesting connections between mathematics and the rest of the world.

One of my favourite ideas in the book is something Cheng writes about equations and the humble equals sign: =. She explains that when an equation says two things are equal we very rarely mean that they are exactly the same thing. What we really mean is that the two things are the same in some ways even though they may be different in others.

One example that Cheng gives is the equation a + b = b+a. This is such a familiar statement that you might really think that a+b and b+a are the same thing. Indeed, if a and b are any numbers, then the number you get when you calculate a + b is the same as the number you get when you calculate b + a. But calculating a+b could be very different from calculating b+a. A young child might calculate a+b by starting with a and then counting one-by-one from a to a + b. If a is 1 and b is 20, then calculating a + b requires counting from 1 to 21 but calculating b+a simply amounts to counting from 20 to 21. The first process takes way longer than the second and the child might disagree that 1 + 20 is the same as 20 + 1.

In How to Bake Pi, Cheng explains that a crucial idea behind equality is context. When someone says that two things are equal we really mean that they are equal in the context we care about. Cheng talks about how context is crucial through-out mathematics and introduces a little bit of category theory as a tool for moving between different contexts. I think that this idea of context is really illuminating and I wanted to share some examples where “=” doesn’t mean “exactly the same as”.

The Sherman-Morrison formula

The Sherman-Morrison formula is a result from linear algebra that says for any invertible matrix A \in \mathbb{R}^{n\times n} and any pair of vectors u,v \in \mathbb{R}^n, if v^TA^{-1}u \neq -1, then A + uv^T is invertible and

(A + uv^T)^{-1} = A^{-1} + \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}

Here “=” means the following:

  1. You can take any natural number n, any matrix A of size n by n, and any length n-vectors u and v that satisfy the above condition.
  2. If you take all those things and carry out all the matrix multiplications, additions and inversions on the left and all the matrix multiplications, additions and inversions on the right, then you will end up with exactly the same matrix in both cases.

But depending on the context, the equation on one side of “=” may be much easier than the other. Although the right hand side looks a lot more complicated, it is much easier to compute in one important context. This context is when we have already calculated the matrix A^{-1} and now want the inverse of A + uv^T. The left hand side naively computes A + uv^T which takes O(n^3) computations since we have to invert a n \times n matrix. On the right hand side, we only need to compute a small number of matrix-vector products and then add two matrices together. This bring the computational cost down to O(n^2).

These cost saving measures come up a lot when studying linear regression. The Sherman-Morrison formula can be used to update regression coefficients when a new data point is added. Similarly, the Sherman-Morrison formula can be used to quickly calculate the fitted values in leave-one-out cross validation.

log-sum-exp

This second example also has connections to statistics. In a mixture model, we assume that each data point Y comes from a distribution of the form:

p(y|\pi,\theta) = \sum_{k=1}^K \pi_k p(y | \theta_k),

where \pi is a vector and \pi_k is equal to the probability that Y came from class k. The parameters \theta_k \in\mathbb{R}^p are the parameters for the k^{th} group.The log-likelihood is thus,

\log\left(\sum_{k=1}^K \pi_k p(y | \theta_k)\right) = \log\left(\sum_{k=1}^K \exp(\eta_{k})\right),

where \eta_{k} = \log(\pi_k p(y| \theta_k)). We can see that the log-likelihood is of the form log-sum-exp. Calculating a log-sum-exp can cause issues with numerical stability. For instance if K = 3 and \eta_k = 1000, for all k=1,2,3, then the final answer is simply \log(3)+1000. However, as soon as we try to calculate \exp(1000) on a computer, we’ll be in trouble.

The solution is to use the following equality, for any \beta \in \mathbb{R},

\log\left(\sum_{k=1}^K \exp(\eta_k) \right) = \beta + \log\left(\sum_{k=1}^K \exp(\beta - \eta_k)\right).

Proving the above identity is a nice exercise in the laws of logarithm’s and exponential’s, but with a clever choice of \beta we can more safely compute the log-sum-exp expression. For instance, in the documentation for pytorch’s implementation of logsumexp() they take \beta to be the maximum of \eta_k. This (hopefully) makes each of the terms \beta - \eta_k a reasonable size and avoids any numerical issues.

Again, the left and right hand sides of the above equation might be the same number, but in the context of having to use computers with limited precision, they represent very different calculations.

Beyond How to Bake Pi

Eugenia Cheng has recently published a new book called The Joy of Abstraction. I’m just over half way through and it’s been a really engaging and interesting introduction to category theory. I’m looking forward to reading the rest of it and getting more insight from Eugenia Cheng’s great mathematical writing.

Looking back on the blog

Next week I’ll be starting the second year of my statistics PhD. I’ve learnt a lot from taking the first year classes and studying for the qualifying exams. Some of what I’ve learnt has given me a new perspective on some of my old blog posts. Here are three things that I’ve written about before and I now understand better.

1. The Pi-Lambda Theorem

An early post on this blog was titled “A minimal counterexample in probability theory“. The post was about a theorem from the probability course offered at the Australian National University. The theorem states that if two probability measures agree on a collection of subsets and the collection is closed under finite intersections, then the two probability measures agree on the \sigma-algebra generated by the collection. In my post I give an example which shows that you need the collection to be closed under finite intersections. I also show that you need to have at least four points in the space to find such an example.

What I didn’t know then is that the above theorem is really a corollary of Dykin’s \pi - \lambda theorem. This theorem was proved in my first graduate probability course which was taught by Persi Diaconis. Professor Diaconis kept a running tally of how many times we used the \pi - \lambda theorem in his course and we got up to at least 10. (For more appearances by Professor Diaconis on this blog see here, here and here).

If I were to write the above post again, I would talk about the \pi - \lambda theorem and rename the post “The smallest \lambda-system”. The example given in my post is really about needing at least four points to find a \lambda-system that is not a \sigma-algebra.

2. Mallow’s Cp statistic

The very first blog post I wrote was called “Complexity Penalties in Statistical Learning“. I wasn’t sure if I would write a second and so I didn’t set up a WordPress account. I instead put the post on the website LessWrong. I no longer associate myself with the rationality community but posting to LessWrong was straight forward and helped me reach more people.

The post was inspired in two ways by the 2019 AMSI summer school. First, the content is from the statistical learning course I took at the summer school. Second, at the career fair many employers advised us to work on our writing skills. I don’t know if would have started blogging if not for the AMSI Summer School.

I didn’t know it at the time but the blog post is really about Mallow’s Cp statistic. Mallow’s Cp statistic is an estimate of the test error of a regression model fit using ordinary least squares. The Mallow’s Cp is equal to the training error plus a “complexity penalty” which takes into account the number of parameters. In the blog post I talk about model complexity and over-fitting. I also write down and explain Mallow’s Cp in the special case of polynomial regression.

In the summer school course I took, I don’t remember the name Mallow’s Cp being used but I thought it was a great idea and enjoyed writing about it. The next time encountered Mallow’s Cp was in the linear models course I took last fall. I was delighted to see it again and learn how it fit into a bigger picture. More recently, I read Brad Efron’s paper “How Biased is the Apparent Error Rate of a Prediction Rule?“. The paper introduces the idea of “effective degrees of freedom” and expands on the ideas behind the Cp statistic.

Incidentally, enrolment is now open for the 2023 AMSI Summer School! This summer it will be hosted at the University of Melbourne. I encourage any Australia mathematics or statistics students reading this to take a look and apply. I really enjoyed going in both 2019 and 2020. (Also if you click on the above link you can try to spot me in the group photo of everyone wearing read shirts!)

3. Finitely additive probability measures

In “Finitely additive measures” I talk about how hard it is to define a finitely additive measure on the set of natural numbers that is not countably additive. In particular, I talked about needing to use the Hahn — Banach extension theorem to extend the natural density from the collection of sets with density to the collection of all subsets of the natural numbers.

There were a number of homework problems in my first graduate probability course that relate to this post. We proved that the sets with density are not closed under finite unions and we showed that the square free numbers have density 6/\pi^2.

We also proved that any finite measure defined on an algebra of subsets can be extend to the collection of all subsets. This proof used Zorn’s lemma and the resulting measure is far from unique. The use of Zorn’s lemma relates to the main idea in my blog, that defining an additive probability measure is in some sense non-constructive.

Other posts

Going forward, I hope to continue publishing at least one new post every month. I look forward to one day writing another post like this when I can look back and reflect on how much I have learnt.

Sampling from the non-central chi-squared distribution

The non-central chi-squared distribution is a generalisation of the regular chi-squared distribution. The chi-squared distribution turns up in many statistical tests as the (approximate) distribution of a test statistic under the null hypothesis. Under alternative hypotheses, those same statistics often have approximate non-central chi-squared distributions.

This means that the non-central chi-squared distribution is often used to study the power of said statistical tests. In this post I give the definition of the non-central chi-squared distribution, discuss an important invariance property and show how to efficiently sample from this distribution.

Definition

Let Z be a normally distributed random vector with mean 0 and covariance I_n. Given a vector \mu \in \mathbb{R}^n, the non-central chi-squared distribution with n degrees of freedom and non-centrality parameter \Vert \mu\Vert_2^2 is the distribution of the quantity

\Vert Z+\mu \Vert_2^2 = \sum\limits_{i=1}^n (Z_i+\mu_i)^2.

This distribution is denoted by \chi^2_n(\Vert \mu \Vert_2^2). As this notation suggests, the distribution of \Vert Z+\mu \Vert_2^2 depends only on \Vert \mu \Vert_2^2, the norm of \mu. The first few times I heard this fact, I had no idea why it would be true (and even found it a little spooky). But, as we will see below, the result is actually a simply consequence of the fact that standard normal vectors are invariant under rotations.

Rotational invariance

Suppose that we have two vectors \mu, \nu \in \mathbb{R}^n such that \Vert \mu\Vert_2^2 = \Vert \nu \Vert_2^2. We wish to show that if Z \sim \mathcal{N}(0,I_n), then

\Vert Z+\mu \Vert_2^2 has the same distribution as \Vert Z + \nu \Vert_2^2.

Since \mu and \nu have the same norm there exists an orthogonal matrix U \in \mathbb{R}^{n \times n} such that U\mu = \nu. Since U is orthogonal and Z \sim \mathcal{N}(0,I_n), we have Z'=UZ \sim \mathcal{N}(U0,UU^T) = \mathcal{N}(0,I_n). Furthermore, since U is orthogonal, U preserves the norm \Vert \cdot \Vert_2^2. This is because, for all x \in \mathbb{R}^n,

\Vert Ux\Vert_2^2 = (Ux)^TUx = x^TU^TUx=x^Tx=\Vert x\Vert_2^2.

Putting all these pieces together we have

\Vert Z+\mu \Vert_2^2 = \Vert U(Z+\mu)\Vert_2^2 = \Vert UZ + U\mu \Vert_2^2 = \Vert Z'+\nu \Vert_2^2.

Since Z and Z' have the same distribution, we can conclude that \Vert Z'+\nu \Vert_2^2 has the same distribution as \Vert Z + \nu \Vert. Since \Vert Z + \mu \Vert_2^2 = \Vert Z'+\nu \Vert_2^2, we are done.

Sampling

Above we showed that the distribution of the non-central chi-squared distribution, \chi^2_n(\Vert \mu\Vert_2^2) depends only on the norm of the vector \mu. We will now use this to provide an algorithm that can efficiently generate samples from \chi^2_n(\Vert \mu \Vert_2^2).

A naive way to sample from \chi^2_n(\Vert \mu \Vert_2^2) would be to sample n independent standard normal random variables Z_i and then return \sum_{i=1}^n (Z_i+\mu_i)^2. But for large values of n this would be very slow as we have to simulate n auxiliary random variables Z_i for each sample from \chi^2_n(\Vert \mu \Vert_2^2). This approach would not scale well if we needed many samples.

An alternative approach uses the rotation invariance described above. The distribution \chi^2_n(\Vert \mu \Vert_2^2) depends only on \Vert \mu \Vert_2^2 and not directly on \mu. Thus, given \mu, we could instead work with \nu = \Vert \mu \Vert_2 e_1 where e_1 is the vector with a 1 in the first coordinate and 0s in all other coordinates. If we use \nu instead of \mu, we have

\sum\limits_{i=1}^n (Z_i+\nu_i)^2 = (Z_1+\Vert \mu \Vert_2)^2 + \sum\limits_{i=2}^{n}Z_i^2.

The sum \sum_{i=2}^n Z_i^2 follows the regular chi-squared distribution with n-1 degrees of freedom and is independent of Z_1. The regular chi-squared distribution is a special case of the gamma distribution and can be effectively sampled with rejection sampling for large shape parameter (see here).

The shape parameter for \sum_{i=2}^n Z_i^2 is \frac{n-1}{2}, so for large values of n we can efficiently sample a value Y that follows that same distribution as \sum_{i=2}^n Z_i^2 \sim \chi^2_{n-1}. Finally to get a sample from \chi^2_n(\Vert \mu \Vert_2^2) we independently sample Z_1, and then return the sum (Z_1+\Vert \mu\Vert_2)^2 +Y.

Conclusion

In this post, we saw that the rotational invariance of the standard normal distribution gives a similar invariance for the non-central chi-squared distribution.

This invariance allowed us to efficiently sample from the non-central chi-squared distribution. The sampling procedure worked by reducing the problem to sampling from the regular chi-squared distribution.

The same invariance property is also used to calculate the cumulative distribution function and density of the non-central chi-squared distribution. Although the resulting formulas are not for the faint of heart.

The Singular Value Decomposition (SVD)

The singular value decomposition (SVD) is a powerful matrix decomposition. It is used all the time in statistics and numerical linear algebra. The SVD is at the heart of the principal component analysis, it demonstrates what’s going on in ridge regression and it is one way to construct the Moore-Penrose inverse of a matrix. For more SVD love, see the tweets below.

A tweet by Women in Statistics and Data Science about the SVD.
The full thread is here https://twitter.com/WomenInStat/status/1285610321747611653?s=20&t=Elj62mGSc9gINHvbwt82Zw

In this post I’ll define the SVD and prove that it always exists. At the end we’ll look at some pictures to better understand what’s going on.

Definition

Let X be a n \times p matrix. We will define the singular value decomposition first in the case n \ge p. The SVD consists of three matrix U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p} and V \in \mathbb{R}^{p \times p} such that X = U\Sigma V^T. The matrix \Sigma is required to be diagonal with non-negative diagonal entries \sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_p \ge 0. These numbers are called the singular values of X. The matrices U and V are required to orthogonal matrices so that U^TU=V^TV = I_p, the p \times p identity matrix. Note that since V is square we also have VV^T=I_p however we won’t have UU^T = I_n unless n = p.

In the case when n \le p, we can define the SVD of X in terms of the SVD of X^T. Let \widetilde{U} \in \mathbb{R}^{p \times n}, \widetilde{\Sigma} \in \mathbb{R}^{n \times n} and \widetilde{V} \in \mathbb{R}^{n \times n} be the SVD of X^T so that X^T=\widetilde{U}\widetilde{\Sigma}\widetilde{V}^T. The SVD of X is then given by transposing both sides of this equation giving U = \widetilde{V}, \Sigma = \widetilde{\Sigma}^T=\widetilde{\Sigma} and V = \widetilde{U}.

Construction

The SVD of a matrix can be found by iteratively solving an optimisation problem. We will first describe an iterative procedure that produces matrices U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p} and V \in \mathbb{R}^{p \times p}. We will then verify that U,\Sigma and V satisfy the defining properties of the SVD.

We will construct the matrices U and V one column at a time and we will construct the diagonal matrix \Sigma one entry at a time. To construct the first columns and entries, recall that the matrix X is really a linear function from \mathbb{R}^p to \mathbb{R}^n given by v \mapsto Xv. We can thus define the operator norm of X via

\Vert X \Vert = \sup\left\{ \|Xv\|_2 : \|v\|_2 =1\right\},

where \|v\|_2 represents the Euclidean norm of v \in \mathbb{R}^p and \|Xv\|_2 is the Euclidean norm of Xv \in \mathbb{R}^n. The set of vectors \{v \in \mathbb{R} : \|v\|_2 = 1 \} is a compact set and the function v \mapsto \|Xv\|_2 is continuous. Thus, the supremum used to define \Vert X \Vert is achieved at some vector v_1 \in \mathbb{R}^p. Define \sigma_1 = \|X v_1\|_2. If \sigma_1 \neq 0, then define u_1 = Xv_1/\sigma_1 \in \mathbb{R}^n. If \sigma_1 = 0, then define u_1 to be an arbitrary vector in \mathbb{R}^n with \|u\|_2 = 1. To summarise we have

  • v_1 \in \mathbb{R}^p with \|v_1\|_2 = 1.
  • \sigma_1 = \|X\| = \|Xv_1\|_2.
  • u_1 \in \mathbb{R}^n with \|u_1\|_2=1 and Xv_1 = \sigma_1u_1.

We have now started to fill in our SVD. The number \sigma_1 \ge 0 is the first singular value of X and the vectors v_1 and u_1 will be the first columns of the matrices V and U respectively.

Now suppose that we have found the first k singular values \sigma_1,\ldots,\sigma_k and the first k columns of V and U. If k = p, then we are done. Otherwise we repeat a similar process.

Let v_1,\ldots,v_k and u_1,\ldots,u_k be the first k columns of V and U. The vectors v_1,\ldots,v_k split \mathbb{R}^p into two subspaces. These subspaces are S_1 = \text{span}\{v_1,\ldots,v_k\} and S_2 = S_1^\perp, the orthogonal compliment of S_1. By restricting X to S_2 we get a new linear map X_{|S_2} : S_2 \to \mathbb{R}^n. Like before, the operator norm of X_{|S_2} is defined to be

\|X_{|S_2}\| = \sup\{\|X_{|S_2}v\|_2:v \in S_2, \|v\|_2=1\}.

Since S_2 = \text{span}\{v_1,\ldots,v_k\}^\perp we must have

\|X_{|S_2}\| =  \sup\{\|Xv\|_2:v \in \mathbb{R}^p, \|v\|_2=1, v_j^Tv = 0 \text{ for } j=1,\ldots,k\}.

The set \{v \in \mathbb{R}^p : \|v\|_2=1, v_j^Tv=0\text{ for } j=1,\ldots,k\} is a compact set and thus there exists a vector v_{k+1} such that \|Xv_{k+1}\|_2 = \|X_{|S_2}\|. As before define \sigma_{k+1} = \|Xv_{k+1}\|_2 and u_{k+1} = Xv_{k+1}/\sigma_{k+1} if \sigma_{k+1}\neq 0. If \sigma_{k+1} = 0, then define u_{k+1} to be any vector in \mathbb{R}^{n} that is orthogonal to u_1,u_2,\ldots,u_k.

This process repeats until eventually k = p and we have produced matrices U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p} and V \in \mathbb{R}^{p \times p}. In the next section, we will argue that these three matrices satisfy the properties of the SVD.

Correctness

The defining properties of the SVD were given at the start of this post. We will see that most of the properties follow immediately from the construction but one of them requires a bit more analysis. Let U = [u_1,\ldots,u_p], \Sigma = \text{diag}(\sigma_1,\ldots,\sigma_p) and V= [v_1,\ldots,v_p] be the output from the above construction.

First note that by construction v_1,\ldots, v_p are orthogonal since we always had v_{k+1} \in \text{span}\{v_1,\ldots,v_k\}^\perp. It follows that the matrix V is orthogonal and so V^TV=VV^T=I_p.

The matrix \Sigma is diagonal by construction. Furthermore, we have that \sigma_{k+1} \le \sigma_k for every k. This is because both \sigma_k and \sigma_{k+1} were defined as maximum value of \|Xv\|_2 over different subsets of \mathbb{R}^p. The subset for \sigma_k contained the subset for \sigma_{k+1} and thus \sigma_k \ge \sigma_{k+1}.

We’ll next verify that X = U\Sigma V^T. Since V is orthogonal, the vectors v_1,\ldots,v_p form an orthonormal basis for \mathbb{R}^p. It thus suffices to check that Xv_k = U\Sigma V^Tv_k for k = 1,\ldots,p. Again by the orthogonality of V we have that V^Tv_k = e_k, the k^{th} standard basis vector. Thus,

U\Sigma V^Tv_k = U\Sigma e_k = U\sigma_k e_k = \sigma_k u_k.

Above, we used that \Sigma was a diagonal matrix and that u_k is the k^{th} column of U. If \sigma_k \neq 0, then \sigma_k u_k = Xv_k by definition. If \sigma_k =0, then \|Xv_k\|_2=0 and so Xv_k = 0 = \sigma_ku_k also. Thus, in either case, U\Sigma V^Tv_k = Xv_k and so U\Sigma V^T = X.

The last property we need to verify is that U is orthogonal. Note that this isn’t obvious. At each stage of the process, we made sure that v_{k+1} \in \text{span}\{v_1,\ldots,v_k\}^\perp. However, in the case that \sigma_{k+1} \neq 0, we simply defined u_{k+1} = Xv_{k+1}/\sigma_{k+1}. It is not clear why this would imply that u_{k+1} is orthogonal to u_1,\ldots,u_k.

It turns out that a geometric argument is needed to show this. The idea is that if u_{k+1} was not orthogonal to u_j for some j \le k, then v_j couldn’t have been the value that maximises \|Xv\|_2.

Let u_{k} and u_j be two columns of U with j < k and \sigma_j,\sigma_k > 0. We wish to show that u_j^Tu_k = 0. To show this we will use the fact that v_j and v_k are orthonormal and perform “polar-interpolation“. That is, for \lambda \in [0,1], define

v_\lambda = \sqrt{1-\lambda}\cdot v_{j}-\sqrt{\lambda} \cdot v_k.

Since v_{j} and v_k are orthogonal, we have that

\|v_\lambda\|_2^2 = (1-\lambda)\|v_{j}\|_2^2+\lambda\|v_k\|_2^2 = (1-\lambda)+\lambda = 1.

Furthermore v_\lambda is orthogonal to v_1,\ldots,v_{j-1}. Thus, by definition of v_j,

\|Xv_\lambda\|_2^2 \le \sigma_j^2 = \|Xv_j\|_2^2.

By the linearity of X and the definitions of u_j,u_k,

\|Xv_\lambda\|_2^2 = \|\sqrt{1-\lambda}\cdot Xv_j+\sqrt{\lambda}\cdot Xv_{k+1}\|_2^2 = \|\sigma_j\sqrt{1-\lambda}\cdot u_j+\sigma_{k+1}\sqrt{\lambda}\cdot u_{k+1}\|_2^2.

Since Xv_j = \sigma_ju_j and Xv_{k+1}=\sigma_{k+1}u_{k+1}, we have

(1-\lambda)\sigma_j^2 + 2\sqrt{\lambda(1-\lambda)}\cdot \sigma_1\sigma_2 u_j^Tu_{k}+\lambda\sigma_k^2 = \|Xv_\lambda\|_2^2 \le \sigma_j^2

Rearranging and dividing by \sqrt{\lambda} gives,

2\sqrt{1-\lambda}\cdot \sigma_1\sigma_2 u_j^Tu_k \le \sqrt{\lambda}\cdot(\sigma_j^2-\sigma_k^2). for all \lambda \in (0,1]

Taking \lambda \searrow 0 gives u_j^Tu_k \le 0. Performing the same polar interpolation with v_\lambda' = \sqrt{1-\lambda}v_j - \sqrt{\lambda}v_k shows that -u_j^Tu_k \le 0 and hence u_j^Tu_k = 0.

We have thus proved that U is orthogonal. This proof is pretty “slick” but it isn’t very illuminating. To better demonstrate the concept, I made an interactive Desmos graph that you can access here.

This graph shows example vectors u_j, u_k \in \mathbb{R}^2. The vector u_j is fixed at (1,0) and a quarter circle of radius 1 is drawn. Any vectors u that are outside this circle have \|u\|_2 > 1 = \|u_j\|_2.

The vector u_k can be moved around inside this quarter circle. This can be done either cby licking and dragging on the point or changing that values of a and b on the left. The red curve is the path of

\lambda \mapsto \sqrt{1-\lambda}\cdot u_j+\sqrt{\lambda}\cdot u_k.

As \lambda goes from 0 to 1, the path travels from u_j to u_k.

Note that there is a portion of the red curve near u_j that is outside the black circle. This corresponds to a small value of \lambda > 0 that results in \|X v_\lambda\|_2 > \|Xv_j\|_2 contradicting the definition of v_j. By moving the point u_k around in the plot you can see that this always happens unless u_k lies exactly on the y-axis. That is, unless u_k is orthogonal to u_j.

Maximum likelihood estimation and the method of moments

Maximum likelihood and the method of moments are two ways to estimate parameters from data. In general, the two methods can differ but for one-dimensional exponential families they produce the same estimates.

Suppose that \{P_\theta\}_{\theta \in \Omega} is a one-dimensional exponential family written in canonical form. That is, \Omega \subseteq \mathbb{R} and there exists a reference measure \mu such each distribution P_\theta has a density p_\theta with respect to \mu and,

p_\theta(x) = h(x)\exp\left(\theta T(x)-A(\theta)\right).

The random variable T(X) is a sufficient statistic for the model X \sim P_\theta. The function A(\theta) is the log-partition function for the family \{P_\theta\}_{\theta \in \Omega}. The condition, \int p_\theta(x)\mu(dx)=1 implies that

A(\theta) = \log\left(\int h(x)\exp(\theta T(x))\mu(dx) \right).

It turns out that the function A(\theta) is differentiable and that differentiation and integration are exchangeable. This implies that

A'(\theta) = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\int h(x)\exp(\theta T(x))\mu(dx)} = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\exp(A(\theta))}

Note that \int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx) = \int T(x)h(x)\exp(\theta T(x))\mu(dx). Thus,

A'(\theta) = \int T(x)h(x) \exp(\theta T(x)-A(\theta))\mu(dx) = \int T(x)p_\theta(x)\mu(dx).

This means that A'(\theta) = \mathbb{E}_\theta[T(X)], the expectation of T(X) under X \sim P_\theta.

Now suppose that we have an i.i.d. sample X_1,\ldots, X_n \sim P_\theta and we want to use this sample to estimate \theta. One way to estimate \theta is by maximum likelihood. That is, we choose the value of \theta that maximises the likelihood,

L(\theta|X_1,\ldots,X_n) = \prod_{i=1}^n p_\theta(X_i).

When using the maximum likelihood estimator, it is often easier to work with the log-likelihood. The log-likelihood is,

\log L(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \log\left(p_\theta(X_i)\right) = \sum_{i=1}^n \log(h(X_i))+\theta T(X_i) - A(\theta).

Maximising the likelihood is equivalent to maximising the log-likelihood. For exponential families, the log-likelihood is a concave function of \theta. Thus the maximisers can be found be differentiation and solving the first order equations. Note that,

\frac{d}{d\theta} \log L(\theta|X_1,\ldots,X_n) =\sum_{i=1}^n T(X_i)-A'(\theta) = -nA'(\theta) + \sum_{i=1}^n T(X_i).

Thus the maximum likelihood estimate (MLE) \widehat{\theta} solves the equation,

-nA'(\widehat{\theta}) + \sum_{i=1}^n T(X_i) = 0.

But recall that A'(\widehat{\theta}) = \mathbb{E}_{\widehat{\theta}}[T(X)]. Thus the MLE is the solution to the equation,

\mathbb{E}_{\widehat{\theta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(X_i).

Thus the MLE is the value of \theta for which the expectation of T(X) matches the empirical average from our sample. That is, the maximum likelihood estimator for an exponential family is a method of moments estimator. Specifically, the maximum likelihood estimator matches the moments of the sufficient statistic T(X).

A counter example

It is a special property of maximum likelihood estimators that the MLE is a method of moments estimator for the sufficient statistic. When we leave the nice world of exponential families, the estimators may differ.

Suppose that we have data X_1,\ldots,X_n \sim P_\theta where P_\theta is the uniform distribution on [0,\theta]. A minimal sufficient statistic for this model is X_{(n)} – the maximum of X_1,\ldots, X_n. Given what we saw before, we might imague that the MLE for this model would be a method of moments estimator for X_{(n)} but this isn’t the case.

The likelihood for X_1,\ldots,X_n is,

L(\theta|X_1,\ldots,X_n) = \begin{cases} \frac{1}{\theta^n} & \text{if } X_{(n)} \le \theta,\\ 0 & \text{if } X_{(n)} > \theta. \end{cases}

Thus the MLE is \widehat{\theta} = X_{(n)}. However, under P_\theta, \frac{1}{\theta}X_{(n)} has a \text{Beta}(n,1) distribution. Thus, \mathbb{E}_\theta[X_{(n)}]  =   \theta \frac{n}{n+1} so the method of moments estimator would be \widehat{\theta}' = \frac{n+1}{n}X_{(n)} \neq X_{(n)} = \widehat{\theta}.

You could have proved the Neyman-Pearson lemma

The Neyman-Pearson lemma is foundational and important result in the theory of hypothesis testing. When presented in class the proof seemed magical and I had no idea where the ideas came from. I even drew a face like this 😲 next to the usual \square in my book when the proof was finished. Later in class we learnt the method of undetermined multipliers and suddenly I saw where the Neyman-Pearson lemma came from.

In this blog post, I’ll first give some background and set up notation for the Neyman-Pearson lemma. Then I’ll talk about the method of undetermined multipliers and show how it can be used to derive and prove the Neyman-Pearson lemma. Finally, I’ll write about why I still think the Neyman-Pearson lemma is magical despite the demystified proof.

Background

In the set up of the Neyman-Pearson lemma we have data x which is a realisation of some random variable X. We wish to conclude something about the distribution of X from our data x by doing a hypothesis test.

In the Neyman-Pearson lemma we have simple hypotheses. That is our data either comes from the distribution \mathbb{P}_0 or from the distribution \mathbb{P}_1. Thus, our null hypothesis is H_0 : X \sim \mathbb{P}_0 and our alternative hypothesis is H_1 : X \sim \mathbb{P}_1.

A test of H_0 against H_1 is a function \phi that takes in data x and returns a number \phi(x) \in [0,1]. The value \phi(x) is the probability under the test \phi of rejecting H_0 given the observed data x. That is, if \phi(x) = 1, we always reject H_0 and if \phi(x)=0 we never reject H_0. For in-between values \phi(x) = \gamma \in (0,1), we reject H_0 with probability \gamma.

An ideal test would have two desirable properties. We would like a test that rejects H_0 with a low probability when H_0 is true but we would also like the test to reject H_0 with a high probability when H_1 is true. To state this more formally, let \mathbb{E}_0[\phi(X)] and \mathbb{E}_1[\phi(X)] be the expectation of \phi(X) under H_0 and H_1 respectively. The quantity \mathbb{E}_0[\phi(X)] is the probability we reject H_0 when H_0 is true. Likewise, the quantity \mathbb{E}_1[\phi(X)] is the probability we reject H_0 when H_1 is true. An optimal test would be one that minimises \mathbb{E}_0[\phi(X)] and maximises \mathbb{E}_1[\phi(X)].

Unfortunately the goals of minimising \mathbb{E}_0[\phi(X)] and maximising \mathbb{E}_1[\phi(X)] are at odds with one another. This is because we want \phi to be small in order to minimise \mathbb{E}_0[\phi(X)] and we want \phi to be large to maximise \mathbb{E}_1[\phi(X)]. In nearly all cases we have to trade off between these two goals and there is no single test that simultaneously achieves both.

To work around this, a standard approach is to focus on maximising \mathbb{E}_1[\phi(X)] while requiring that \mathbb{E}_0[\phi(X)] remains below some threshold. The quantity \mathbb{E}_1[\phi(X)] is called the power of the test \phi. If \alpha is a number between 0 and 1, we will say that \phi has level\alpha if \mathbb{E}_1[\phi(X)] \le \alpha. A test \phi is said to be most powerful at level-\alpha, if

  • The test \phi is level-\alpha.
  • For all level-\alpha tests \phi', the test \phi is more powerful than \phi'. That is,

\mathbb{E}_1[\phi'(X)] \le \mathbb{E}_1[\phi(X)].

Thus we can see that finding a most powerful level-\alpha test is a constrained optimisation problem. We wish to maximise the quantity

\mathbb{E}_1[\phi(X)]

subject to the constraint

\mathbb{E}_0[\phi(X)] \le \alpha

With this in mind, we turn to the method of undetermined multipliers.

The method of undetermined multipliers

The method of undetermined multipliers (also called the method of Lagrange multipliers) is a very general optimisation tool. Suppose that we have a set U and two function f,g : U \to \mathbb{R} and we wish to maximise f(u) subject to the constraint g(u) \le 0.

In the context of hypothesis testing, the set U is the set of all tests \phi. The objective function f is given by f(\phi) = \mathbb{E}_1[\phi(X)]. That is, f(\phi) is the power of the test \phi. The constraint function g is given by g(\phi)=\mathbb{E}_1[\phi(X)]-\alpha so that g(\phi) \le 0 if and only if \phi has level-\alpha.

The method of undetermined multipliers allows us to reduce this constrained optimisation problem to a (hopefully easier) unconstrained optimisation problem. More specifically we have the following result:

Proposition: Suppose that u^* \in U is such that:

  • g(u^*) = 0,
  • There exists k \ge 0, such that u^* maximises f(u)-kg(u) over all u \in U.

Then u maximises f(u) under the constraint g(u) \le 0.

Proof: Suppose that u^* satisfies the above two dot points. We need to show that for all u \in U, if g(u) \le 0, then f(u) \le f(u^*). By assumption we know that f(u^*)=0 and u^* maximises f(u)-kg(u). Thus, for all u \in U,

f(u^*)=f(u^*)-kg(u^*) \ge f(u)-kg(u).

Now suppose g(u) \le 0. Then, kg(u) \le 0 and so f(u)-kg(u)\ge f(u) and so f(u^*) \ge f(u) as needed. \square

The constant k is the undetermined multiplier. The way to use the method of undetermined is to find values u_k^* that maximise h_k(u) = f(u)-kg(u) for each k \ge 0. The multiplier k is then varied so that the constraint g(u^*_k) = 0 is satisfied.

Proving the Neyman-Pearson lemma

Now let’s use the method of undetermined multipliers to find most powerful tests. Recall the set U which we are optimising over is the set of all tests \phi. Recall also that we wish to optimise f(\phi) = \mathbb{E}_1[\phi(X)] subject to the constraint g(\phi) = \mathbb{E}_0[\phi(X)] - \alpha \le 0. The method of undetermined multipliers says that we should consider maximising the function

h_k(\phi) = \mathbb{E}_1[\phi(X)] - k\mathbb{E}_0[\phi(X)],

where k \ge 0. Suppose that both \mathbb{P}_0 and \mathbb{P}_1 have densities1 p_0 and p_1 with respect to some measure \mu. We can we can write the above expectations as integrals. That is,

\mathbb{E}_0[\phi(X)] = \int \phi(x)p_0(x)\mu(dx) and \mathbb{E}_1[\phi(X)] = \int \phi(x)p_1(x)\mu(dx).

Thus the function h_k is equal to

h_k(\phi) =  \int \phi(x)p_1(x)\mu(dx)- k\int \phi(x)p_0(x)\mu(dx)=\int \phi(x)(p_1(x)-kp_0(x))\mu(dx).

We now wish to maximise the function h_k(\phi). Recall that \phi is a function that take values in [0,1]. Thus, the integral

\int \phi(x)(p_1(x)-kp_0(x))\mu(dx),

is maximised if and only if \phi(x)=1 when p_1(x)-kp_0(x) > 0 and \phi(x)=0 when p_1(x)-kp_0(x) < 0. Note that p_1(x)-kp_0(x) > 0 if and only if \frac{p_1(x)}{p_0(x)} > k. Thus for each k \ge 0, a test \phi_k maximises h_k(\phi) if and only if

\phi_k(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > k,\\ 0 &\text{if } \frac{p_1(x)}{p_0(x)} < k. \end{cases}

The method of undetermined multipliers says that if we can find k so that the above is satisfied and g(\phi_k) = 0, then \phi_k is a most powerful test. Recall that g(\phi_k)=0 is equivalent to \mathbb{E}_1[\phi(X)]=\alpha. By summarising the above argument, we arrive at the Neyman-Pearson lemma,

Neyman-Pearson Lemma2: Suppose that \phi is a test such that

  • \mathbb{E}_0[\phi(X)] = \alpha, and
  • For some k \ge 0, \phi(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > k,\\ 0 & \text{if } \frac{p_1(x)}{p_0(x)}< k.\end{cases}

then \phi is most powerful at level-\alpha for testing H_0 : X \sim \mathbb{P}_0 against H_1 : X \sim \mathbb{P}_1.

The magic of Neyman-Pearson

By learning about undetermined multipliers I’ve been able to better understand the proof of the Neyman-Pearson lemma. I now view it is as clever solution to a constrained optimisation problem rather than something that comes out of nowhere.

There is, however, a different aspect of Neyman-Pearson that continues to surprise me. This aspect is the usefulness of the lemma. At first glance the Neyman-Pearson lemma seems to be a very specialised result because it is about simple hypothesis testing. In reality most interesting hypothesis tests have composite nulls or composite alternatives or both. It turns out that Neyman-Pearson is still incredibly useful for composite testing. Through ideas like monotone likelihood ratios, least favourable distributions and unbiasedness, the Neyman-Pearson lemma or similar ideas can be used to find optimal tests in a variety of settings.

Thus I must admit that the title of this blog post is a little inaccurate and deceptive. I do believe that, given the tools of undetermined multipliers and the set up of simple hypothesis testing, one is naturally led to the Neyman-Pearson lemma. However, I don’t believe that many could have realised how useful and interesting simple hypothesis testing would be.

Footnotes

  1. The assumption that \mathbb{P}_0 and \mathbb{P}_1 have densities with respect to a common measure \mu is not a restrictive assumption since one can always take \mu = \mathbb{P}_0+\mathbb{P}_1 and the apply Radon-Nikodym. However there is often a more natural choice of \mu such as Lebesgue measure on \mathbb{R}^d or the counting measure on \mathbb{N}.
  2. What I call the Neyman-Pearson lemma is really only a third of the Neyman-Pearson lemma. There are two other parts. One that guarantees the existence of a most powerful test and one that is a partial converse to the statement in this post.

Leverages Scores

I am very excited to be writing a blog post again – it has been nearly a year! This post marks a new era for the blog. In September I started a statistics PhD at Stanford University. I am really enjoying my classes and I am learning a lot. I might have to change the name of the blog soon but for now let’s stick with “Maths to Share” although you will undoubtedly see more and more statistics here.

Today I would like to talk about leverages scores. Leverages scores are a way to quantify how sensitive a model is and they can be used to explain the different behaviour in these two animations

Linear Models

I recently learnt about leverage scores in the applied statistics course STATS 305A. This course is all about the linear model. In the linear model we assume with have n data points (x_i,y_i) where x_i is a vector in \mathbb{R}^d and y_i is a number in \mathbb{R}. We model y_i as a linear function of x_i plus noise. That is we assume y_i = x_i^T\beta + \varepsilon_i, where \beta \in \mathbb{R}^d is a unknown vector of coefficients and \varepsilon_i is a random variable with mean 0 and variance \sigma^2. We also require that for i \neq j, the random variable \varepsilon_i is uncorrelated with \varepsilon_j.

We can also write this as a matrix equation. Define y to be the vector with entries y_i and define X to be the matrix with rows x_i^T, that is

y = \begin{bmatrix} y_1\\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n and X = \begin{bmatrix} -x_1^T-\\-x_2^T-\\ \vdots \\ -x_n^T-\end{bmatrix} \in \mathbb{R}^{n \times d}.

Then our model can be rewritten as

y = X\beta + \varepsilon,

where \varepsilon \in \mathbb{R}^n is a random vector with mean 0 and covariance matrix \sigma^2 I_n. To simplify calculations we will assume that X contains an intercept term. This means that the first column of X consists of all 1’s.

In the two animations at the start of this post we have two nearly identical data sets. The data sets are an example of simple regression when each vector x_i is of the form (1,z_i) where z_i is a number. The values z_i are on the horizontal axis and the values y_i are on the vertical axis.

Estimating the coefficients

In the linear model we wish to estimate the parameter \beta which contains the coefficients of our model. That is, given a sample (y_i,x_i)_{i=1}^n, we wish to construct a vector \widehat{\beta} which approximates the true parameter \beta. In ordinary least square regression we choose \widehat{\beta} to be the vector b \in \mathbb{R}^d that minimizes the quantity

\sum_{i=1}^n (x_i^T b - y_i)^2=\left \Vert Xb - y \right \Vert_2^2.

Differentiating with respect to b and setting the derivative equal to 0 shows that \widehat{\beta} is a solution to the normal equations:

X^TXb = X^T y.

We will assume that the matrix X^TX is invertible. In this case then the normal equations have a unique solution \widehat{\beta} = (X^TX)^{-1}X^T y.

Now that we have our estimate \widehat{\beta}, we can do prediction. If we are given a new value x' \in \mathbb{R}^d we would use x'^T\widehat{\beta} to predict the corresponding value of y'. This was how the straight lines in the two animations were calculated.

We can also calculate the model’s predicted values for the data x_i that we used to fit the model. These are denoted by \widehat{y}. Note that

\widehat{y} = X\widehat{\beta} = X(X^TX)^{-1}X^Ty = Hy,

where H = X(X^TX)^{-1}X^T is called the hat matrix for the model (since it puts the hat \widehat{ } on y.

Leverage scores

We are now ready to talk about leverage scores and the two animations. For reference, here they are again:

In both animations the stationary line corresponds to an estimator \widehat{\beta} that was calculated using only the black data points. The red points are new data points with different x values and varying y values. The moving line corresponds to an estimator \widehat{\beta} calculated using the red data point as well as all the black points. We can see immediately that if the red point is far away from the “bulk” of the other x points, then the moving line is a lot more sensitive to the y value of the red point.

The leverage score of a data point (x_i,y_i) is defined to be \frac{\partial \widehat{y}_i}{\partial y_i}. That is, the leverage score tells us how much does the prediction \widehat{y}_i change if we change y_i.

Since \widehat{y} = Hy, the leverage score of (x_i,y_i) is H_{ii}, the i^{th} diagonal element of the hat matrix H. The idea is that if a data point (x_i,y_i) has a large leverage score, then the model is more sensitive to changes in that value of y_i.

This can be seen in a leave one out calculation. This calculation tells us what we should expect if we make a leave-one-out model – a model that uses all the data points apart from one. In our animations, this corresponds to the stationary line.

The leave one out calculation says that the predicted value using all the data is always between the true value and the predicted value from the leave-one-out model. In our animations this can be seen by noting that the moving line (the full model) is always between the red point (the true value) and the stationary line (the leave-one-out model).

Furthermore the leverage score tells us exactly how close the predicted value is to the true value. We can see that the moving line is much closer to the red dot in the high leverage example on the right than the low leverage example on the left.

Mahalanobis distance

We now know that the two animations are showing the sensitivity of a model to two different data points. We know that a model is more sensitive to point with high leverage than to points with low leverage. We still haven’t spoken about why some point have higher leverage and why the point on the right has higher leverage.

It turns out that leverage score are measuring how far away a data point is from the “bulk” of the other x_i‘s. More specifically in a one dimensional example like what we have in the animations

H_{ii} = \frac{1}{n}\left(\frac{1}{S^2}(x_i-\bar{x})^2+1\right),

where n is the number of data points, \bar{x} = \frac{1}{n}\sum_{j=1}^n x_j is the sample mean and S^2 = \frac{1}{n}\sum_{j=1}^n (x_j-\bar{x})^2 is the sample variance. Thus high leverage scores correspond to points that are far away from the centre of our data x_i. In higher dimensions a similar result holds if we measure distance using Mahalanobis distance.

The mean of the black data points is approximately 2 and so we can now see why the second point has higher leverage. The two animations were made in Geogebra. You can play around with them here and here.