Two sample tests as correlation tests

Suppose we have two samples Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.

The equal variance t-test

In the equal variance t-test, we assume that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim}  \mathcal{N}(\mu_1,\sigma^2), where \sigma^2 is unknown. Our hypothesis that Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} are from the same distribution becomes the hypothesis \mu_0 = \mu_1. The test statistic is

t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

where \overline{Y}^{(0)} and \overline{Y}^{(1)} are the two sample means. The variable \hat{\sigma} is the pooled estimate of the standard deviation and is given by

\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right).

Under the null hypothesis, t follows the T-distribution with n_0+n_1-2 degrees of freedom. We thus reject the null \mu_0=\mu_1 when |t| exceeds the 1-\alpha/2 quantile of the T-distribution.
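As a quick sanity check, here is a minimal Python sketch (numpy and scipy assumed available; the sample sizes and means are arbitrary illustrative choices) that computes t and the pooled variance directly from the formulas above and compares the result with scipy's equal-variance two-sample t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n0, n1 = 12, 15
y0 = rng.normal(loc=0.0, scale=1.0, size=n0)   # sample from N(mu_0, sigma^2)
y1 = rng.normal(loc=0.5, scale=1.0, size=n1)   # sample from N(mu_1, sigma^2)

# Pooled variance estimate with n0 + n1 - 2 degrees of freedom
sigma2_hat = (np.sum((y0 - y0.mean())**2) + np.sum((y1 - y1.mean())**2)) / (n0 + n1 - 2)
t = (y1.mean() - y0.mean()) / np.sqrt(sigma2_hat * (1/n0 + 1/n1))

# Two-sided p-value from the t-distribution with n0 + n1 - 2 degrees of freedom
p = 2 * stats.t.sf(abs(t), df=n0 + n1 - 2)

# Cross-check against scipy's equal-variance two-sample t-test
t_scipy, p_scipy = stats.ttest_ind(y1, y0, equal_var=True)
print(t, t_scipy, p, p_scipy)
```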

Pooling the data

We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1} be the pooled data and for i = 1,\ldots, n_0+n_1, define x_i \in \{0,1\} by

x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}

The assumptions that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2) can be rewritten as

Y_i = \beta_0+\beta_1x_i + \varepsilon_i,

where \varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2). That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis \mu_0 = \mu_1 is equivalent to \beta_1 = 0. To test \beta_1 = 0 we can use the standard t-test for a coefficient in a linear model. The test statistic in this case is

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},

where \hat{\beta}_1 is the ordinary least squares estimate of \beta_1, X \in \mathbb{R}^{(n_0+n_1)\times 2} is the design matrix, (X^TX)^{-1}_{11} is the diagonal entry of (X^TX)^{-1} corresponding to \beta_1 (we index the rows and columns by 0 and 1 to match \beta_0 and \beta_1), and \hat{\sigma}_{OLS} is an estimate of \sigma given by

\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,

where \hat{Y}_i = \hat{\beta}_0+\hat{\beta}_1x_i is the fitted value of Y_i.

It turns out that t' is exactly equal to t. We can see this by writing out the design matrix and calculating everything above. The design matrix has rows [1,x_i] and is thus equal to

X = \begin{bmatrix} 1&x_1\\ 1&x_2\\  \vdots&\vdots\\ 1&x_{n_0}\\  1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\  \vdots&\vdots\\ 1&0\\  1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.

This implies that

X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},

and therefore,

(X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} = \frac{1}{n_0n_1}\begin{bmatrix} n_1&-n_1\\-n_1&n_0+n_1\end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .

Thus, (X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}. So,

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

which is starting to look like t from the two-sample test. Now

X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}.

And so

\hat{\beta} = (X^TX)^{-1}X^TY = \begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix}\begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}=\begin{bmatrix} \overline{Y}^{(0)}\\  \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.

Thus, \hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)} and

t' = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.

This means to show that t' = t, we only need to show that \hat{\sigma}_{OLS}^2=\hat{\sigma}^2. To do this, note that the fitted values \hat{Y} are equal to

\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.

Thus,

\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2=\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),

which is exactly \hat{\sigma}^2. Therefore, t'=t and the two-sample t-test is equivalent to a correlation test.
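To make the equivalence concrete, here is a small numpy sketch (library availability assumed, data simulated purely for illustration) that computes the two-sample statistic t and the regression statistic t' from the formulas above and confirms they agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n1 = 10, 14
y0 = rng.normal(0.0, 1.0, n0)
y1 = rng.normal(0.3, 1.0, n1)

# Two-sample t statistic
s2 = (np.sum((y0 - y0.mean())**2) + np.sum((y1 - y1.mean())**2)) / (n0 + n1 - 2)
t = (y1.mean() - y0.mean()) / np.sqrt(s2 * (1/n0 + 1/n1))

# Pooled data and dummy variable
y = np.concatenate([y0, y1])
x = np.concatenate([np.zeros(n0), np.ones(n1)])
X = np.column_stack([np.ones(n0 + n1), x])        # design matrix with rows [1, x_i]

# OLS fit and the usual t statistic for beta_1
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_ols = resid @ resid / (n0 + n1 - 2)
XtX_inv = np.linalg.inv(X.T @ X)
t_prime = beta_hat[1] / np.sqrt(sigma2_ols * XtX_inv[1, 1])

print(t, t_prime)   # identical up to floating-point error
```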

The Friedman-Rafsky test

In the above example, we saw that the two-sample t-test is a special case of the t-test for a regression coefficient. This is neat, but both tests make very strong assumptions about the data. The same phenomenon, however, also occurs in a more interesting non-parametric setting.

In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two-sample test that makes no assumptions about the distribution of the data. The two samples do not even have to be real-valued and can instead be from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.

I plan to write a follow-up post explaining these procedures, but you can also read about them in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.

References

Persi Diaconis, “Group Representations in Probability and Statistics,” pp. 104-106, Hayward, CA: Institute of Mathematical Statistics (1988).

Jerome H. Friedman, Lawrence C. Rafsky, “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics 7(4), 697-717 (1979).

Jerome H. Friedman, Lawrence C. Rafsky, “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics 11(2), 377-391 (1983).

MCMC for hypothesis testing

A Monte Carlo significance test of the null hypothesis X_0 \sim H requires creating independent samples X_1,\ldots,X_B \sim H. The idea is that if X_0 \sim H and, independently, X_1,\ldots,X_B are i.i.d. from H, then for any test statistic T, the rank of T(X_0) among T(X_0), T(X_1),\ldots, T(X_B) is uniformly distributed on \{1,\ldots, B+1\}. This means that if T(X_0) is one of the k largest values of T(X_0), T(X_1),\ldots,T(X_B), then we can reject the hypothesis X_0 \sim H at the significance level k/(B+1).
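Here is a minimal sketch of a plain Monte Carlo significance test (numpy assumed available; the function name monte_carlo_pvalue and the toy normal example are my own illustrative choices, not from the paper discussed below).

```python
import numpy as np

rng = np.random.default_rng(2)

def monte_carlo_pvalue(x0, sampler, T, B=999):
    """Monte Carlo p-value for the hypothesis X0 ~ H, rejecting for large T."""
    t0 = T(x0)
    t_ref = np.array([T(sampler()) for _ in range(B)])
    k = 1 + np.sum(t_ref >= t0)          # rank of T(x0) from the top, counting x0 itself
    return k / (B + 1)

# Toy example: H is N(0, 1)^20 and T is the sample mean
sampler = lambda: rng.normal(size=20)
T = np.mean
p = monte_carlo_pvalue(rng.normal(size=20), sampler, T)
print(p)
```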

The advantage of Monte Carlo significance tests is that we do not need an analytic expression for the distribution of T(X_0) under X_0 \sim H. By generating the i.i.d. samples X_1,\ldots,X_B, we construct an empirical distribution that approximates the theoretical one. However, sometimes sampling X_1,\ldots, X_B is just as intractable as studying the distribution of T(X_0) theoretically. In such cases, approximate samples based on Markov chain Monte Carlo (MCMC) are often used instead. These samples are not independent, however, and may not come from the true distribution, so a test based on them may not be statistically valid.

In their 1989 paper, Generalized Monte Carlo significance tests, Besag and Clifford propose two methods that solve exactly this problem. Both methods can be used in the same settings as MCMC, but they are statistically valid and correctly control the Type I error. In this post, I will describe just one of them – the serial test.

Background on Markov chains

To describe the serial test we will need to introduce some notation. Let P denote a transition matrix for a Markov chain on a discrete state space \mathcal{X}. A Markov chain with transition matrix P thus satisfies,

\mathbb{P}(X_n = x_n \mid X_0 = x_0,\ldots, X_{n-1} = x_{n-1}) = P(x_{n-1}, x_n)

Suppose that the transition matrix P has a stationary distribution \pi. This means that if (X_0, X_1,\ldots) is a Markov chain with transition matrix P and X_0 is distributed according to \pi, then X_1 is also distributed according to \pi. This implies that all X_i are distributed according to \pi.

We can construct a new transition matrix Q from P and \pi by Q(y,x) = P(x,y) \frac{\pi(x)}{\pi(y)}. The transition matrix Q is called the reversal of P. This is because for all x and y in \mathcal{X}, \pi(x)P(x,y) = \pi(y)Q(y,x). That is, the chance of drawing x from \pi and then transitioning to y according to P is equal to the chance of drawing y from \pi and then transitioning to x according to Q.

The new transition matrix Q also allows us to reverse longer runs of the Markov chain. Fix n \ge 0 and let (X_0,X_1,\ldots,X_n) be a Markov chain with transition matrix P and initial distribution \pi. Also let (Y_0,Y_1,\ldots,Y_n) be a Markov chain with transition matrix Q and initial distribution \pi, then

(X_0,X_1,\ldots,X_{n-1}, X_n) \stackrel{dist}{=} (Y_n,Y_{n-1},\ldots, Y_1,Y_0) ,

where \stackrel{dist}{=} means equal in distribution.
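As a numerical illustration, the following numpy sketch (numpy assumed available; the 4-state chain is an arbitrary toy example) builds a random transition matrix P, finds its stationary distribution \pi, forms the reversal Q, and checks the identity \pi(x)P(x,y) = \pi(y)Q(y,x).

```python
import numpy as np

rng = np.random.default_rng(3)

# A small transition matrix P on 4 states (rows sum to one)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)

# Stationary distribution: left eigenvector of P with eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

# Reversal Q(y, x) = P(x, y) * pi(x) / pi(y)
Q = (P * pi[:, None]).T / pi[:, None]

# Check pi(x) P(x, y) == pi(y) Q(y, x) for all x, y
print(np.allclose(pi[:, None] * P, (pi[:, None] * Q).T))
# Q is itself a transition matrix and also has pi as a stationary distribution
print(np.allclose(Q.sum(axis=1), 1.0), np.allclose(pi @ Q, pi))
```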

The serial test

Suppose we want to test the hypothesis x \sim \pi, where x \in \mathcal{X} is our observed data and \pi is some distribution on \mathcal{X}. To conduct the serial test, we need to construct a transition matrix P for which \pi is a stationary distribution, along with the reversal Q described above. There are many ways to construct such a P, for example by using the Metropolis-Hastings algorithm.

We also need a test statistic T. This is a function T:\mathcal{X} \to \mathbb{R} which we will use to detect outliers. This function is the same function we would use in a regular Monte Carlo test. Namely, we want to reject the null hypothesis when T(x) is much larger than we would expect under x \sim \pi.

The serial test then proceeds as follows. First we pick \xi uniformly in \{0,1,\ldots,B\} and set X_\xi = x. We then generate (X_\xi, X_{\xi+1},\ldots,X_B) as a Markov chain with transition matrix P that starts at X_\xi = x. Likewise, we generate (X_\xi, X_{\xi-1},\ldots, X_0) as a Markov chain with transition matrix Q, again starting at X_\xi = x.

We then apply T to each of (X_0,X_1,\ldots,X_\xi,\ldots,X_B) and count the number of b \in \{0,1,\ldots,B\} such that T(X_b) \ge T(X_\xi) = T(x). If there are k such b, then the reported p-value of our test is k/(B+1).
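Below is a rough Python sketch of the serial test as described above (numpy assumed; the binomial-shaped target, the Metropolis chain and the function names are illustrative choices of mine rather than Besag and Clifford's). Because a Metropolis chain is reversible, its reversal Q equals P, which keeps the example short; the count defining the p-value includes the observed value itself, as in the description above.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)

def serial_test(x, step_P, step_Q, T, B=999):
    """Sketch of the Besag-Clifford serial test for H: x ~ pi.

    step_P(s) draws one forward transition, step_Q(s) one transition of the
    reversal; both chains must have pi as a stationary distribution.
    Rejects for large T and returns the p-value k / (B + 1)."""
    xi = rng.integers(0, B + 1)                 # position of the observed data in the run
    chain = [None] * (B + 1)
    chain[xi] = x
    for b in range(xi + 1, B + 1):              # forward segment using P
        chain[b] = step_P(chain[b - 1])
    for b in range(xi - 1, -1, -1):             # backward segment using Q
        chain[b] = step_Q(chain[b + 1])
    t = np.array([T(s) for s in chain])
    return np.sum(t >= t[xi]) / (B + 1)

# Toy target: pi(s) proportional to C(20, s) on {0, ..., 20} (a Binomial(20, 1/2) shape)
m = 20
pi_unnorm = np.array([comb(m, s) for s in range(m + 1)], dtype=float)

def metropolis_step(s):
    """Random-walk Metropolis step targeting pi; reversible, so Q = P here."""
    prop = s + rng.choice([-1, 1])
    if prop < 0 or prop > m:
        return s
    return prop if rng.random() < pi_unnorm[prop] / pi_unnorm[s] else s

T = lambda s: s                                  # detect observations that are too large
print(serial_test(19, metropolis_step, metropolis_step, T))   # x = 19 is extreme under pi
```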

We will now show that this test produces a valid p-value. That is, when x \sim \pi, the probability that the p-value k/(B+1) is at most \alpha is itself at most \alpha. In symbols,

\mathbb{P}\left( \frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) \le \alpha

Under the null hypothesis x \sim \pi, the reversal property above implies that the backward segment (X_\xi, X_{\xi-1},\ldots, X_0), read in the reverse order (X_0, X_1,\ldots,X_\xi), is distributed as a Markov chain generated by drawing X_0 from \pi and using the transition matrix P to go from X_i to X_{i+1}.

Thus, under the null hypothesis, the distribution of (X_0,X_1,\ldots,X_B) does not depend on \xi. The entire procedure is equivalent to generating a Markov chain \mathbf{X} = (X_0,X_1,\ldots,X_B) with initial distribution \pi and transition matrix P, and then choosing \xi \in \{0,1,\ldots,B\} independently of \mathbf{X}. This is enough to show that the serial method produces valid p-values. The idea is that since the distribution of \mathbf{X} does not depend on \xi and \xi is uniformly distributed on \{0,1,\ldots,B\}, the probability that T(X_\xi) is in the top \alpha proportion of T(X_b) should be at most \alpha. This is proved more formally below.

For each i \in \{0,1,\ldots,B\}, let A_i be the event that T(X_i) is in the top \alpha proportion of T(X_0),\ldots,T(X_B). That is,

A_i = \left\{ \frac{\# \{b : T(X_b) \ge T(X_i)\} }{B+1} \le \alpha \right\}.

Let I_{A_i} be the indicator function for the event A_i. Since at most \alpha(B+1) values of i can be in the top \alpha proportion of T(X_0),\ldots,T(X_B), we have that

\sum_{i=0}^B I_{A_i} \le \alpha(B+1),

and therefore, taking expectations of both sides and using linearity of expectation,

\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha(B+1)

By the law of total probability we have,

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \sum_{i=0}^B \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right)\mathbb{P}(\xi = i).

Since \xi is uniform on \{0,\ldots,B\}, \mathbb{P}(\xi = i) is \frac{1}{B+1} for all i. Furthermore, by independence of \xi and \mathbf{X}, we have

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_i)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}(A_i).

Thus, by our previous bound,

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \frac{1}{B+1}\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha.

Applications

In the original paper, Besag and Clifford discuss how this procedure can be used to perform goodness-of-fit tests. They construct Markov chains that can test the Rasch model or the Ising model. More broadly, the method can be used to perform goodness-of-fit tests for any exponential family by using the Markov chains developed by Diaconis and Sturmfels.

A similar method has also been applied more recently to detect gerrymandering. In this setting, the null hypothesis is the uniform distribution on all valid redistrictings of a state, and the test statistic T measures the political advantage of a given districting.

References

  1. Julian Besag, Peter Clifford, “Generalized Monte Carlo Significance Tests,” Biometrika 76(4) (1989).
  2. Persi Diaconis, Bernd Sturmfels, “Algebraic algorithms for sampling from conditional distributions,” The Annals of Statistics 26(1), 363-397 (1998).
  3. Maria Chikina et al., “Assessing significance in a Markov chain without mixing,” Proceedings of the National Academy of Sciences 114(11) (2017).

Sampling from the non-central chi-squared distribution

The non-central chi-squared distribution is a generalisation of the regular chi-squared distribution. The chi-squared distribution turns up in many statistical tests as the (approximate) distribution of a test statistic under the null hypothesis. Under alternative hypotheses, those same statistics often have approximate non-central chi-squared distributions.

This means that the non-central chi-squared distribution is often used to study the power of said statistical tests. In this post I give the definition of the non-central chi-squared distribution, discuss an important invariance property and show how to efficiently sample from this distribution.

Definition

Let Z be a normally distributed random vector with mean 0 and covariance I_n. Given a vector \mu \in \mathbb{R}^n, the non-central chi-squared distribution with n degrees of freedom and non-centrality parameter \Vert \mu\Vert_2^2 is the distribution of the quantity

\Vert Z+\mu \Vert_2^2 = \sum\limits_{i=1}^n (Z_i+\mu_i)^2.

This distribution is denoted by \chi^2_n(\Vert \mu \Vert_2^2). As this notation suggests, the distribution of \Vert Z+\mu \Vert_2^2 depends only on \Vert \mu \Vert_2^2, the squared norm of \mu. The first few times I heard this fact, I had no idea why it would be true (and even found it a little spooky). But, as we will see below, the result is actually a simple consequence of the fact that standard normal vectors are invariant under rotations.

Rotational invariance

Suppose that we have two vectors \mu, \nu \in \mathbb{R}^n such that \Vert \mu\Vert_2^2 = \Vert \nu \Vert_2^2. We wish to show that if Z \sim \mathcal{N}(0,I_n), then

\Vert Z+\mu \Vert_2^2 has the same distribution as \Vert Z + \nu \Vert_2^2.

Since \mu and \nu have the same norm there exists an orthogonal matrix U \in \mathbb{R}^{n \times n} such that U\mu = \nu. Since U is orthogonal and Z \sim \mathcal{N}(0,I_n), we have Z'=UZ \sim \mathcal{N}(U0,UU^T) = \mathcal{N}(0,I_n). Furthermore, since U is orthogonal, U preserves the norm \Vert \cdot \Vert_2^2. This is because, for all x \in \mathbb{R}^n,

\Vert Ux\Vert_2^2 = (Ux)^TUx = x^TU^TUx=x^Tx=\Vert x\Vert_2^2.

Putting all these pieces together we have

\Vert Z+\mu \Vert_2^2 = \Vert U(Z+\mu)\Vert_2^2 = \Vert UZ + U\mu \Vert_2^2 = \Vert Z'+\nu \Vert_2^2.

Since Z and Z' have the same distribution, we can conclude that \Vert Z'+\nu \Vert_2^2 has the same distribution as \Vert Z + \nu \Vert_2^2. Since \Vert Z + \mu \Vert_2^2 = \Vert Z'+\nu \Vert_2^2, we are done.
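Here is a quick simulation check of this invariance (numpy and scipy assumed available; the vectors \mu and \nu are arbitrary choices with the same norm). A two-sample Kolmogorov-Smirnov test compares the empirical distributions of \Vert Z+\mu\Vert_2^2 and \Vert Z+\nu\Vert_2^2.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 5, 100_000

mu = np.array([3.0, 4.0, 0.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 0.0, 0.0, 5.0])        # same norm as mu: ||mu|| = ||nu|| = 5

Za = rng.normal(size=(reps, n))
Zb = rng.normal(size=(reps, n))
a = np.sum((Za + mu)**2, axis=1)                # samples of ||Z + mu||^2
b = np.sum((Zb + nu)**2, axis=1)                # samples of ||Z + nu||^2

# The two empirical distributions should be statistically indistinguishable
print(stats.ks_2samp(a, b).pvalue)
```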

Sampling

Above we showed that the non-central chi-squared distribution \chi^2_n(\Vert \mu\Vert_2^2) depends on the vector \mu only through its norm. We will now use this to provide an algorithm that can efficiently generate samples from \chi^2_n(\Vert \mu \Vert_2^2).

A naive way to sample from \chi^2_n(\Vert \mu \Vert_2^2) would be to sample n independent standard normal random variables Z_i and then return \sum_{i=1}^n (Z_i+\mu_i)^2. But for large values of n this would be very slow as we have to simulate n auxiliary random variables Z_i for each sample from \chi^2_n(\Vert \mu \Vert_2^2). This approach would not scale well if we needed many samples.

An alternative approach uses the rotation invariance described above. The distribution \chi^2_n(\Vert \mu \Vert_2^2) depends only on \Vert \mu \Vert_2^2 and not directly on \mu. Thus, given \mu, we could instead work with \nu = \Vert \mu \Vert_2 e_1 where e_1 is the vector with a 1 in the first coordinate and 0s in all other coordinates. If we use \nu instead of \mu, we have

\sum\limits_{i=1}^n (Z_i+\nu_i)^2 = (Z_1+\Vert \mu \Vert_2)^2 + \sum\limits_{i=2}^{n}Z_i^2.

The sum \sum_{i=2}^n Z_i^2 follows the regular chi-squared distribution with n-1 degrees of freedom and is independent of Z_1. The regular chi-squared distribution is a special case of the gamma distribution and can be sampled efficiently with rejection sampling when the shape parameter is large.

The shape parameter for \sum_{i=2}^n Z_i^2 is \frac{n-1}{2}, so for large values of n we can efficiently sample a value Y that follows the same distribution as \sum_{i=2}^n Z_i^2 \sim \chi^2_{n-1}. Finally, to get a sample from \chi^2_n(\Vert \mu \Vert_2^2), we independently sample Z_1 and then return the sum (Z_1+\Vert \mu\Vert_2)^2 +Y.
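The procedure can be summarised in a few lines of Python. The sketch below (numpy and scipy assumed available; the helper name sample_ncx2 is mine) draws samples this way and compares their mean and variance with the known values n + \Vert\mu\Vert_2^2 and 2(n + 2\Vert\mu\Vert_2^2), with an optional cross-check against scipy's non-central chi-squared distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def sample_ncx2(n, mu_norm2, size):
    """Sample chi^2_n(mu_norm2) using one normal draw plus a central chi^2_{n-1} draw."""
    z1 = rng.normal(size=size)
    y = rng.chisquare(df=n - 1, size=size)        # central chi-squared, n - 1 degrees of freedom
    return (z1 + np.sqrt(mu_norm2))**2 + y

n, nc = 50, 9.0                                    # nc plays the role of ||mu||_2^2
samples = sample_ncx2(n, nc, size=100_000)

# Compare empirical mean/variance with the known values n + nc and 2(n + 2 nc)
print(samples.mean(), n + nc)
print(samples.var(), 2 * (n + 2 * nc))
# Optional cross-check against scipy's non-central chi-squared distribution
print(stats.kstest(samples, stats.ncx2(df=n, nc=nc).cdf).pvalue)
```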

Conclusion

In this post, we saw that the rotational invariance of the standard normal distribution gives a similar invariance for the non-central chi-squared distribution.

This invariance allowed us to efficiently sample from the non-central chi-squared distribution. The sampling procedure worked by reducing the problem to sampling from the regular chi-squared distribution.

The same invariance property is also used to calculate the cumulative distribution function and density of the non-central chi-squared distribution, although the resulting formulas are not for the faint of heart.

You could have proved the Neyman-Pearson lemma

The Neyman-Pearson lemma is a foundational and important result in the theory of hypothesis testing. When presented in class, the proof seemed magical and I had no idea where the ideas came from. I even drew a face like this 😲 next to the usual \square in my book when the proof was finished. Later in class we learnt the method of undetermined multipliers and suddenly I saw where the Neyman-Pearson lemma came from.

In this blog post, I’ll first give some background and set up notation for the Neyman-Pearson lemma. Then I’ll talk about the method of undetermined multipliers and show how it can be used to derive and prove the Neyman-Pearson lemma. Finally, I’ll write about why I still think the Neyman-Pearson lemma is magical despite the demystified proof.

Background

In the set up of the Neyman-Pearson lemma we have data x which is a realisation of some random variable X. We wish to conclude something about the distribution of X from our data x by doing a hypothesis test.

In the Neyman-Pearson lemma we have simple hypotheses. That is, our data either comes from the distribution \mathbb{P}_0 or from the distribution \mathbb{P}_1. Thus, our null hypothesis is H_0 : X \sim \mathbb{P}_0 and our alternative hypothesis is H_1 : X \sim \mathbb{P}_1.

A test of H_0 against H_1 is a function \phi that takes in data x and returns a number \phi(x) \in [0,1]. The value \phi(x) is the probability under the test \phi of rejecting H_0 given the observed data x. That is, if \phi(x) = 1, we always reject H_0 and if \phi(x)=0 we never reject H_0. For in-between values \phi(x) = \gamma \in (0,1), we reject H_0 with probability \gamma.

An ideal test would have two desirable properties. We would like a test that rejects H_0 with a low probability when H_0 is true but we would also like the test to reject H_0 with a high probability when H_1 is true. To state this more formally, let \mathbb{E}_0[\phi(X)] and \mathbb{E}_1[\phi(X)] be the expectation of \phi(X) under H_0 and H_1 respectively. The quantity \mathbb{E}_0[\phi(X)] is the probability we reject H_0 when H_0 is true. Likewise, the quantity \mathbb{E}_1[\phi(X)] is the probability we reject H_0 when H_1 is true. An optimal test would be one that minimises \mathbb{E}_0[\phi(X)] and maximises \mathbb{E}_1[\phi(X)].

Unfortunately the goals of minimising \mathbb{E}_0[\phi(X)] and maximising \mathbb{E}_1[\phi(X)] are at odds with one another. This is because we want \phi to be small in order to minimise \mathbb{E}_0[\phi(X)] and we want \phi to be large to maximise \mathbb{E}_1[\phi(X)]. In nearly all cases we have to trade off between these two goals and there is no single test that simultaneously achieves both.

To work around this, a standard approach is to focus on maximising \mathbb{E}_1[\phi(X)] while requiring that \mathbb{E}_0[\phi(X)] remains below some threshold. The quantity \mathbb{E}_1[\phi(X)] is called the power of the test \phi. If \alpha is a number between 0 and 1, we will say that \phi is level-\alpha if \mathbb{E}_0[\phi(X)] \le \alpha. A test \phi is said to be most powerful at level-\alpha if

  • The test \phi is level-\alpha.
  • For all level-\alpha tests \phi', the test \phi is at least as powerful as \phi'. That is,

\mathbb{E}_1[\phi'(X)] \le \mathbb{E}_1[\phi(X)].

Thus we can see that finding a most powerful level-\alpha test is a constrained optimisation problem. We wish to maximise the quantity

\mathbb{E}_1[\phi(X)]

subject to the constraint

\mathbb{E}_0[\phi(X)] \le \alpha

With this in mind, we turn to the method of undetermined multipliers.

The method of undetermined multipliers

The method of undetermined multipliers (also called the method of Lagrange multipliers) is a very general optimisation tool. Suppose that we have a set U and two functions f,g : U \to \mathbb{R}, and we wish to maximise f(u) subject to the constraint g(u) \le 0.

In the context of hypothesis testing, the set U is the set of all tests \phi. The objective function f is given by f(\phi) = \mathbb{E}_1[\phi(X)]. That is, f(\phi) is the power of the test \phi. The constraint function g is given by g(\phi)=\mathbb{E}_0[\phi(X)]-\alpha, so that g(\phi) \le 0 if and only if \phi is level-\alpha.

The method of undetermined multipliers allows us to reduce this constrained optimisation problem to a (hopefully easier) unconstrained optimisation problem. More specifically we have the following result:

Proposition: Suppose that u^* \in U is such that:

  • g(u^*) = 0,
  • There exists k \ge 0, such that u^* maximises f(u)-kg(u) over all u \in U.

Then u^* maximises f(u) under the constraint g(u) \le 0.

Proof: Suppose that u^* satisfies the above two dot points. We need to show that for all u \in U, if g(u) \le 0, then f(u) \le f(u^*). By assumption we know that g(u^*) = 0 and that u^* maximises f(u)-kg(u). Thus, for all u \in U,

f(u^*)=f(u^*)-kg(u^*) \ge f(u)-kg(u).

Now suppose g(u) \le 0. Then, kg(u) \le 0 and so f(u)-kg(u)\ge f(u) and so f(u^*) \ge f(u) as needed. \square

The constant k is the undetermined multiplier. The way to use the method of undetermined multipliers is to find, for each k \ge 0, a value u_k^* that maximises h_k(u) = f(u)-kg(u). The multiplier k is then varied so that the constraint g(u^*_k) = 0 is satisfied.

Proving the Neyman-Pearson lemma

Now let’s use the method of undetermined multipliers to find most powerful tests. Recall the set U which we are optimising over is the set of all tests \phi. Recall also that we wish to optimise f(\phi) = \mathbb{E}_1[\phi(X)] subject to the constraint g(\phi) = \mathbb{E}_0[\phi(X)] - \alpha \le 0. The method of undetermined multipliers says that we should consider maximising the function

h_k(\phi) = \mathbb{E}_1[\phi(X)] - k\mathbb{E}_0[\phi(X)],

where k \ge 0. Suppose that both \mathbb{P}_0 and \mathbb{P}_1 have densities[1] p_0 and p_1 with respect to some measure \mu. We can then write the above expectations as integrals. That is,

\mathbb{E}_0[\phi(X)] = \int \phi(x)p_0(x)\mu(dx) and \mathbb{E}_1[\phi(X)] = \int \phi(x)p_1(x)\mu(dx).

Thus the function h_k is equal to

h_k(\phi) =  \int \phi(x)p_1(x)\mu(dx)- k\int \phi(x)p_0(x)\mu(dx)=\int \phi(x)(p_1(x)-kp_0(x))\mu(dx).

We now wish to maximise the function h_k(\phi). Recall that \phi is a function that takes values in [0,1]. Thus, the integral

\int \phi(x)(p_1(x)-kp_0(x))\mu(dx),

is maximised if and only if \phi(x)=1 when p_1(x)-kp_0(x) > 0 and \phi(x)=0 when p_1(x)-kp_0(x) < 0. Note that p_1(x)-kp_0(x) > 0 if and only if \frac{p_1(x)}{p_0(x)} > k. Thus for each k \ge 0, a test \phi_k maximises h_k(\phi) if and only if

\phi_k(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > k,\\ 0 &\text{if } \frac{p_1(x)}{p_0(x)} < k. \end{cases}

The method of undetermined multipliers says that if we can find k so that the above is satisfied and g(\phi_k) = 0, then \phi_k is a most powerful test. Recall that g(\phi_k)=0 is equivalent to \mathbb{E}_0[\phi_k(X)]=\alpha. Summarising the above argument, we arrive at the Neyman-Pearson lemma.

Neyman-Pearson Lemma[2]: Suppose that \phi is a test such that

  • \mathbb{E}_0[\phi(X)] = \alpha, and
  • For some k \ge 0, \phi(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > k,\\ 0 & \text{if } \frac{p_1(x)}{p_0(x)}< k.\end{cases}

then \phi is most powerful at level-\alpha for testing H_0 : X \sim \mathbb{P}_0 against H_1 : X \sim \mathbb{P}_1.
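To see the lemma in action, here is a small simulation sketch (numpy and scipy assumed available; the Gaussian mean-shift example and the competing test are my own illustrative choices, not part of the lemma itself). For H_0 : X \sim \mathcal{N}(0, I_n) against H_1 : X \sim \mathcal{N}(\theta \mathbf{1}, I_n) with \theta > 0, the likelihood ratio is increasing in the sample mean, so the Neyman-Pearson test rejects for large \bar{x}; its simulated power exceeds that of another level-\alpha test that only looks at the first coordinate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, theta, alpha = 10, 0.5, 0.05

# H0: X ~ N(0, I_n), H1: X ~ N(theta * 1, I_n).  The likelihood ratio
# p1(x)/p0(x) = exp(n * theta * xbar - n * theta^2 / 2) is increasing in xbar,
# so the Neyman-Pearson test rejects when xbar exceeds a cutoff chosen to have level alpha.
c_np = stats.norm.ppf(1 - alpha) / np.sqrt(n)          # P_0(xbar > c_np) = alpha

# A competing level-alpha test that only looks at the first coordinate
c_first = stats.norm.ppf(1 - alpha)                     # P_0(X_1 > c_first) = alpha

reps = 200_000
X1 = rng.normal(loc=theta, size=(reps, n))              # data generated under H1
power_np = np.mean(X1.mean(axis=1) > c_np)
power_first = np.mean(X1[:, 0] > c_first)
print(power_np, power_first)                            # the Neyman-Pearson test has higher power
```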

The magic of Neyman-Pearson

By learning about undetermined multipliers I’ve been able to better understand the proof of the Neyman-Pearson lemma. I now view it as a clever solution to a constrained optimisation problem rather than something that comes out of nowhere.

There is, however, a different aspect of Neyman-Pearson that continues to surprise me. This aspect is the usefulness of the lemma. At first glance the Neyman-Pearson lemma seems to be a very specialised result because it is about simple hypothesis testing. In reality most interesting hypothesis tests have composite nulls or composite alternatives or both. It turns out that Neyman-Pearson is still incredibly useful for composite testing. Through ideas like monotone likelihood ratios, least favourable distributions and unbiasedness, the Neyman-Pearson lemma or similar ideas can be used to find optimal tests in a variety of settings.

Thus I must admit that the title of this blog post is a little inaccurate and deceptive. I do believe that, given the tools of undetermined multipliers and the set up of simple hypothesis testing, one is naturally led to the Neyman-Pearson lemma. However, I don’t believe that many could have realised how useful and interesting simple hypothesis testing would be.

Footnotes

  1. The assumption that \mathbb{P}_0 and \mathbb{P}_1 have densities with respect to a common measure \mu is not restrictive since one can always take \mu = \mathbb{P}_0+\mathbb{P}_1 and then apply the Radon-Nikodym theorem. However, there is often a more natural choice of \mu, such as the Lebesgue measure on \mathbb{R}^d or the counting measure on \mathbb{N}.
  2. What I call the Neyman-Pearson lemma is really only a third of the Neyman-Pearson lemma. There are two other parts. One that guarantees the existence of a most powerful test and one that is a partial converse to the statement in this post.