Two-sample tests as correlation tests

Suppose we have two samples $Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)}$ and $Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)}$ and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.

The equal variance t-test

In the equal variance t-test, we assume that $Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2)$ and $Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2)$, where $\sigma^2$ is unknown. Our hypothesis that $Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)}$ and $Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)}$ are from the same distribution becomes the hypothesis $\mu_0 = \mu_1$. The test statistic is

$t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}$,

where $\overline{Y}^{(0)}$ and $\overline{Y}^{(1)}$ are the two sample means. The variable $\hat{\sigma}$ is the pooled estimate of the standard deviation and is given by

$\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right)$.

Under the null hypothesis, $t$ follows the t-distribution with $n_0+n_1-2$ degrees of freedom. We thus reject the null $\mu_0=\mu_1$ when $|t|$ exceeds the $1-\alpha/2$ quantile of this distribution.
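As a quick sanity check, the statistic above can be computed directly and compared with SciPy's equal-variance two-sample t-test. The samples, sizes, and means below are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical samples; sizes and parameters are illustrative only.
y0 = rng.normal(loc=0.0, scale=1.0, size=12)   # sample 0
y1 = rng.normal(loc=0.5, scale=1.0, size=15)   # sample 1
n0, n1 = len(y0), len(y1)

# Pooled estimate of sigma^2 with n0 + n1 - 2 degrees of freedom.
sigma2_hat = (np.sum((y0 - y0.mean())**2)
              + np.sum((y1 - y1.mean())**2)) / (n0 + n1 - 2)
t_manual = (y1.mean() - y0.mean()) / np.sqrt(sigma2_hat * (1/n0 + 1/n1))

# SciPy's equal-variance two-sample t-test computes the same statistic
# (with this argument order, the same sign as well).
t_scipy, p = stats.ttest_ind(y1, y0, equal_var=True)
print(np.isclose(t_manual, t_scipy))  # True
```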

Pooling the data

We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let $Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1}$ be the pooled data and for $i = 1,\ldots, n_0+n_1$, define $x_i \in \{0,1\}$ by

$x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}$

The assumptions that $Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2)$ and $Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2)$ can be rewritten as

$Y_i = \beta_0+\beta_1x_i + \varepsilon_i,$

where $\varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)$, $\beta_0 = \mu_0$ and $\beta_1 = \mu_1 - \mu_0$. That is, we have expressed our modelling assumptions as a linear model. Under this linear model, the hypothesis $\mu_0 = \mu_1$ is equivalent to $\beta_1 = 0$. To test $\beta_1 = 0$ we can use the standard t-test for a coefficient in a linear model. The test statistic in this case is

$t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},$

where $\hat{\beta}_1$ is the ordinary least squares estimate of $\beta_1$, $X \in \mathbb{R}^{(n_0+n_1)\times 2}$ is the design matrix and $\hat{\sigma}_{OLS}$ is an estimate of $\sigma$ given by

$\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,$

where $\hat{Y}_i = \hat{\beta}_0+\hat{\beta}_1x_i$ is the fitted value of $Y_i$.

It turns out that $t'$ is exactly equal to $t$. We can see this by writing out the design matrix and calculating everything above. The design matrix has rows $[1,x_i]$ and is thus equal to

$X = \begin{bmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_{n_0}\\ 1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\ \vdots&\vdots\\ 1&0\\ 1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.$

This implies that

$X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},$

and therefore,

$(X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} = \frac{1}{n_0n_1}\begin{bmatrix} n_1&-n_1\\-n_1&n_0+n_1\end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .$

Thus, $(X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}$. So,
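This identity is easy to verify numerically; the sample sizes below are arbitrary.

```python
import numpy as np

# Arbitrary sample sizes for a quick numerical check of the inverse.
n0, n1 = 7, 5
x = np.concatenate([np.zeros(n0), np.ones(n1)])   # dummy variable
X = np.column_stack([np.ones(n0 + n1), x])        # design matrix

XtX_inv = np.linalg.inv(X.T @ X)
print(XtX_inv)  # [[1/n0, -1/n0], [-1/n0, 1/n0 + 1/n1]]
```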

$t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},$

which is starting to look like $t$ from the two-sample test. Now

$X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\ n_1\overline{Y}^{(1)} \end{bmatrix}.$

And so

$\hat{\beta} = (X^TX)^{-1}X^TY = \begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix}\begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\ n_1\overline{Y}^{(1)} \end{bmatrix}=\begin{bmatrix} \overline{Y}^{(0)}\\ \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.$
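This formula for $\hat{\beta}$ can also be checked numerically by solving the normal equations on hypothetical data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical samples; any data would do for this identity.
n0, n1 = 6, 9
y0, y1 = rng.normal(size=n0), rng.normal(size=n1)

# Pooled response and design matrix with the dummy variable.
y = np.concatenate([y0, y1])
X = np.column_stack([np.ones(n0 + n1),
                     np.concatenate([np.zeros(n0), np.ones(n1)])])

# Solve the normal equations (X^T X) beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_hat, [y0.mean(), y1.mean() - y0.mean()]))  # True
```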

Thus, $\hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)}$ and

$t' = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.$

This means that to show $t' = t$, we only need to show that $\hat{\sigma}_{OLS}^2=\hat{\sigma}^2$. To do this, note that the fitted values $\hat{Y}_i$ are equal to

$\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.$

Thus,

$\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2=\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),$

which is exactly $\hat{\sigma}^2$. Therefore, $t'=t$, and the two-sample t-test is equivalent to a correlation test.
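The whole equivalence can be checked end to end on hypothetical data: compute $t$ from the two-sample formula and $t'$ from an ordinary least squares fit on the pooled data, and confirm that the two agree.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data for the numerical check.
y0 = rng.normal(0.0, 1.0, size=10)
y1 = rng.normal(1.0, 1.0, size=14)
n0, n1 = len(y0), len(y1)

# Classical two-sample t statistic with the pooled variance estimate.
sigma2 = (np.sum((y0 - y0.mean())**2)
          + np.sum((y1 - y1.mean())**2)) / (n0 + n1 - 2)
t = (y1.mean() - y0.mean()) / np.sqrt(sigma2 * (1/n0 + 1/n1))

# Pool the data and build the design matrix with the dummy column x.
y = np.concatenate([y0, y1])
x = np.concatenate([np.zeros(n0), np.ones(n1)])
X = np.column_stack([np.ones(n0 + n1), x])

# OLS fit and the regression t statistic for beta_1.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_ols = resid @ resid / (n0 + n1 - 2)
XtX_inv = np.linalg.inv(X.T @ X)
t_prime = beta_hat[1] / np.sqrt(sigma2_ols * XtX_inv[1, 1])

print(np.isclose(t, t_prime))  # True
```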

The Friedman-Rafsky test

In the above example, we saw that the two-sample t-test is a special case of the t-test for a regression coefficient. This is neat, but both tests make very strong assumptions about the data. The same phenomenon, however, occurs in a more interesting non-parametric setting.

In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two-sample test that makes no assumptions about the distribution of the data. The two samples do not even have to be real-valued and can instead come from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.

I plan to write a follow-up post explaining these procedures, but you can also read about them in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.

References

Persi Diaconis, "Group Representations in Probability and Statistics," pp. 104-106, Institute of Mathematical Statistics, Hayward, CA, 1988.

Jerome H. Friedman and Lawrence C. Rafsky, "Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests," The Annals of Statistics, 7(4), 697-717, 1979.

Jerome H. Friedman and Lawrence C. Rafsky, "Graph-Theoretic Measures of Multivariate Association and Prediction," The Annals of Statistics, 11(2), 377-391, 1983.