Suppose we have two samples and and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.
The equal variance t-test
In the equal variance t-test, we assume that and , where is unknown. Our hypothesis that and are from the same distribution becomes the hypothesis . The test statistic is
where and are the two sample means. The variable is the pooled estimate of the standard deviation and is given by
Under the null hypothesis, follows the T-distribution with degrees of freedom. We thus reject the null when exceeds the quantile of the T-distribution.
Pooling the data
We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let be the pooled data and for , define by
The assumptions that and can be rewritten as
where . That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis is equivalent to . To test we can use the standard t-test for a coefficient in linear model. The test statistic in this case is
where is the ordinary least squares estimate of , is the design matrix and is an estimate of given by
where is the fitted value of .
It turns out that is exactly equal to . We can see this by writing out the design matrix and calculating everything above. The design matrix has rows and is thus equal to
This implies that
Thus, . So,
which is starting to like from the two-sample test. Now
This means to show that , we only need to show that . To do this, note that the fitted values are equal to
Which is exactly . Therefore, and the two sample t-test is equivalent to a correlation test.
The Friedman-Rafsky test
In the above example, we saw that the two sample t-test was a special case of the t-test for regressions. This is neat but both tests make very strong assumptions about the data. However, the same thing happens in a more interesting non-parametric setting.
In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two sample tests that makes no assumptions about the distribution of the data. The two samples do not even have to real-valued and can instead be from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.
I plan to write a follow up post explaining these procedures but you can also read about it in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.
Persi Diaconis “Group representations in probability and statistics,” pp 104-106, Hayward, CA: Institute of Mathematical Statistics, (1988)
Jerome H. Friedman, Lawrence C. Rafsky “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics, Ann. Statist. 7(4), 697-717, (July, 1979)
Jerome H. Friedman, Lawrence C. Rafsky “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics, Ann. Statist. 11(2), 377-391, (June, 1983).