Two sample tests as correlation tests

Suppose we have two samples Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.

The equal variance t-test

In the equal variance t-test, we assume that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim}  \mathcal{N}(\mu_1,\sigma^2), where \sigma^2 is unknown. Our hypothesis that Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} are from the same distribution becomes the hypothesis \mu_0 = \mu_1. The test statistic is

t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

where \overline{Y}^{(0)} and \overline{Y}^{(1)} are the two sample means. The variable \hat{\sigma} is the pooled estimate of the standard deviation and is given by

\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right).

Under the null hypothesis, t follows the T-distribution with n_0+n_1-2 degrees of freedom. We thus reject the null \mu_0=\mu_1 when |t| exceeds the 1-\alpha/2 quantile of the T-distribution.

Pooling the data

We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1} be the pooled data and for i = 1,\ldots, n_0+n_1, define x_i \in \{0,1\} by

x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}

The assumptions that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2) can be rewritten as

Y_i = \beta_0+\beta_1x_i + \varepsilon_i,

where \varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2). That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis \mu_0 = \mu_1 is equivalent to \beta_1 = 0. To test \beta_1 = 0 we can use the standard t-test for a coefficient in linear model. The test statistic in this case is

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},

where \hat{\beta}_1 is the ordinary least squares estimate of \beta_1, X \in \mathbb{R}^{(n_0+n_1)\times 2} is the design matrix and \hat{\sigma}_{OLS} is an estimate of \sigma given by

\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,

where \hat{Y} = \hat{\beta}_0+\hat{\beta}_1x_i is the fitted value of Y_i.

It turns out that t' is exactly equal to t. We can see this by writing out the design matrix and calculating everything above. The design matrix has rows [1,x_i] and is thus equal to

X = \begin{bmatrix} 1&x_1\\ 1&x_2\\  \vdots&\vdots\\ 1&x_{n_0}\\  1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\  \vdots&\vdots\\ 1&0\\  1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.

This implies that

X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},

And therefore,

(X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} = \frac{1}{n_0n_1}\begin{bmatrix} n_1&-n_1\\-n_1&n_0+n_1\end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .

Thus, (X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}. So,

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

which is starting to like t from the two-sample test. Now

X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}.

And so

\hat{\beta} = (X^TX)^{-1}X^TY = \begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix}\begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}=\begin{bmatrix} \overline{Y}^{(0)}\\  \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.

Thus, \hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)} and

t' = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.

This means to show that t' = t, we only need to show that \hat{\sigma}_{OLS}^2=\hat{\sigma}^2. To do this, note that the fitted values \hat{Y} are equal to

\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.

Thus,

\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2=\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),

Which is exactly \hat{\sigma}^2. Therefore, t'=t and the two sample t-test is equivalent to a correlation test.

The Friedman-Rafsky test

In the above example, we saw that the two sample t-test was a special case of the t-test for regressions. This is neat but both tests make very strong assumptions about the data. However, the same thing happens in a more interesting non-parametric setting.

In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two sample tests that makes no assumptions about the distribution of the data. The two samples do not even have to real-valued and can instead be from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.

I plan to write a follow up post explaining these procedures but you can also read about it in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.

References

Persi Diaconis¬†“Group representations in probability and statistics,” pp 104-106, Hayward, CA: Institute of Mathematical Statistics, (1988)

Jerome H. Friedman, Lawrence C. Rafsky “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics, Ann. Statist. 7(4), 697-717, (July, 1979)

Jerome H. Friedman, Lawrence C. Rafsky “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics, Ann. Statist. 11(2), 377-391, (June, 1983).

Leverages Scores

I am very excited to be writing a blog post again – it has been nearly a year! This post marks a new era for the blog. In September I started a statistics PhD at Stanford University. I am really enjoying my classes and I am learning a lot. I might have to change the name of the blog soon but for now let’s stick with “Maths to Share” although you will undoubtedly see more and more statistics here.

Today I would like to talk about leverages scores. Leverages scores are a way to quantify how sensitive a model is and they can be used to explain the different behaviour in these two animations

Linear Models

I recently learnt about leverage scores in the applied statistics course STATS 305A. This course is all about the linear model. In the linear model we assume with have n data points (x_i,y_i) where x_i is a vector in \mathbb{R}^d and y_i is a number in \mathbb{R}. We model y_i as a linear function of x_i plus noise. That is we assume y_i = x_i^T\beta + \varepsilon_i, where \beta \in \mathbb{R}^d is a unknown vector of coefficients and \varepsilon_i is a random variable with mean 0 and variance \sigma^2. We also require that for i \neq j, the random variable \varepsilon_i is uncorrelated with \varepsilon_j.

We can also write this as a matrix equation. Define y to be the vector with entries y_i and define X to be the matrix with rows x_i^T, that is

y = \begin{bmatrix} y_1\\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n and X = \begin{bmatrix} -x_1^T-\\-x_2^T-\\ \vdots \\ -x_n^T-\end{bmatrix} \in \mathbb{R}^{n \times d}.

Then our model can be rewritten as

y = X\beta + \varepsilon,

where \varepsilon \in \mathbb{R}^n is a random vector with mean 0 and covariance matrix \sigma^2 I_n. To simplify calculations we will assume that X contains an intercept term. This means that the first column of X consists of all 1’s.

In the two animations at the start of this post we have two nearly identical data sets. The data sets are an example of simple regression when each vector x_i is of the form (1,z_i) where z_i is a number. The values z_i are on the horizontal axis and the values y_i are on the vertical axis.

Estimating the coefficients

In the linear model we wish to estimate the parameter \beta which contains the coefficients of our model. That is, given a sample (y_i,x_i)_{i=1}^n, we wish to construct a vector \widehat{\beta} which approximates the true parameter \beta. In ordinary least square regression we choose \widehat{\beta} to be the vector b \in \mathbb{R}^d that minimizes the quantity

\sum_{i=1}^n (x_i^T b - y_i)^2=\left \Vert Xb - y \right \Vert_2^2.

Differentiating with respect to b and setting the derivative equal to 0 shows that \widehat{\beta} is a solution to the normal equations:

X^TXb = X^T y.

We will assume that the matrix X^TX is invertible. In this case then the normal equations have a unique solution \widehat{\beta} = (X^TX)^{-1}X^T y.

Now that we have our estimate \widehat{\beta}, we can do prediction. If we are given a new value x' \in \mathbb{R}^d we would use x'^T\widehat{\beta} to predict the corresponding value of y'. This was how the straight lines in the two animations were calculated.

We can also calculate the model’s predicted values for the data x_i that we used to fit the model. These are denoted by \widehat{y}. Note that

\widehat{y} = X\widehat{\beta} = X(X^TX)^{-1}X^Ty = Hy,

where H = X(X^TX)^{-1}X^T is called the hat matrix for the model (since it puts the hat \widehat{ } on y.

Leverage scores

We are now ready to talk about leverage scores and the two animations. For reference, here they are again:

In both animations the stationary line corresponds to an estimator \widehat{\beta} that was calculated using only the black data points. The red points are new data points with different x values and varying y values. The moving line corresponds to an estimator \widehat{\beta} calculated using the red data point as well as all the black points. We can see immediately that if the red point is far away from the “bulk” of the other x points, then the moving line is a lot more sensitive to the y value of the red point.

The leverage score of a data point (x_i,y_i) is defined to be \frac{\partial \widehat{y}_i}{\partial y_i}. That is, the leverage score tells us how much does the prediction \widehat{y}_i change if we change y_i.

Since \widehat{y} = Hy, the leverage score of (x_i,y_i) is H_{ii}, the i^{th} diagonal element of the hat matrix H. The idea is that if a data point (x_i,y_i) has a large leverage score, then the model is more sensitive to changes in that value of y_i.

This can be seen in a leave one out calculation. This calculation tells us what we should expect if we make a leave-one-out model – a model that uses all the data points apart from one. In our animations, this corresponds to the stationary line.

The leave one out calculation says that the predicted value using all the data is always between the true value and the predicted value from the leave-one-out model. In our animations this can be seen by noting that the moving line (the full model) is always between the red point (the true value) and the stationary line (the leave-one-out model).

Furthermore the leverage score tells us exactly how close the predicted value is to the true value. We can see that the moving line is much closer to the red dot in the high leverage example on the right than the low leverage example on the left.

Mahalanobis distance

We now know that the two animations are showing the sensitivity of a model to two different data points. We know that a model is more sensitive to point with high leverage than to points with low leverage. We still haven’t spoken about why some point have higher leverage and why the point on the right has higher leverage.

It turns out that leverage score are measuring how far away a data point is from the “bulk” of the other x_i‘s. More specifically in a one dimensional example like what we have in the animations

H_{ii} = \frac{1}{n}\left(\frac{1}{S^2}(x_i-\bar{x})^2+1\right),

where n is the number of data points, \bar{x} = \frac{1}{n}\sum_{j=1}^n x_j is the sample mean and S^2 = \frac{1}{n}\sum_{j=1}^n (x_j-\bar{x})^2 is the sample variance. Thus high leverage scores correspond to points that are far away from the centre of our data x_i. In higher dimensions a similar result holds if we measure distance using Mahalanobis distance.

The mean of the black data points is approximately 2 and so we can now see why the second point has higher leverage. The two animations were made in Geogebra. You can play around with them here and here.