Doing Calvin’s homework

Growing up, my siblings and I would read a lot of Bill Watterson’s Calvin and Hobbes. I have fond memories of spending hours reading and re-reading our grandparents collection during the school holidays. The comic strip is witty, heartwarming and beautifully drawn.

Recently, I came across this strip online

Originally published Feb 5th, 1990. You can read the rest of the story arc here:

Seeing Calvin take this test took me back to those school holidays. I remember grabbing paper and a pencil to work out the answer (I also remember getting teased by those previously mentioned siblings). Reading the strip years later, I realized there’s a neat way to work out the answer without having to write much down. Here’s the question again,

Jack and Joe leave their homes at the same time and drive toward each other. Jack drives at 60 mph, while Joe drives at 30 mph. They pass each other in 10 minutes.

How far apart were Jack and Joe when they started?

It can be hard to think about Jack and Joe traveling at different speeds. But we can simplify the problem by thinking from Joe’s perspective or frame of reference. From Joe’s frame of reference, he is staying perfectly still but Jack is zooming towards him at 30+60=90 mph. Jack reaches Joe in 10 minutes which is one sixth of an hour. This means that the initial distance between them must have been 90 \times 1/6 = 15 miles.

It’s not as creative as Calvin’s private eye approach but at least Susie and I got the same answer.

Bernoulli’s inequality and probability

Suppose we have independent events E_1,E_2,\ldots,E_m, each of which occur with probability 1-\varepsilon. The event that all of the E_i occur is E = \bigcap_{i=1}^m E_i. By using independence we can calculate the probability of E,

P(E) = P\left(\bigcap_{i=1}^m E_i\right) = \prod_{i=1}^m P(E_i) = (1-\varepsilon)^m.

We could also get a lower bound on P(E) by using the union bound. This gives,

P(E) = 1-P(E^c) = 1-P\left(\bigcup_{i=1}^m E_i^c\right) \ge `1-\sum_{i=1}^m P(E_i^c) = 1-m\varepsilon.

We can therefore conclude that (1-\varepsilon)^m \ge 1-m\varepsilon. This is an instance of Bernoulli’s inequality. More generally, Bernoulli’s inequality states that

(1+x)^m \ge 1+mx,

for all x \ge -1 and m \ge 1. This inequality states the red line is always underneath the black curve in the below picture. For an interactive version of this graph where you can change the value of m, click here.

Our probabilistic proof only applies to that case when x = -\varepsilon is between -1 and 0 and m is an integer. The more general result can be proved by using convexity. For m \ge 1, the function f(x) = (1+x)^m is convex on the set x > -1. The function g(x) = 1+mx is the tangent line of this function at the point x=0. Convexity of f(x) means that the graph of f(x) is always above the tangent line g(x). This tells us that (1+x)^m \ge 1+mx.

For m between 0 and 1, the function (1+x)^m is no longer convex but actually concave and the inequality reverses. For m \le 0, (1+x)^m becomes concave again. These two cases are visualized below. In the first picture m = 0.5 and the red line is above the black one. In the second picture m = -1 and the black line is back on top.

Solving a system of equations vs inverting a matrix

I used to have trouble understanding why inverting an n \times n matrix required the same order of operations as solving an n \times n system of linear equations. Specifically, if A is an n\times n invertible matrix and b is a length n vector, then computing A^{-1} and solving the equation Ax = b can both be done in O(n^3) floating point operations (flops). This surprised me because naively computing the columns of A^{-1} requires solving the n systems of equations

Ax = e_1, Ax=e_2,\ldots, Ax = e_n,

where e_1,e_2,\ldots, e_n are the standard basis vectors. I thought this would mean that calculating A^{-1} would require roughly n times as many flops as solving a single system of equations. It was only in a convex optimization lecture that I realized what was going on.

To solve a single system of equations Ax = b, we first compute a factorization of A such as the LU factorization. Computing this factorization takes O(n^3) flops but once we have it, we can use it solve any new system with O(n^2) flops. Now to solve Ax=e_1,Ax=e_2,\ldots,Ax=e_n, we can simply factor A once and the perform n solves using the factorization each of which requires an addition O(n^2) flops. The total flop count is then O(n^3)+nO(n^2)=O(n^3). Inverting the matrix requires the same order of flops as a single solve!

Of course, as John D Cook points out: you shouldn’t ever invert a matrix. Even if inverting and solving take the same order of flops, inverting is still more computational expensive and requires more memory.

Related posts

Braids and the Yang-Baxter equation

I recently gave a talk on the Yang-Baxter equation. The focus of the talk was to state the connection between the Yang-Baxter equation and the braid relation. This connection comes from a system of interacting particles. In this post, I’ll go over part of my talk. You can access the full set of notes here.

Interacting particles

Imagine two particles on a line, each with a state that can be any element of a set \mathcal{X}. Suppose that the only way particles can change their states is by interacting with each other. An interaction occurs when two particles pass by each other. We could define a function F : \mathcal{X} \times \mathcal{X} \to \mathcal{X} \times \mathcal{X} that describes how the states change after interaction. Specifically, if the first particle is in state x and the second particle is in state y, then their states after interacting will be

F(x,y) = (F_a(x,y), F_b(x,y)) = (\text{new state of particle 1}, \text{new state of particle 2}),

where F_a,F_b : \mathcal{X} \times \mathcal{X} \to \mathcal{X} are the components of F. Recall that the particles move past each other when they interact. Thus, to keep track of the whole system we need an element of \mathcal{X} \times \mathcal{X} to keep track of the states and a permutation \sigma \in S_2 to keep track of the positions.

Three particles

Now suppose that we have 3 particles labelled 1,2,3. As before, each particle has a state in \mathcal{X}. We can thus keep track of the state of each particle with an element of \mathcal{X}^3. The particles also have a position which is described by a permutation \sigma \in S_3. The order the entries of (x,y,z) \in \mathcal{X}^3 corresponds to the labels of the particles not their positions. A possible configuration is shown below:

Three dots. The dots are labelled above 1, 3, 2 and labelled below x, z, y.
A possible configuration of the three particles. The above configuration is escribed as having states (x,y,z) \in \mathcal{X}^3 and positions 132 \in S_3.

As before, the particles can interact with each other. However, we’ll now add the restriction that the particles can only interact two at a time and interacting particles must have adjacent positions. When two particles interact, they swap positions and their states change according to F. The state and position of the remaining particle is unchanged. For example, in the above picture we could interact particles 1 and 3. This will produce the below configuration:

The new configuration after interacting particles 1 and 3 in the first diagram. The configuration is now described by the states F_{13}(x,y,z) \in \mathcal{X}^3 and the permutation 312 \in S_3.

To keep track of how the states of the particles change over time we will introduce three functions from \mathcal{X}^3 to \mathcal{X}^3. These functions are F_{12},F_{13},F_{23}. The function F_{ij} is given by applying F to the i,j coordinates of (x,y,z) and acting by the identity on the remaining coordinate. In symbols,

F_{12}(x,y,z) = (F_a(x,y), F_b(x,y), z),
F_{13}(x,y,z) = (F_a(x,z), y, F_b(x,z)),
F_{23}(x,y,z) = (x, F_a(y,z), F_b(y,z)).

The function F_{ij} exactly describes how the states of the three particles change when particles i and j interact. Now suppose that three particles begin in position 123 and states (x,y,z). We cannot directly interact particles 1 and 3 since they are not adjacent. We have to pass first pass one of the particles through particle 2. This means that there are two ways we can interact particles 1 and 3. These are illustrated below.

The two ways of passing through particle 2 to interact particles 2 and 3.

In the top chain of interactions, we first interact particles 2 and 3. In this chain of interactions, the states evolve as follows:

(x,y,z) \to F_{23}(x,y,z) \to   F_{13}(F_{23}(x,y,z)) \to F_{12}(F_{13}(F_{23}(x,y,z))).

In the bottom chain of interactions, we first interact particles 1 and 2. On this chain, the states evolve in a different way:

(x,y,z) \to F_{12}(x,y,z) \to   F_{13}(F_{12}(x,y,z)) \to F_{23}(F_{13}(F_{12}(x,y,z))).

Note that after both of these chains of interactions the particles are in position 321. The function F is said to solve the Yang–Baxter equation if two chains of interactions also result in the same states.

Definition: A function F : \mathcal{X} \times \mathcal{X} \to \mathcal{X} \times \mathcal{X} is a solution to the set theoretic Yang–Baxter equation if,

    F_{12}\circ F_{13} \circ F_{23} = F_{23} \circ F_{13} \circ F_{12}.

This equation can be visualized as the “braid relation” shown below. Here the strings represent the three particles and interacting two particles corresponds to crossing one string over the other.

The braid relation.

Here are some examples of solutions to the set theoretic Yang-Baxter equation,

  • The identity on \mathcal{X} \times \mathcal{X}.
  • The swap map, (x,y) \mapsto (y,x).
  • If f,g : \mathcal{X} \to \mathcal{X} commute, then the function (x,y) \to (f(x), g(y)) is a solution the Yang-Baxter equation.

In the full set of notes I talk about a number of extensions and variations of the Yang-Baxter equation. These include having more than three particles, allowing for the particle states to be entangle and the parametric Yang-Baxter equation.

Concavity of the squared sum of square roots

The p-norm of a vector x\in \mathbb{R}^n is defined to be:

\displaystyle{\Vert x \Vert_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}}.

If p \ge 1, then the p-norm is convex. When 0<p<1, this function is not convex and actually concave when all the entries of x are non-negative. On a recent exam for the course Convex Optimization, we were asked to prove this when p = 1/2. In this special case, the mathematics simplifies nicely.

When p = 1/2, the p-norm can be described as the squared sum of square roots. Specifically,

\displaystyle{\Vert x \Vert_{1/2} = \left(\sum_{i=1}^n \sqrt{|x_i|} \right)^2}.

Note that we can expand the square and rewrite the function as follows

\displaystyle{\Vert x \Vert_{1/2} = \sum_{i=1}^n\sum_{k=1}^n\sqrt{\left|x_i\right|} \sqrt{|x_k|} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{|x_ix_k|}}.

If we restrict to x \in \mathbb{R}^n with x_i \ge 0 for all i, then this function simplifies to

\displaystyle{\Vert x \Vert_{1/2} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{x_ix_k}},

which is a sum of geometric means. The geometric mean function (u,v) \mapsto \sqrt{uv} is concave when u,v \ge 0. This can be proved by calculating the Hessian of this function and verifying that it is negative semi-definite.

From this, we can conclude that each function x \mapsto \sqrt{x_ix_k} is also concave. This is because x \mapsto \sqrt{x_ix_k} is a linear function followed by a concave function. Finally, any sum of concave functions is also concave and thus \Vert x \Vert_{1/2} is concave.

Hellinger distance

A similar argument can be used to show that Hellinger distance is a convex function. Hellinger distance, d_H(\cdot,\cdot) is defined on pairs of probability distributions p and q on a common set \{1,\ldots,n\}. For such a pair,

\displaystyle{d_H(p,q) = \sum_{i=1}^n \left(\sqrt{p_i}-\sqrt{q_i}\right)^2},

which certainly doesn’t look convex. However, we can expand the square and use the fact that p and q are probability distributions. This shows us that Helligener distance can be written as

\displaystyle{d_H(p,q) = \sum_{i=1}^n p_i - 2\sum_{i=1}^n \sqrt{p_iq_i}+\sum_{i=1}^n q_i = 2 - 2\sum_{i=1}^n\sqrt{p_iq_i}}.

Again, each function (p,q) \mapsto \sqrt{p_iq_i} is concave and so the negative sum of such functions is convex. Thus, d_H(p,q) is convex.

The course

As a final comment, I’d just like to say how much I am enjoying the class. Prof. Stephen Boyd is a great lecturer and we’ve seen a wide variety of applications in the class. I’ve recently been reading a bit of John D Cook’s blog and agree with all he says about the course here.

Log-sum-exp trick with negative numbers

Suppose you want to calculate an expression of the form

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right)},

where \sum_{i=1}^n \exp(a_i) > \sum_{j=1}^m \exp(b_j). Such expressions can be difficult to evaluate directly since the exponentials can easily cause overflow errors. In this post, I’ll talk about a clever way to avoid such errors.

If there were no terms in the second sum we could use the log-sum-exp trick. That is, to calculate

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) \right)},

we set a=\max\{a_i : 1 \le i \le n\} and use the identity

\displaystyle{a + \log\left(\sum_{i=1}^n \exp(a_i-a)\right) = \log\left(\sum_{i=1}^n \exp(a_i)\right)}.

Since a_i \le a for all i =1,\ldots,n, the left hand side of the above equation can be computed without the risk of overflow. To calculate,

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right)},

we can use the above method to separately calculate

\displaystyle{A = \log\left(\sum_{i=1}^n \exp(a_i) \right)} and \displaystyle{B = \log\left(\sum_{j=1}^m \exp(b_j)\right)}.

The final result we want is

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j) \right) = \log(\exp(A) - \exp(B)) = A + \log(1-\exp(A-B))}.

Since $A > B$, the right hand side of the above expression can be evaluated safely and we will have our final answer.

R code

The R code below defines a function that performs the above procedure

# Safely compute log(sum(exp(pos)) - sum(exp(neg)))
# The default value for neg is an empty vector.

logSumExp <- function(pos, neg = c()){
  max_pos <- max(pos)
  A <- max_pos + log(sum(exp(pos - max_pos)))
  # If neg is empty, the calculation is done
  if (length(neg) == 0){
  # If neg is non-empty, calculate B.
  max_neg <- max(neg)
  B <- max_neg + log(sum(exp(neg - max_neg)))
  # Check that A is bigger than B
  if (A <= B) {
    stop("sum(exp(pos)) must be larger than sum(exp(neg))")
  # log1p() is a built in function that accurately calculates log(1+x) for |x| << 1
  return(A + log1p(-exp(B - A)))

An example

The above procedure can be used to evaulate

\displaystyle{\log\left(\sum_{i=1}^{200} i! - \sum_{i=1}^{500} \binom{500}{i}^2\right)}.

Evaluating this directly would quickly lead to errors since R (and most other programming languages) cannot compute 200!. However, R has the functions lfactorial() and lchoose() which can compute \log(i!) and \log\binom{n}{i} for large values of i and n. We can thus put this expression in the general form at the start of this post

\displaystyle{\log\left(\sum_{i=1}^{200} \exp(\log(i!)) - \sum_{i=1}^{500} \exp\left(2\log\binom{500}{i}\right)\right)}.

The following R code thus us exactly what we want:

pos <- sapply(1:200, lfactorial)
neg <- sapply(1:500, function(i){2*lchoose(500, i)})

logSumExp(pos, neg)
# 863.237

The beta-binomial distribution

The beta-binomial model is a Bayesian model used to analyze rates. For a great derivation and explanation of this model, I highly recommend watching the second lecture from Richard McElreath’s course Statistical Rethinking. In this model, the data, X, is assumed to be binomially distributed with a fixed number of trail N but an unknown rate \rho \in [0,1]. The rate \rho is given a \text{Beta}(a,b) prior. That is the prior distribution of \rho has a density

p(\rho) = \frac{1}{B(a,b)} \rho^{a-1}(1-\rho)^{b-1},

where B(a,b) =\int_0^1 \rho^{a-1}(1-\rho)^{b-1}d\rho is a normalizing constant. The model can thus be written as

\rho \sim \text{Beta}(a,b),
X | \rho \sim \text{Binom}(N,\rho).

This is a conjugate model, meaning that the posterior distribution of \rho is again a beta distribution. This can be seen by using Bayes rule

p(\rho | X) \propto p(X| \rho)p(\rho) \propto \rho^X(1-\rho)^{N-X}\rho^{a-1}(1-\rho)^{b-1}=\rho^{X+a-1}(1-\rho)^{(N-X)+b-1}.

The last expression is proportional to a beta density., specifically \rho | X \sim \text{Beta}(X+a, N-X+b).

The marginal distribution of X

In the above model we are given the distribution of \rho and the conditional distribution of X|\rho. To calculate the distribution of X, we thus need to marginalize over \rho. Specifically,

\displaystyle{p(X) = \int_0^1 p(X,\rho)d\rho = \int_0^1 p(X| \rho)p(\rho)d\rho.}

The term inside the above integral is

\displaystyle{p(X| \rho)p(\rho) = \binom{N}{X}\rho^X(1-\rho)^{N-X}\frac{1}{B(a,b)}\rho^{a-1}(1-\rho)^{b-1} = \frac{\binom{N}{X}}{B(a,b)}\rho^{X+a-1}(1-\rho)^{N-X+b-1} }.


\displaystyle{p(X) = \frac{\binom{N}{X}}{B(a,b)} \int_0^1 \rho^{X+a-1}(1-\rho)^{N-X+b-1}d\rho = \binom{N}{X}\frac{B(X+a, N-X+a)}{B(a,b)}}.

This distribution is called the beta-binomial distribution. Below is an image from Wikipedia showing a graph of p(X) for N=10 and a number of different values of a and b. You can see that, especially for small value of a and b the distribution is a lot more spread out than the binomial distribution. This is because there is randomness coming from both \rho and the binomial conditional distribution.

A plot of the beta-binomial distribution for different values of the parameters a and b. For small values of a and b, the distribution is very spread out.

Two sample tests as correlation tests

Suppose we have two samples Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.

The equal variance t-test

In the equal variance t-test, we assume that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim}  \mathcal{N}(\mu_1,\sigma^2), where \sigma^2 is unknown. Our hypothesis that Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} are from the same distribution becomes the hypothesis \mu_0 = \mu_1. The test statistic is

t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

where \overline{Y}^{(0)} and \overline{Y}^{(1)} are the two sample means. The variable \hat{\sigma} is the pooled estimate of the standard deviation and is given by

\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right).

Under the null hypothesis, t follows the T-distribution with n_0+n_1-2 degrees of freedom. We thus reject the null \mu_0=\mu_1 when |t| exceeds the 1-\alpha/2 quantile of the T-distribution.

Pooling the data

We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1} be the pooled data and for i = 1,\ldots, n_0+n_1, define x_i \in \{0,1\} by

x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}

The assumptions that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2) can be rewritten as

Y_i = \beta_0+\beta_1x_i + \varepsilon_i,

where \varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2). That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis \mu_0 = \mu_1 is equivalent to \beta_1 = 0. To test \beta_1 = 0 we can use the standard t-test for a coefficient in linear model. The test statistic in this case is

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},

where \hat{\beta}_1 is the ordinary least squares estimate of \beta_1, X \in \mathbb{R}^{(n_0+n_1)\times 2} is the design matrix and \hat{\sigma}_{OLS} is an estimate of \sigma given by

\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,

where \hat{Y} = \hat{\beta}_0+\hat{\beta}_1x_i is the fitted value of Y_i.

It turns out that t' is exactly equal to t. We can see this by writing out the design matrix and calculating everything above. The design matrix has rows [1,x_i] and is thus equal to

X = \begin{bmatrix} 1&x_1\\ 1&x_2\\  \vdots&\vdots\\ 1&x_{n_0}\\  1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\  \vdots&\vdots\\ 1&0\\  1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.

This implies that

X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},

And therefore,

(X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} = \frac{1}{n_0n_1}\begin{bmatrix} n_1&-n_1\\-n_1&n_0+n_1\end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .

Thus, (X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}. So,

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

which is starting to like t from the two-sample test. Now

X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}.

And so

\hat{\beta} = (X^TX)^{-1}X^TY = \begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix}\begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}=\begin{bmatrix} \overline{Y}^{(0)}\\  \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.

Thus, \hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)} and

t' = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.

This means to show that t' = t, we only need to show that \hat{\sigma}_{OLS}^2=\hat{\sigma}^2. To do this, note that the fitted values \hat{Y} are equal to

\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.


\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2=\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),

Which is exactly \hat{\sigma}^2. Therefore, t'=t and the two sample t-test is equivalent to a correlation test.

The Friedman-Rafsky test

In the above example, we saw that the two sample t-test was a special case of the t-test for regressions. This is neat but both tests make very strong assumptions about the data. However, the same thing happens in a more interesting non-parametric setting.

In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two sample tests that makes no assumptions about the distribution of the data. The two samples do not even have to real-valued and can instead be from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.

I plan to write a follow up post explaining these procedures but you can also read about it in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.


Persi Diaconis “Group representations in probability and statistics,” pp 104-106, Hayward, CA: Institute of Mathematical Statistics, (1988)

Jerome H. Friedman, Lawrence C. Rafsky “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics, Ann. Statist. 7(4), 697-717, (July, 1979)

Jerome H. Friedman, Lawrence C. Rafsky “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics, Ann. Statist. 11(2), 377-391, (June, 1983).

MCMC for hypothesis testing

A Monte Carlo significance test of the null hypothesis X_0 \sim H requires creating independent samples X_1,\ldots,X_B \sim H. The idea is if X_0 \sim H and independently X_1,\ldots,X_B are i.i.d. from H, then for any test statistic T, the rank of T(X_0) among T(X_0), T(X_1),\ldots, T(X_B) is uniformly distributed on \{1,\ldots, B+1\}. This means that if T(X_0) is one of the k largest values of T(X_b), then we can reject the hypothesis X \sim H at the significance level k/(B+1).

The advantage of Monte Carlo significance tests is that we do not need an analytic expression for the distribution of T(X_0) under X_0 \sim H. By generating the i.i.d. samples X_1,\ldots,X_B, we are making an empirical distribution that approximates the theoretical distribution. However, sometimes sampling X_1,\ldots, X_B is just as intractable as theoretically studying the distribution of T(X_0) . Often approximate samples based on Markov chain Monte Carlo (MCMC) are used instead. However, these samples are not independent and may not be sampling from the true distribution. This means that a test using MCMC may not be statistically valid

In the 1989 paper Generalized Monte Carlo significance tests, Besag and Clifford propose two methods that solve this exact problem. Their two methods can be used in the same settings where MCMC is used but they are statistically valid and correctly control the Type 1 error. In this post, I will describe just one of the methods – the serial test.

Background on Markov chains

To describe the serial test we will need to introduce some notation. Let P denote a transition matrix for a Markov chain on a discrete state space \mathcal{X}. A Markov chain with transition matrix P thus satisfies,

\mathbb{P}(X_n = x_n \mid X_0 = x_0,\ldots, X_{n-1} = x_{n-1}) = P(x_{n-1}, x_n)

Suppose that the transition matrix P has a stationary distribution \pi. This means that if (X_0, X_1,\ldots) is a Markov chain with transition matrix P and X_0 is distributed according to \pi, then X_1 is also distributed according to \pi. This implies that all X_i are distributed according to \pi.

We can construct a new transition matrix Q from P and \pi by Q(y,x) = P(x,y) \frac{\pi(x)}{\pi(y)}. The transition matrix Q is called the reversal of P. This is because for all x and y in \mathcal{X}, \pi(x)P(x,y) = \pi(y)Q(x,y). That is the chance of drawing x from \pi and then transitioning to y according to P is equal to the chance of drawing y from \pi and then transitioning to x according to Q

The new transition matrix Q also allows us to reverse longer runs of the Markov chain. Fix n \ge 0 and let (X_0,X_1,\ldots,X_n) be a Markov chain with transition matrix P and initial distribution \pi. Also let (Y_0,Y_1,\ldots,Y_n) be a Markov chain with transition matrix Q and initial distribution \pi, then

(X_0,X_1,\ldots,X_{n-1}, X_n) \stackrel{dist}{=} (Y_n,Y_{n-1},\ldots, Y_1,Y_0) ,

where \stackrel{dist}{=} means equal in distribution.

The serial test

Suppose we want to test the hypothesis x \sim \pi where x \in \mathcal{X} is our observed data and \pi is some distribution on \mathcal{X}. To conduct the serial test, we need to construct a Markov chain P for which \pi is a stationary distribution. We then also need to construct the reversal Q described above. There are many possible ways to construct P such as the Metropolis-Hastings algorithm.

We also need a test statistic T. This is a function T:\mathcal{X} \to \mathbb{R} which we will use to detect outliers. This function is the same function we would use in a regular Monte Carlo test. Namely, we want to reject the null hypothesis when T(x) is much larger than we would expect under x \sim \pi.

The serial test then proceeds as follows. First we pick \xi uniformly in \{0,1,\ldots,B\} and set X_\xi = x. We then generate (X_\xi, X_{\xi+1},\ldots,X_B) as a Markov chain with transition matrix P that starts at X_\xi = x. Likewise we generate (X_\xi, X_{\xi-1},\ldots, X_0) as a Markov chain that starts from Q.

We then apply T to each of (X_0,X_1,\ldots,X_\xi,\ldots,X_B) and count the number of b \in \{0,1,\ldots,B\} such that T(X_b) \ge T(X_\xi) = T(x). If there are k such b, then the reported p-value of our test is k/(B+1).

We will now show that this test produces a valid p-value. That is, when x \sim \pi, the probability that k/(B+1) is less than \alpha is at most \alpha. In symbols,

\mathbb{P}\left( \frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) \le \alpha

Under the null hypothesis x \sim \pi, (X_\xi, X_{\xi-1},\ldots, X_0) is equal in distribution to generating X_0 from \pi and using the transition matrix P to go from X_i to X_{i+1}.

Thus, under the null hypothesis, the distribution of (X_0,X_1,\ldots,X_B) does not depend on \xi. The entire procedure is equivalent to generating a Markov chain \mathbf{X} = (X_0,X_1,\ldots,X_B) with initial distribution \pi and transition matrix P, and then choosing \xi \in \{0,1,\ldots,B\} independently of \mathbf{X}. This is enough to show that the serial method produces valid p-values. The idea is that since the distribution of \mathbf{X} does not depend on \xi and \xi is uniformly distributed on \{0,1,\ldots,B\}, the probability that T(X_\xi) is in the top \alpha proportion of T(X_b) should be at most \alpha. This is proved more formally below.

For each i \in \{0,1,\ldots,B\}, let A_i be the event that T(X_i) is in the top \alpha proportion of T(X_0),\ldots,T(X_B). That is,

A_i = \left\{ \frac{\# \{b : T(X_b) \ge T(X_i)\} }{B+1} \le \alpha \right\}.

Let I_{A_i} be the indicator function for the event A_i. Since at must \alpha(B+1) values of i can be in the top \alpha fraction of T(X_0),\ldots,T(X_B), we have that

\sum_{i=0}^B I_{A_i} \le \alpha(B+1),

Therefor, by linearity of expectations,

\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha(B+1)

By the law of total probability we have,

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \sum_{i=0}^B \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right)\mathbb{P}(\xi = i),

Since \xi is uniform on \{0,\ldots,B\}, \mathbb{P}(\xi = i) is \frac{1}{B+1} for all i. Furthermore, by independence of \xi and \mathbf{X}, we have

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_i)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}(A_i).

Thus, by our previous bound,

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \frac{1}{B+1}\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha.


In the original paper by Besag and Clifford, the authors discuss how this procedure can be used to perform goodness-of-fit tests. They construct Markov chains that can test the Rasch model or the Ising model. More broadly the method can be used to tests goodness-of-fit tests for any exponential family by using the Markov chains developed by Diaconis and Sturmfels.

A similar method has also been applied more recently to detect Gerrymandering. In this setting, the null hypothesis is the uniform distribution on all valid redistrictings of a state and the test statistics T measure the political advantage of a given districting.


  1. Besag, Julian, and Peter Clifford. “Generalized Monte Carlo Significance Tests.” Biometrika 76, no. 4 (1989)
  2. Persi Diaconis, Bernd Sturmfels “Algebraic algorithms for sampling from conditional distributions,” The Annals of Statistics, Ann. Statist. 26(1), 363-397, (1998)
  3. Chikina, Maria et al. “Assessing significance in a Markov chain without mixing.” Proceedings of the National Academy of Sciences of the United States of America vol. 114,11 (2017)

How to Bake Pi, Sherman-Morrison and log-sum-exp

A few months ago, I had the pleasure of reading Eugenia Cheng‘s book How to Bake Pi. Each chapter starts with a recipe which Cheng links to the mathematical concepts contained in the chapter. The book is full of interesting connections between mathematics and the rest of the world.

One of my favourite ideas in the book is something Cheng writes about equations and the humble equals sign: =. She explains that when an equation says two things are equal we very rarely mean that they are exactly the same thing. What we really mean is that the two things are the same in some ways even though they may be different in others.

One example that Cheng gives is the equation a + b = b+a. This is such a familiar statement that you might really think that a+b and b+a are the same thing. Indeed, if a and b are any numbers, then the number you get when you calculate a + b is the same as the number you get when you calculate b + a. But calculating a+b could be very different from calculating b+a. A young child might calculate a+b by starting with a and then counting one-by-one from a to a + b. If a is 1 and b is 20, then calculating a + b requires counting from 1 to 21 but calculating b+a simply amounts to counting from 20 to 21. The first process takes way longer than the second and the child might disagree that 1 + 20 is the same as 20 + 1.

In How to Bake Pi, Cheng explains that a crucial idea behind equality is context. When someone says that two things are equal we really mean that they are equal in the context we care about. Cheng talks about how context is crucial through-out mathematics and introduces a little bit of category theory as a tool for moving between different contexts. I think that this idea of context is really illuminating and I wanted to share some examples where “=” doesn’t mean “exactly the same as”.

The Sherman-Morrison formula

The Sherman-Morrison formula is a result from linear algebra that says for any invertible matrix A \in \mathbb{R}^{n\times n} and any pair of vectors u,v \in \mathbb{R}^n, if v^TA^{-1}u \neq -1, then A + uv^T is invertible and

(A + uv^T)^{-1} = A^{-1} + \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}

Here “=” means the following:

  1. You can take any natural number n, any matrix A of size n by n, and any length n-vectors u and v that satisfy the above condition.
  2. If you take all those things and carry out all the matrix multiplications, additions and inversions on the left and all the matrix multiplications, additions and inversions on the right, then you will end up with exactly the same matrix in both cases.

But depending on the context, the equation on one side of “=” may be much easier than the other. Although the right hand side looks a lot more complicated, it is much easier to compute in one important context. This context is when we have already calculated the matrix A^{-1} and now want the inverse of A + uv^T. The left hand side naively computes A + uv^T which takes O(n^3) computations since we have to invert a n \times n matrix. On the right hand side, we only need to compute a small number of matrix-vector products and then add two matrices together. This bring the computational cost down to O(n^2).

These cost saving measures come up a lot when studying linear regression. The Sherman-Morrison formula can be used to update regression coefficients when a new data point is added. Similarly, the Sherman-Morrison formula can be used to quickly calculate the fitted values in leave-one-out cross validation.


This second example also has connections to statistics. In a mixture model, we assume that each data point Y comes from a distribution of the form:

p(y|\pi,\theta) = \sum_{k=1}^K \pi_k p(y | \theta_k),

where \pi is a vector and \pi_k is equal to the probability that Y came from class k. The parameters \theta_k \in\mathbb{R}^p are the parameters for the k^{th} group.The log-likelihood is thus,

\log\left(\sum_{k=1}^K \pi_k p(y | \theta_k)\right) = \log\left(\sum_{k=1}^K \exp(\eta_{k})\right),

where \eta_{k} = \log(\pi_k p(y| \theta_k)). We can see that the log-likelihood is of the form log-sum-exp. Calculating a log-sum-exp can cause issues with numerical stability. For instance if K = 3 and \eta_k = 1000, for all k=1,2,3, then the final answer is simply \log(3)+1000. However, as soon as we try to calculate \exp(1000) on a computer, we’ll be in trouble.

The solution is to use the following equality, for any \beta \in \mathbb{R},

\log\left(\sum_{k=1}^K \exp(\eta_k) \right) = \beta + \log\left(\sum_{k=1}^K \exp(\beta - \eta_k)\right).

Proving the above identity is a nice exercise in the laws of logarithm’s and exponential’s, but with a clever choice of \beta we can more safely compute the log-sum-exp expression. For instance, in the documentation for pytorch’s implementation of logsumexp() they take \beta to be the maximum of \eta_k. This (hopefully) makes each of the terms \beta - \eta_k a reasonable size and avoids any numerical issues.

Again, the left and right hand sides of the above equation might be the same number, but in the context of having to use computers with limited precision, they represent very different calculations.

Beyond How to Bake Pi

Eugenia Cheng has recently published a new book called The Joy of Abstraction. I’m just over half way through and it’s been a really engaging and interesting introduction to category theory. I’m looking forward to reading the rest of it and getting more insight from Eugenia Cheng’s great mathematical writing.