Braids and the Yang-Baxter equation

I recently gave a talk on the Yang-Baxter equation. The focus of the talk was to state the connection between the Yang-Baxter equation and the braid relation. This connection comes from a system of interacting particles. In this post, I’ll go over part of my talk. You can access the full set of notes here.

Interacting particles

Imagine two particles on a line, each with a state that can be any element of a set \mathcal{X}. Suppose that the only way particles can change their states is by interacting with each other. An interaction occurs when two particles pass by each other. We could define a function F : \mathcal{X} \times \mathcal{X} \to \mathcal{X} \times \mathcal{X} that describes how the states change after interaction. Specifically, if the first particle is in state x and the second particle is in state y, then their states after interacting will be

F(x,y) = (F_a(x,y), F_b(x,y)) = (\text{new state of particle 1}, \text{new state of particle 2}),

where F_a,F_b : \mathcal{X} \times \mathcal{X} \to \mathcal{X} are the components of F. Recall that the particles move past each other when they interact. Thus, to keep track of the whole system we need an element of \mathcal{X} \times \mathcal{X} to keep track of the states and a permutation \sigma \in S_2 to keep track of the positions.

Three particles

Now suppose that we have 3 particles labelled 1,2,3. As before, each particle has a state in \mathcal{X}. We can thus keep track of the state of each particle with an element of \mathcal{X}^3. The particles also have a position which is described by a permutation \sigma \in S_3. The order the entries of (x,y,z) \in \mathcal{X}^3 corresponds to the labels of the particles not their positions. A possible configuration is shown below:

Three dots. The dots are labelled above 1, 3, 2 and labelled below x, z, y.
A possible configuration of the three particles. The above configuration is escribed as having states (x,y,z) \in \mathcal{X}^3 and positions 132 \in S_3.

As before, the particles can interact with each other. However, we’ll now add the restriction that the particles can only interact two at a time and interacting particles must have adjacent positions. When two particles interact, they swap positions and their states change according to F. The state and position of the remaining particle is unchanged. For example, in the above picture we could interact particles 1 and 3. This will produce the below configuration:

The new configuration after interacting particles 1 and 3 in the first diagram. The configuration is now described by the states F_{13}(x,y,z) \in \mathcal{X}^3 and the permutation 312 \in S_3.

To keep track of how the states of the particles change over time we will introduce three functions from \mathcal{X}^3 to \mathcal{X}^3. These functions are F_{12},F_{13},F_{23}. The function F_{ij} is given by applying F to the i,j coordinates of (x,y,z) and acting by the identity on the remaining coordinate. In symbols,

F_{12}(x,y,z) = (F_a(x,y), F_b(x,y), z),
F_{13}(x,y,z) = (F_a(x,z), y, F_b(x,z)),
F_{23}(x,y,z) = (x, F_a(y,z), F_b(y,z)).

The function F_{ij} exactly describes how the states of the three particles change when particles i and j interact. Now suppose that three particles begin in position 123 and states (x,y,z). We cannot directly interact particles 1 and 3 since they are not adjacent. We have to pass first pass one of the particles through particle 2. This means that there are two ways we can interact particles 1 and 3. These are illustrated below.

The two ways of passing through particle 2 to interact particles 2 and 3.

In the top chain of interactions, we first interact particles 2 and 3. In this chain of interactions, the states evolve as follows:

(x,y,z) \to F_{23}(x,y,z) \to   F_{13}(F_{23}(x,y,z)) \to F_{12}(F_{13}(F_{23}(x,y,z))).

In the bottom chain of interactions, we first interact particles 1 and 2. On this chain, the states evolve in a different way:

(x,y,z) \to F_{12}(x,y,z) \to   F_{13}(F_{12}(x,y,z)) \to F_{23}(F_{13}(F_{12}(x,y,z))).

Note that after both of these chains of interactions the particles are in position 321. The function F is said to solve the Yang–Baxter equation if two chains of interactions also result in the same states.

Definition: A function F : \mathcal{X} \times \mathcal{X} \to \mathcal{X} \times \mathcal{X} is a solution to the set theoretic Yang–Baxter equation if,

    F_{12}\circ F_{13} \circ F_{23} = F_{23} \circ F_{13} \circ F_{12}.

This equation can be visualized as the “braid relation” shown below. Here the strings represent the three particles and interacting two particles corresponds to crossing one string over the other.

The braid relation.

Here are some examples of solutions to the set theoretic Yang-Baxter equation,

  • The identity on \mathcal{X} \times \mathcal{X}.
  • The swap map, (x,y) \mapsto (y,x).
  • If f,g : \mathcal{X} \to \mathcal{X} commute, then the function (x,y) \to (f(x), g(y)) is a solution the Yang-Baxter equation.

In the full set of notes I talk about a number of extensions and variations of the Yang-Baxter equation. These include having more than three particles, allowing for the particle states to be entangle and the parametric Yang-Baxter equation.

Concavity of the squared sum of square roots

The p-norm of a vector x\in \mathbb{R}^n is defined to be:

\displaystyle{\Vert x \Vert_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}}.

If p \ge 1, then the p-norm is convex. When 0<p<1, this function is not convex and actually concave when all the entries of x are non-negative. On a recent exam for the course Convex Optimization, we were asked to prove this when p = 1/2. In this special case, the mathematics simplifies nicely.

When p = 1/2, the p-norm can be described as the squared sum of square roots. Specifically,

\displaystyle{\Vert x \Vert_{1/2} = \left(\sum_{i=1}^n \sqrt{|x_i|} \right)^2}.

Note that we can expand the square and rewrite the function as follows

\displaystyle{\Vert x \Vert_{1/2} = \sum_{i=1}^n\sum_{k=1}^n\sqrt{\left|x_i\right|} \sqrt{|x_k|} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{|x_ix_k|}}.

If we restrict to x \in \mathbb{R}^n with x_i \ge 0 for all i, then this function simplifies to

\displaystyle{\Vert x \Vert_{1/2} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{x_ix_k}},

which is a sum of geometric means. The geometric mean function (u,v) \mapsto \sqrt{uv} is concave when u,v \ge 0. This can be proved by calculating the Hessian of this function and verifying that it is negative semi-definite.

From this, we can conclude that each function x \mapsto \sqrt{x_ix_k} is also concave. This is because x \mapsto \sqrt{x_ix_k} is a linear function followed by a concave function. Finally, any sum of concave functions is also concave and thus \Vert x \Vert_{1/2} is concave.

Hellinger distance

A similar argument can be used to show that Hellinger distance is a convex function. Hellinger distance, d_H(\cdot,\cdot) is defined on pairs of probability distributions p and q on a common set \{1,\ldots,n\}. For such a pair,

\displaystyle{d_H(p,q) = \sum_{i=1}^n \left(\sqrt{p_i}-\sqrt{q_i}\right)^2},

which certainly doesn’t look convex. However, we can expand the square and use the fact that p and q are probability distributions. This shows us that Helligener distance can be written as

\displaystyle{d_H(p,q) = \sum_{i=1}^n p_i - 2\sum_{i=1}^n \sqrt{p_iq_i}+\sum_{i=1}^n q_i = 2 - 2\sum_{i=1}^n\sqrt{p_iq_i}}.

Again, each function (p,q) \mapsto \sqrt{p_iq_i} is concave and so the negative sum of such functions is convex. Thus, d_H(p,q) is convex.

The course

As a final comment, I’d just like to say how much I am enjoying the class. Prof. Stephen Boyd is a great lecturer and we’ve seen a wide variety of applications in the class. I’ve recently been reading a bit of John D Cook’s blog and agree with all he says about the course here.

Log-sum-exp trick with negative numbers

Suppose you want to calculate an expression of the form

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right)},

where \sum_{i=1}^n \exp(a_i) > \sum_{j=1}^m \exp(b_j). Such expressions can be difficult to evaluate directly since the exponentials can easily cause overflow errors. In this post, I’ll talk about a clever way to avoid such errors.

If there were no terms in the second sum we could use the log-sum-exp trick. That is, to calculate

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) \right)},

we set a=\max\{a_i : 1 \le i \le n\} and use the identity

\displaystyle{a + \log\left(\sum_{i=1}^n \exp(a_i-a)\right) = \log\left(\sum_{i=1}^n \exp(a_i)\right)}.

Since a_i \le a for all i =1,\ldots,n, the left hand side of the above equation can be computed without the risk of overflow. To calculate,

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right)},

we can use the above method to separately calculate

\displaystyle{A = \log\left(\sum_{i=1}^n \exp(a_i) \right)} and \displaystyle{B = \log\left(\sum_{j=1}^m \exp(b_j)\right)}.

The final result we want is

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j) \right) = \log(\exp(A) - \exp(B)) = A + \log(1-\exp(A-B))}.

Since $A > B$, the right hand side of the above expression can be evaluated safely and we will have our final answer.

R code

The R code below defines a function that performs the above procedure

# Safely compute log(sum(exp(pos)) - sum(exp(neg)))
# The default value for neg is an empty vector.

logSumExp <- function(pos, neg = c()){
  max_pos <- max(pos)
  A <- max_pos + log(sum(exp(pos - max_pos)))
  
  # If neg is empty, the calculation is done
  if (length(neg) == 0){
    return(A)
  }
  
  # If neg is non-empty, calculate B.
  max_neg <- max(neg)
  B <- max_neg + log(sum(exp(neg - max_neg)))
  
  # Check that A is bigger than B
  if (A <= B) {
    stop("sum(exp(pos)) must be larger than sum(exp(neg))")
  }
  
  # log1p() is a built in function that accurately calculates log(1+x) for |x| << 1
  return(A + log1p(-exp(B - A)))
}

An example

The above procedure can be used to evaulate

\displaystyle{\log\left(\sum_{i=1}^{200} i! - \sum_{i=1}^{500} \binom{500}{i}^2\right)}.

Evaluating this directly would quickly lead to errors since R (and most other programming languages) cannot compute 200!. However, R has the functions lfactorial() and lchoose() which can compute \log(i!) and \log\binom{n}{i} for large values of i and n. We can thus put this expression in the general form at the start of this post

\displaystyle{\log\left(\sum_{i=1}^{200} \exp(\log(i!)) - \sum_{i=1}^{500} \exp\left(2\log\binom{500}{i}\right)\right)}.

The following R code thus us exactly what we want:

pos <- sapply(1:200, lfactorial)
neg <- sapply(1:500, function(i){2*lchoose(500, i)})

logSumExp(pos, neg)
# 863.237

The beta-binomial distribution

The beta-binomial model is a Bayesian model used to analyze rates. For a great derivation and explanation of this model, I highly recommend watching the second lecture from Richard McElreath’s course Statistical Rethinking. In this model, the data, X, is assumed to be binomially distributed with a fixed number of trail N but an unknown rate \rho \in [0,1]. The rate \rho is given a \text{Beta}(a,b) prior. That is the prior distribution of \rho has a density

p(\rho) = \frac{1}{B(a,b)} \rho^{a-1}(1-\rho)^{b-1},

where B(a,b) =\int_0^1 \rho^{a-1}(1-\rho)^{b-1}d\rho is a normalizing constant. The model can thus be written as

\rho \sim \text{Beta}(a,b),
X | \rho \sim \text{Binom}(N,\rho).

This is a conjugate model, meaning that the posterior distribution of \rho is again a beta distribution. This can be seen by using Bayes rule

p(\rho | X) \propto p(X| \rho)p(\rho) \propto \rho^X(1-\rho)^{N-X}\rho^{a-1}(1-\rho)^{b-1}=\rho^{X+a-1}(1-\rho)^{(N-X)+b-1}.

The last expression is proportional to a beta density., specifically \rho | X \sim \text{Beta}(X+a, N-X+b).

The marginal distribution of X

In the above model we are given the distribution of \rho and the conditional distribution of X|\rho. To calculate the distribution of X, we thus need to marginalize over \rho. Specifically,

\displaystyle{p(X) = \int_0^1 p(X,\rho)d\rho = \int_0^1 p(X| \rho)p(\rho)d\rho.}

The term inside the above integral is

\displaystyle{p(X| \rho)p(\rho) = \binom{N}{X}\rho^X(1-\rho)^{N-X}\frac{1}{B(a,b)}\rho^{a-1}(1-\rho)^{b-1} = \frac{\binom{N}{X}}{B(a,b)}\rho^{X+a-1}(1-\rho)^{N-X+b-1} }.

Thus,

\displaystyle{p(X) = \frac{\binom{N}{X}}{B(a,b)} \int_0^1 \rho^{X+a-1}(1-\rho)^{N-X+b-1}d\rho = \binom{N}{X}\frac{B(X+a, N-X+a)}{B(a,b)}}.

This distribution is called the beta-binomial distribution. Below is an image from Wikipedia showing a graph of p(X) for N=10 and a number of different values of a and b. You can see that, especially for small value of a and b the distribution is a lot more spread out than the binomial distribution. This is because there is randomness coming from both \rho and the binomial conditional distribution.

A plot of the beta-binomial distribution for different values of the parameters a and b. For small values of a and b, the distribution is very spread out.

Two sample tests as correlation tests

Suppose we have two samples Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.

The equal variance t-test

In the equal variance t-test, we assume that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim}  \mathcal{N}(\mu_1,\sigma^2), where \sigma^2 is unknown. Our hypothesis that Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} are from the same distribution becomes the hypothesis \mu_0 = \mu_1. The test statistic is

t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

where \overline{Y}^{(0)} and \overline{Y}^{(1)} are the two sample means. The variable \hat{\sigma} is the pooled estimate of the standard deviation and is given by

\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right).

Under the null hypothesis, t follows the T-distribution with n_0+n_1-2 degrees of freedom. We thus reject the null \mu_0=\mu_1 when |t| exceeds the 1-\alpha/2 quantile of the T-distribution.

Pooling the data

We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1} be the pooled data and for i = 1,\ldots, n_0+n_1, define x_i \in \{0,1\} by

x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}

The assumptions that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2) can be rewritten as

Y_i = \beta_0+\beta_1x_i + \varepsilon_i,

where \varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2). That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis \mu_0 = \mu_1 is equivalent to \beta_1 = 0. To test \beta_1 = 0 we can use the standard t-test for a coefficient in linear model. The test statistic in this case is

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},

where \hat{\beta}_1 is the ordinary least squares estimate of \beta_1, X \in \mathbb{R}^{(n_0+n_1)\times 2} is the design matrix and \hat{\sigma}_{OLS} is an estimate of \sigma given by

\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,

where \hat{Y} = \hat{\beta}_0+\hat{\beta}_1x_i is the fitted value of Y_i.

It turns out that t' is exactly equal to t. We can see this by writing out the design matrix and calculating everything above. The design matrix has rows [1,x_i] and is thus equal to

X = \begin{bmatrix} 1&x_1\\ 1&x_2\\  \vdots&\vdots\\ 1&x_{n_0}\\  1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\  \vdots&\vdots\\ 1&0\\  1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.

This implies that

X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},

And therefore,

(X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} = \frac{1}{n_0n_1}\begin{bmatrix} n_1&-n_1\\-n_1&n_0+n_1\end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .

Thus, (X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}. So,

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

which is starting to like t from the two-sample test. Now

X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}.

And so

\hat{\beta} = (X^TX)^{-1}X^TY = \begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix}\begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}=\begin{bmatrix} \overline{Y}^{(0)}\\  \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.

Thus, \hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)} and

t' = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.

This means to show that t' = t, we only need to show that \hat{\sigma}_{OLS}^2=\hat{\sigma}^2. To do this, note that the fitted values \hat{Y} are equal to

\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.

Thus,

\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2=\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),

Which is exactly \hat{\sigma}^2. Therefore, t'=t and the two sample t-test is equivalent to a correlation test.

The Friedman-Rafsky test

In the above example, we saw that the two sample t-test was a special case of the t-test for regressions. This is neat but both tests make very strong assumptions about the data. However, the same thing happens in a more interesting non-parametric setting.

In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two sample tests that makes no assumptions about the distribution of the data. The two samples do not even have to real-valued and can instead be from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.

I plan to write a follow up post explaining these procedures but you can also read about it in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.

References

Persi Diaconis “Group representations in probability and statistics,” pp 104-106, Hayward, CA: Institute of Mathematical Statistics, (1988)

Jerome H. Friedman, Lawrence C. Rafsky “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics, Ann. Statist. 7(4), 697-717, (July, 1979)

Jerome H. Friedman, Lawrence C. Rafsky “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics, Ann. Statist. 11(2), 377-391, (June, 1983).

MCMC for hypothesis testing

A Monte Carlo significance test of the null hypothesis X_0 \sim H requires creating independent samples X_1,\ldots,X_B \sim H. The idea is if X_0 \sim H and independently X_1,\ldots,X_B are i.i.d. from H, then for any test statistic T, the rank of T(X_0) among T(X_0), T(X_1),\ldots, T(X_B) is uniformly distributed on \{1,\ldots, B+1\}. This means that if T(X_0) is one of the k largest values of T(X_b), then we can reject the hypothesis X \sim H at the significance level k/(B+1).

The advantage of Monte Carlo significance tests is that we do not need an analytic expression for the distribution of T(X_0) under X_0 \sim H. By generating the i.i.d. samples X_1,\ldots,X_B, we are making an empirical distribution that approximates the theoretical distribution. However, sometimes sampling X_1,\ldots, X_B is just as intractable as theoretically studying the distribution of T(X_0) . Often approximate samples based on Markov chain Monte Carlo (MCMC) are used instead. However, these samples are not independent and may not be sampling from the true distribution. This means that a test using MCMC may not be statistically valid

In the 1989 paper Generalized Monte Carlo significance tests, Besag and Clifford propose two methods that solve this exact problem. Their two methods can be used in the same settings where MCMC is used but they are statistically valid and correctly control the Type 1 error. In this post, I will describe just one of the methods – the serial test.

Background on Markov chains

To describe the serial test we will need to introduce some notation. Let P denote a transition matrix for a Markov chain on a discrete state space \mathcal{X}. A Markov chain with transition matrix P thus satisfies,

\mathbb{P}(X_n = x_n \mid X_0 = x_0,\ldots, X_{n-1} = x_{n-1}) = P(x_{n-1}, x_n)

Suppose that the transition matrix P has a stationary distribution \pi. This means that if (X_0, X_1,\ldots) is a Markov chain with transition matrix P and X_0 is distributed according to \pi, then X_1 is also distributed according to \pi. This implies that all X_i are distributed according to \pi.

We can construct a new transition matrix Q from P and \pi by Q(y,x) = P(x,y) \frac{\pi(x)}{\pi(y)}. The transition matrix Q is called the reversal of P. This is because for all x and y in \mathcal{X}, \pi(x)P(x,y) = \pi(y)Q(x,y). That is the chance of drawing x from \pi and then transitioning to y according to P is equal to the chance of drawing y from \pi and then transitioning to x according to Q

The new transition matrix Q also allows us to reverse longer runs of the Markov chain. Fix n \ge 0 and let (X_0,X_1,\ldots,X_n) be a Markov chain with transition matrix P and initial distribution \pi. Also let (Y_0,Y_1,\ldots,Y_n) be a Markov chain with transition matrix Q and initial distribution \pi, then

(X_0,X_1,\ldots,X_{n-1}, X_n) \stackrel{dist}{=} (Y_n,Y_{n-1},\ldots, Y_1,Y_0) ,

where \stackrel{dist}{=} means equal in distribution.

The serial test

Suppose we want to test the hypothesis x \sim \pi where x \in \mathcal{X} is our observed data and \pi is some distribution on \mathcal{X}. To conduct the serial test, we need to construct a Markov chain P for which \pi is a stationary distribution. We then also need to construct the reversal Q described above. There are many possible ways to construct P such as the Metropolis-Hastings algorithm.

We also need a test statistic T. This is a function T:\mathcal{X} \to \mathbb{R} which we will use to detect outliers. This function is the same function we would use in a regular Monte Carlo test. Namely, we want to reject the null hypothesis when T(x) is much larger than we would expect under x \sim \pi.

The serial test then proceeds as follows. First we pick \xi uniformly in \{0,1,\ldots,B\} and set X_\xi = x. We then generate (X_\xi, X_{\xi+1},\ldots,X_B) as a Markov chain with transition matrix P that starts at X_\xi = x. Likewise we generate (X_\xi, X_{\xi-1},\ldots, X_0) as a Markov chain that starts from Q.

We then apply T to each of (X_0,X_1,\ldots,X_\xi,\ldots,X_B) and count the number of b \in \{0,1,\ldots,B\} such that T(X_b) \ge T(X_\xi) = T(x). If there are k such b, then the reported p-value of our test is k/(B+1).

We will now show that this test produces a valid p-value. That is, when x \sim \pi, the probability that k/(B+1) is less than \alpha is at most \alpha. In symbols,

\mathbb{P}\left( \frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) \le \alpha

Under the null hypothesis x \sim \pi, (X_\xi, X_{\xi-1},\ldots, X_0) is equal in distribution to generating X_0 from \pi and using the transition matrix P to go from X_i to X_{i+1}.

Thus, under the null hypothesis, the distribution of (X_0,X_1,\ldots,X_B) does not depend on \xi. The entire procedure is equivalent to generating a Markov chain \mathbf{X} = (X_0,X_1,\ldots,X_B) with initial distribution \pi and transition matrix P, and then choosing \xi \in \{0,1,\ldots,B\} independently of \mathbf{X}. This is enough to show that the serial method produces valid p-values. The idea is that since the distribution of \mathbf{X} does not depend on \xi and \xi is uniformly distributed on \{0,1,\ldots,B\}, the probability that T(X_\xi) is in the top \alpha proportion of T(X_b) should be at most \alpha. This is proved more formally below.

For each i \in \{0,1,\ldots,B\}, let A_i be the event that T(X_i) is in the top \alpha proportion of T(X_0),\ldots,T(X_B). That is,

A_i = \left\{ \frac{\# \{b : T(X_b) \ge T(X_i)\} }{B+1} \le \alpha \right\}.

Let I_{A_i} be the indicator function for the event A_i. Since at must \alpha(B+1) values of i can be in the top \alpha fraction of T(X_0),\ldots,T(X_B), we have that

\sum_{i=0}^B I_{A_i} \le \alpha(B+1),

Therefor, by linearity of expectations,

\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha(B+1)

By the law of total probability we have,

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \sum_{i=0}^B \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right)\mathbb{P}(\xi = i),

Since \xi is uniform on \{0,\ldots,B\}, \mathbb{P}(\xi = i) is \frac{1}{B+1} for all i. Furthermore, by independence of \xi and \mathbf{X}, we have

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_i)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}(A_i).

Thus, by our previous bound,

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \frac{1}{B+1}\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha.

Applications

In the original paper by Besag and Clifford, the authors discuss how this procedure can be used to perform goodness-of-fit tests. They construct Markov chains that can test the Rasch model or the Ising model. More broadly the method can be used to tests goodness-of-fit tests for any exponential family by using the Markov chains developed by Diaconis and Sturmfels.

A similar method has also been applied more recently to detect Gerrymandering. In this setting, the null hypothesis is the uniform distribution on all valid redistrictings of a state and the test statistics T measure the political advantage of a given districting.

References

  1. Besag, Julian, and Peter Clifford. “Generalized Monte Carlo Significance Tests.” Biometrika 76, no. 4 (1989)
  2. Persi Diaconis, Bernd Sturmfels “Algebraic algorithms for sampling from conditional distributions,” The Annals of Statistics, Ann. Statist. 26(1), 363-397, (1998)
  3. Chikina, Maria et al. “Assessing significance in a Markov chain without mixing.” Proceedings of the National Academy of Sciences of the United States of America vol. 114,11 (2017)

How to Bake Pi, Sherman-Morrison and log-sum-exp

A few months ago, I had the pleasure of reading Eugenia Cheng‘s book How to Bake Pi. Each chapter starts with a recipe which Cheng links to the mathematical concepts contained in the chapter. The book is full of interesting connections between mathematics and the rest of the world.

One of my favourite ideas in the book is something Cheng writes about equations and the humble equals sign: =. She explains that when an equation says two things are equal we very rarely mean that they are exactly the same thing. What we really mean is that the two things are the same in some ways even though they may be different in others.

One example that Cheng gives is the equation a + b = b+a. This is such a familiar statement that you might really think that a+b and b+a are the same thing. Indeed, if a and b are any numbers, then the number you get when you calculate a + b is the same as the number you get when you calculate b + a. But calculating a+b could be very different from calculating b+a. A young child might calculate a+b by starting with a and then counting one-by-one from a to a + b. If a is 1 and b is 20, then calculating a + b requires counting from 1 to 21 but calculating b+a simply amounts to counting from 20 to 21. The first process takes way longer than the second and the child might disagree that 1 + 20 is the same as 20 + 1.

In How to Bake Pi, Cheng explains that a crucial idea behind equality is context. When someone says that two things are equal we really mean that they are equal in the context we care about. Cheng talks about how context is crucial through-out mathematics and introduces a little bit of category theory as a tool for moving between different contexts. I think that this idea of context is really illuminating and I wanted to share some examples where “=” doesn’t mean “exactly the same as”.

The Sherman-Morrison formula

The Sherman-Morrison formula is a result from linear algebra that says for any invertible matrix A \in \mathbb{R}^{n\times n} and any pair of vectors u,v \in \mathbb{R}^n, if v^TA^{-1}u \neq -1, then A + uv^T is invertible and

(A + uv^T)^{-1} = A^{-1} + \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}

Here “=” means the following:

  1. You can take any natural number n, any matrix A of size n by n, and any length n-vectors u and v that satisfy the above condition.
  2. If you take all those things and carry out all the matrix multiplications, additions and inversions on the left and all the matrix multiplications, additions and inversions on the right, then you will end up with exactly the same matrix in both cases.

But depending on the context, the equation on one side of “=” may be much easier than the other. Although the right hand side looks a lot more complicated, it is much easier to compute in one important context. This context is when we have already calculated the matrix A^{-1} and now want the inverse of A + uv^T. The left hand side naively computes A + uv^T which takes O(n^3) computations since we have to invert a n \times n matrix. On the right hand side, we only need to compute a small number of matrix-vector products and then add two matrices together. This bring the computational cost down to O(n^2).

These cost saving measures come up a lot when studying linear regression. The Sherman-Morrison formula can be used to update regression coefficients when a new data point is added. Similarly, the Sherman-Morrison formula can be used to quickly calculate the fitted values in leave-one-out cross validation.

log-sum-exp

This second example also has connections to statistics. In a mixture model, we assume that each data point Y comes from a distribution of the form:

p(y|\pi,\theta) = \sum_{k=1}^K \pi_k p(y | \theta_k),

where \pi is a vector and \pi_k is equal to the probability that Y came from class k. The parameters \theta_k \in\mathbb{R}^p are the parameters for the k^{th} group.The log-likelihood is thus,

\log\left(\sum_{k=1}^K \pi_k p(y | \theta_k)\right) = \log\left(\sum_{k=1}^K \exp(\eta_{k})\right),

where \eta_{k} = \log(\pi_k p(y| \theta_k)). We can see that the log-likelihood is of the form log-sum-exp. Calculating a log-sum-exp can cause issues with numerical stability. For instance if K = 3 and \eta_k = 1000, for all k=1,2,3, then the final answer is simply \log(3)+1000. However, as soon as we try to calculate \exp(1000) on a computer, we’ll be in trouble.

The solution is to use the following equality, for any \beta \in \mathbb{R},

\log\left(\sum_{k=1}^K \exp(\eta_k) \right) = \beta + \log\left(\sum_{k=1}^K \exp(\beta - \eta_k)\right).

Proving the above identity is a nice exercise in the laws of logarithm’s and exponential’s, but with a clever choice of \beta we can more safely compute the log-sum-exp expression. For instance, in the documentation for pytorch’s implementation of logsumexp() they take \beta to be the maximum of \eta_k. This (hopefully) makes each of the terms \beta - \eta_k a reasonable size and avoids any numerical issues.

Again, the left and right hand sides of the above equation might be the same number, but in the context of having to use computers with limited precision, they represent very different calculations.

Beyond How to Bake Pi

Eugenia Cheng has recently published a new book called The Joy of Abstraction. I’m just over half way through and it’s been a really engaging and interesting introduction to category theory. I’m looking forward to reading the rest of it and getting more insight from Eugenia Cheng’s great mathematical writing.

Looking back on the blog

Next week I’ll be starting the second year of my statistics PhD. I’ve learnt a lot from taking the first year classes and studying for the qualifying exams. Some of what I’ve learnt has given me a new perspective on some of my old blog posts. Here are three things that I’ve written about before and I now understand better.

1. The Pi-Lambda Theorem

An early post on this blog was titled “A minimal counterexample in probability theory“. The post was about a theorem from the probability course offered at the Australian National University. The theorem states that if two probability measures agree on a collection of subsets and the collection is closed under finite intersections, then the two probability measures agree on the \sigma-algebra generated by the collection. In my post I give an example which shows that you need the collection to be closed under finite intersections. I also show that you need to have at least four points in the space to find such an example.

What I didn’t know then is that the above theorem is really a corollary of Dykin’s \pi - \lambda theorem. This theorem was proved in my first graduate probability course which was taught by Persi Diaconis. Professor Diaconis kept a running tally of how many times we used the \pi - \lambda theorem in his course and we got up to at least 10. (For more appearances by Professor Diaconis on this blog see here, here and here).

If I were to write the above post again, I would talk about the \pi - \lambda theorem and rename the post “The smallest \lambda-system”. The example given in my post is really about needing at least four points to find a \lambda-system that is not a \sigma-algebra.

2. Mallow’s Cp statistic

The very first blog post I wrote was called “Complexity Penalties in Statistical Learning“. I wasn’t sure if I would write a second and so I didn’t set up a WordPress account. I instead put the post on the website LessWrong. I no longer associate myself with the rationality community but posting to LessWrong was straight forward and helped me reach more people.

The post was inspired in two ways by the 2019 AMSI summer school. First, the content is from the statistical learning course I took at the summer school. Second, at the career fair many employers advised us to work on our writing skills. I don’t know if would have started blogging if not for the AMSI Summer School.

I didn’t know it at the time but the blog post is really about Mallow’s Cp statistic. Mallow’s Cp statistic is an estimate of the test error of a regression model fit using ordinary least squares. The Mallow’s Cp is equal to the training error plus a “complexity penalty” which takes into account the number of parameters. In the blog post I talk about model complexity and over-fitting. I also write down and explain Mallow’s Cp in the special case of polynomial regression.

In the summer school course I took, I don’t remember the name Mallow’s Cp being used but I thought it was a great idea and enjoyed writing about it. The next time encountered Mallow’s Cp was in the linear models course I took last fall. I was delighted to see it again and learn how it fit into a bigger picture. More recently, I read Brad Efron’s paper “How Biased is the Apparent Error Rate of a Prediction Rule?“. The paper introduces the idea of “effective degrees of freedom” and expands on the ideas behind the Cp statistic.

Incidentally, enrolment is now open for the 2023 AMSI Summer School! This summer it will be hosted at the University of Melbourne. I encourage any Australia mathematics or statistics students reading this to take a look and apply. I really enjoyed going in both 2019 and 2020. (Also if you click on the above link you can try to spot me in the group photo of everyone wearing read shirts!)

3. Finitely additive probability measures

In “Finitely additive measures” I talk about how hard it is to define a finitely additive measure on the set of natural numbers that is not countably additive. In particular, I talked about needing to use the Hahn — Banach extension theorem to extend the natural density from the collection of sets with density to the collection of all subsets of the natural numbers.

There were a number of homework problems in my first graduate probability course that relate to this post. We proved that the sets with density are not closed under finite unions and we showed that the square free numbers have density 6/\pi^2.

We also proved that any finite measure defined on an algebra of subsets can be extend to the collection of all subsets. This proof used Zorn’s lemma and the resulting measure is far from unique. The use of Zorn’s lemma relates to the main idea in my blog, that defining an additive probability measure is in some sense non-constructive.

Other posts

Going forward, I hope to continue publishing at least one new post every month. I look forward to one day writing another post like this when I can look back and reflect on how much I have learnt.

Total Variation and Marathon Running

Total variation is a way of measuring how much a function f:[a,b] \to \mathbb{R} “wiggles”. In this post, I want to motivate the definition of total variation by talking about elevation in marathon running.

Comparing marathon courses

On July 24th I ran the 2022 San Francisco (SF) marathon. All marathons are the same distance, 42.2 kilometres (26.2 miles) but individual courses can vary greatly. Some marathons are on road and others are on trails. Some locations can be hot and others can be rainy. And some, such as the SF marathon, can be much hillier than others. Below is a plot comparing the elevation of the Canberra marathon I ran last year to the elevation of the SF marathon:

A plot showing the relative elevation over the course of the Canberra and San Francisco marathons. Try to spot the two times I ran over the Golden Gate Bridge during the SF marathon.

Immediately, you can see that the range of elevation during the San Francisco marathon was much higher than the range of elevation during the Canberra marathon. However, what made the SF marathon hard wasn’t any individual incline but rather the sheer number of ups and downs. For comparison, the plot below shows elevation during a 32 km training run and elevation during the SF marathon:

A plot showing the relative elevation over the course of a training run and the San Francisco marathon. The big climb during my training run is the Stanford dish.

You can see that my training run was mostly flat but had one big hill in the last 10 kilometres. The maximum relative elevation on my training run was about 50 meters higher than the maximum relative elevation of the marathon, but overall the training run graph is a lot less wiggly. This meant there were far more individual hills during the marathon and so the first 32 km of the marathon felt a lot tougher than the training run. By comparing these two runs, you can see that the elevation range can hide important information about the difficulty of a run. We also need to pay attention to how wiggly the elevation curve is.

Wiggliness Scores

So far our definition of wiggliness has been imprecise and has relied on looking at a graph of the elevation. This makes it hard to compare two runs and quickly decide which one is wigglier. It would be convenient if there was a “wiggliness score” – a single number we could assign to each run which measured the wiggliness of the run’s elevation. Bellow we’ll see that total variation does exactly this.

If we zoom in on one of the graphs above, we would see that it actually consists of tiny straight line segments. For example, let’s look at the 22nd kilometre of the SF marathon. In this plot it looks like elevation is a smooth function of distance:

The relative elevation of the 22nd kilometre during the SF marathon.

But if we zoom in on a 100 m stretch, then we see that the graph is actually a series of straight lines glued together:

The relative elevation over a 100 metres during the marathon.

This is because these graphs are made using my GPS watch which makes one recording per second. If we place dots at each of these times, then the straight lines become clearer:

The relative elevation over the same 100 metre stretch. Each blue dots marks a point when a measurement was made.

We can use these blue dots to define the graph’s wiggliness score. The wiggliness score should capture how much the graph varies across its domain. This suggests that wiggliness scores should be additive. By additive, I mean that if we split the domain into a finite number of pieces, then the wiggliness score across the whole domain should be the sum of the wiggliness score of each segment.

In particular, the wiggliness score for the SF marathon is equal to the sum of the wiggliness score of each section between two consecutive blue dots. This means we only need to quantify how much the graph varies between consecutive blue dots. Fortunately, between two such dots, the graph is a straight line. The amount that a straight line varies is simply the distance between the y-value at the start and the y-value at the end. Thus, by adding up all these little distances we can get a wiggliness score for the whole graph. This wiggliness score is used in mathematics, where it is called the total variation.

Here are the wiggliness scores for the three runs shown above:

RunWiggliness score
Canberra Marathon 2021617 m
Training run742 m
SF Marathon 20222140 m
The total variation or wiggliness score for the three graphs shown above.

Total Variation

We’ve seen that by breaking up a run into little pieces, we can calculate the total variation over the course of the run. But how can we calculate the total variation of an arbitrary function f:[a,b] \to \mathbb{R}?

Our previous approach won’t work because the function f might not be made up of straight lines. But we can approximate f with other functions that are made of straight lines. We can calculate the total variation of these approximations using the approach we used for the marathon runs. Then we define the total variation of f as the limit of the total variation of each of these approximations.

To make this precise, we will work with partitions of [a,b]. A partition of [a,b] is a finite set of points P = \{x_0, x_1,\ldots,x_n\} such that:

a = x_0 < x_1 < \ldots < x_n = b.

That is, x_0, x_1,\ldots,x_n is a collection of increasing points in [a,b] that start at a and end at b. For a given partition P of [a,b], we calculate how much the function f varies over the points in the partition P. As with the blue dots above, we can simply add up the distance between consecutive y values f(x_i) and f(x_{i-1}). In symbols, we define V_P(f) (the variation over f over the partition P) to be:

V_P(f) = \sum\limits_{i=1}^n |f(x_i)-f(x_{i-1})|.

To define the variation of f over the interval [a,b], we can imagine taking finer and finer partitions of [a,b]. To do this, note that whenever we add more points to a partition, the total variation over that partition can only increase. Thus, we can think of the total variation of f as the maximum total variation over any partition. We denote the total variation of f by V(f) and define it as:

V(f) = \sup\{V_P(f) : P \text{ is a partition of } [a,b]\}.

Surprisingly, there exist continuous function for which the total variation is infinite. Sample paths of the Brownian motion are canonical examples of continuous functions with infinite total variation. Such functions would be very challenging runs.

Some Limitations

Total variation does a good job of measuring how wiggly a function is but it has some limitations when applied to course elevation. The biggest issue is that total variation treats inclines and declines symmetrically. A steep line sloping down increases the total variation by the same amount as a line with the same slope going upwards. This obviously isn’t true when running; an uphill is very different to a downhill.

To quantify how much a function wiggles upwards, we could use the same ideas but replace the absolute value |f(x_i)-f(x_{i-1})| with the positive part (f(x_i)-f(x_{i-1}))_+ = \max\{f(x_i)-f(x_{i-1}),0\}. This means that only the lines that slope upwards will count towards the wiggliness score. Lines that slope downwards get a wiggliness score of zero.

Another limitation of total variation is that it measures total wiggliness across the whole domain rather than average wiggliness. This isn’t much of a problem when comparing runs of a similar length, but when comparing runs of different lengths, total variation can give surprising results. Below is a comparison between the Australian Alpine Ascent and the SF marathon:

The Australian Alpine Ascent is a 25 km run that goes up Australia’s tallest mountain. Despite the huge climbs during the Australian Alpine Ascent, the SF marathon has a higher total variation. Since the Australian Alpine Ascent was shorter, it gets a lower wiggliness score (1674 m vs 2140 m). For this comparison it would be better to divide each wiggliness score by the runs’ distance.

Summary

Despite these limitations, I still think that total variation is a useful metric for comparing two runs. It doesn’t tell you exactly how tough a run will be but if you already know the run’s distance and starting/finishing elevation, then the total variation helps you know what to expect.

Finding Australia’s youngest electorates with R

My partner recently wrote an article for Changing Times, a grassroots newspaper that focuses on social change. Her article, Who’s not voting? Engaging with First Nations voters and young voters, is about voter turn-out in Australia and an important read.

While doing research for the article, she wanted to know which electorates in Australia had the highest proportion of young voters. Fortunately the Australian Electoral Commission (AEC) keeps a detailed record of the number of electors in each electorate available here. The records list the number of voters of a given sex and age bracket in each of Australia’s 153 electorates. To calculate and sort the proportion of young voters (18-29) in each electorate using the most recent records, I wrote the below R code for her:

library(tidyverse)

voters <- read_csv("elector-count-fe-2022.csv", 
                   skip = 8,
                   col_names = c("Description",
                                 "18-19", 
                                 "20-24", 
                                 "25-29", 
                                 "30-34", 
                                 "35-39", 
                                 "40-44", 
                                 "45-49", 
                                 "50-54", 
                                 "55-59", 
                                 "60-64", 
                                 "65-69", 
                                 "70+",  
                                 "Total"),
                   col_select = (1:14),
                   col_types = cols(Description = "c",
                                    .default = "n"))


not_electorates = c("NSW", "VIC", "ACT", "WA", "SA", "TAS", 
                    "QLD", "NT", "Grand Total","Female", 
                    "Male", "Indeterminate/Unknown")

electorates <- voters %>% 
  filter(!(Description %in% not_electorates),
         !is.na(Description))  

young_voters <- electorates %>% 
  mutate(Young = `18-19` + `20-24`,
         Proportion = Young/Total,
         rank = min_rank(desc(Proportion))) %>% 
  arrange(rank) %>% 
  select(Electorate = Description, Total, Young, Proportion, rank) 

young_voters

To explain what I did, I’ll first describe the format of the data set from the AEC. The first column contained a description of each row. All the other columns contained the number of voters in a given age bracket. Rows either corresponded to an electorate or a state and either contained the total electorate or a given sex.

For the article my partner wanted to know which electorate had the highest proportion of young voters (18-24) in total so we removed the rows for each state and the rows that selected a specific sex. The next step was to calculate the number of young voters across the age brackets 18-19 and 20-24 and then calculate the proportion of such voters. Once ranked, the five electorates with the highest proportion of young voters were

ElectorateProportion of young voters
Ryan0.146
Brisbane0.132
Griffith0.132
Canberra0.131
Kooyong0.125

In Who’s not Voting?, my partner points out that the three seats with the highest proportion of voters all swung to the Greens at the most recent election. To view all the proportion of young voters across every electorate, you can download this spreadsheet.