# MCMC for hypothesis testing

A Monte Carlo significance test of the null hypothesis $X_0 \sim H$ requires creating independent samples $X_1,\ldots,X_B \sim H$. The idea is if $X_0 \sim H$ and independently $X_1,\ldots,X_B$ are i.i.d. from $H$, then for any test statistic $T$, the rank of $T(X_0)$ among $T(X_0), T(X_1),\ldots, T(X_B)$ is uniformly distributed on $\{1,\ldots, B+1\}$. This means that if $T(X_0)$ is one of the $k$ largest values of $T(X_b)$, then we can reject the hypothesis $X \sim H$ at the significance level $k/(B+1)$.

The advantage of Monte Carlo significance tests is that we do not need an analytic expression for the distribution of $T(X_0)$ under $X_0 \sim H$. By generating the i.i.d. samples $X_1,\ldots,X_B$, we are making an empirical distribution that approximates the theoretical distribution. However, sometimes sampling $X_1,\ldots, X_B$ is just as intractable as theoretically studying the distribution of $T(X_0)$ . Often approximate samples based on Markov chain Monte Carlo (MCMC) are used instead. However, these samples are not independent and may not be sampling from the true distribution. This means that a test using MCMC may not be statistically valid

In the 1989 paper Generalized Monte Carlo significance tests, Besag and Clifford propose two methods that solve this exact problem. Their two methods can be used in the same settings where MCMC is used but they are statistically valid and correctly control the Type 1 error. In this post, I will describe just one of the methods – the serial test.

## Background on Markov chains

To describe the serial test we will need to introduce some notation. Let $P$ denote a transition matrix for a Markov chain on a discrete state space $\mathcal{X}.$ A Markov chain with transition matrix $P$ thus satisfies,

$\mathbb{P}(X_n = x_n \mid X_0 = x_0,\ldots, X_{n-1} = x_{n-1}) = P(x_{n-1}, x_n)$

Suppose that the transition matrix $P$ has a stationary distribution $\pi$. This means that if $(X_0, X_1,\ldots)$ is a Markov chain with transition matrix $P$ and $X_0$ is distributed according to $\pi$, then $X_1$ is also distributed according to $\pi$. This implies that all $X_i$ are distributed according to $\pi$.

We can construct a new transition matrix $Q$ from $P$ and $\pi$ by $Q(y,x) = P(x,y) \frac{\pi(x)}{\pi(y)}$. The transition matrix $Q$ is called the reversal of $P$. This is because for all $x$ and $y$ in $\mathcal{X}$, $\pi(x)P(x,y) = \pi(y)Q(x,y)$. That is the chance of drawing $x$ from $\pi$ and then transitioning to $y$ according to $P$ is equal to the chance of drawing $y$ from $\pi$ and then transitioning to $x$ according to $Q$

The new transition matrix $Q$ also allows us to reverse longer runs of the Markov chain. Fix $n \ge 0$ and let $(X_0,X_1,\ldots,X_n)$ be a Markov chain with transition matrix $P$ and initial distribution $\pi$. Also let $(Y_0,Y_1,\ldots,Y_n)$ be a Markov chain with transition matrix $Q$ and initial distribution $\pi$, then

$(X_0,X_1,\ldots,X_{n-1}, X_n) \stackrel{dist}{=} (Y_n,Y_{n-1},\ldots, Y_1,Y_0)$,

where $\stackrel{dist}{=}$ means equal in distribution.

## The serial test

Suppose we want to test the hypothesis $x \sim \pi$ where $x \in \mathcal{X}$ is our observed data and $\pi$ is some distribution on $\mathcal{X}$. To conduct the serial test, we need to construct a Markov chain $P$ for which $\pi$ is a stationary distribution. We then also need to construct the reversal $Q$ described above. There are many possible ways to construct $P$ such as the Metropolis-Hastings algorithm.

We also need a test statistic $T$. This is a function $T:\mathcal{X} \to \mathbb{R}$ which we will use to detect outliers. This function is the same function we would use in a regular Monte Carlo test. Namely, we want to reject the null hypothesis when $T(x)$ is much larger than we would expect under $x \sim \pi$.

The serial test then proceeds as follows. First we pick $\xi$ uniformly in $\{0,1,\ldots,B\}$ and set $X_\xi = x$. We then generate $(X_\xi, X_{\xi+1},\ldots,X_B)$ as a Markov chain with transition matrix $P$ that starts at $X_\xi = x$. Likewise we generate $(X_\xi, X_{\xi-1},\ldots, X_0)$ as a Markov chain that starts from $Q$.

We then apply $T$ to each of $(X_0,X_1,\ldots,X_\xi,\ldots,X_B)$ and count the number of $b \in \{0,1,\ldots,B\}$ such that $T(X_b) \ge T(X_\xi) = T(x)$. If there are $k$ such $b$, then the reported p-value of our test is $k/(B+1)$.

We will now show that this test produces a valid p-value. That is, when $x \sim \pi$, the probability that $k/(B+1)$ is less than $\alpha$ is at most $\alpha$. In symbols,

$\mathbb{P}\left( \frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) \le \alpha$

Under the null hypothesis $x \sim \pi$, $(X_\xi, X_{\xi-1},\ldots, X_0)$ is equal in distribution to generating $X_0$ from $\pi$ and using the transition matrix $P$ to go from $X_i$ to $X_{i+1}$.

Thus, under the null hypothesis, the distribution of $(X_0,X_1,\ldots,X_B)$ does not depend on $\xi$. The entire procedure is equivalent to generating a Markov chain $\mathbf{X} = (X_0,X_1,\ldots,X_B)$ with initial distribution $\pi$ and transition matrix $P$, and then choosing $\xi \in \{0,1,\ldots,B\}$ independently of $\mathbf{X}$. This is enough to show that the serial method produces valid p-values. The idea is that since the distribution of $\mathbf{X}$ does not depend on $\xi$ and $\xi$ is uniformly distributed on $\{0,1,\ldots,B\}$, the probability that $T(X_\xi)$ is in the top $\alpha$ proportion of $T(X_b)$ should be at most $\alpha$. This is proved more formally below.

For each $i \in \{0,1,\ldots,B\}$, let $A_i$ be the event that $T(X_i)$ is in the top $\alpha$ proportion of $T(X_0),\ldots,T(X_B)$. That is,

$A_i = \left\{ \frac{\# \{b : T(X_b) \ge T(X_i)\} }{B+1} \le \alpha \right\}$.

Let $I_{A_i}$ be the indicator function for the event $A_i$. Since at must $\alpha(B+1)$ values of $i$ can be in the top $\alpha$ fraction of $T(X_0),\ldots,T(X_B)$, we have that

$\sum_{i=0}^B I_{A_i} \le \alpha(B+1)$,

Therefor, by linearity of expectations,

$\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha(B+1)$

By the law of total probability we have,

$\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \sum_{i=0}^B \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right)\mathbb{P}(\xi = i)$,

Since $\xi$ is uniform on $\{0,\ldots,B\}$, $\mathbb{P}(\xi = i)$ is $\frac{1}{B+1}$ for all $i$. Furthermore, by independence of $\xi$ and $\mathbf{X}$, we have

$\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_i)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}(A_i)$.

Thus, by our previous bound,

$\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \frac{1}{B+1}\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha$.

## Applications

In the original paper by Besag and Clifford, the authors discuss how this procedure can be used to perform goodness-of-fit tests. They construct Markov chains that can test the Rasch model or the Ising model. More broadly the method can be used to tests goodness-of-fit tests for any exponential family by using the Markov chains developed by Diaconis and Sturmfels.

A similar method has also been applied more recently to detect Gerrymandering. In this setting, the null hypothesis is the uniform distribution on all valid redistrictings of a state and the test statistics $T$ measure the political advantage of a given districting.