Poisson approximations to the negative binomial distribution

This post is an introduction to the negative binomial distribution and a discussion of different ways of approximating the negative binomial distribution.

The negative binomial distribution describes the number of times a coin lands on tails before a certain number of heads are recorded. The distribution depends on two parameters p and r. The parameter p is the probability that the coin lands on heads and r is the number of heads. If X has the negative binomial distribution, then the event X = x means that the first x+r-1 tosses of the coin contained exactly r-1 heads and that toss number x+r was a head. This means that the probability that X=x is given by

\displaystyle{f(x) = \binom{x+r-1}{r-1}p^{r}\left(1-p\right)^x}

Here is a plot of the function f(x) for different values of r and p.

Poisson approximations

When the parameter r is large and p is close to one, the negative binomial distribution can be approximated by a Poisson distribution. More formally, suppose that r(1-p)=\lambda is held fixed for some positive real number \lambda. As r grows (so that p approaches one), the negative binomial random variable with parameters p and r converges to a Poisson random variable with parameter \lambda. This is illustrated in the picture below, where three negative binomial distributions with r(1-p)=5 approach the Poisson distribution with \lambda =5.

Total variation distance is a common way to measure the distance between two discrete probability distributions. The log-log plot below shows that the error from the Poisson approximation is on the order of 1/r and that the error is bigger if the limiting value of r(1-p) is larger.

It turns out that it is possible to get a more accurate approximation by using a different Poisson distribution. In the first approximation, we used a Poisson random variable with mean \lambda = r(1-p). However, the mean of the negative binomial distribution is r(1-p)/p. This suggests that we can get a better approximation by setting \lambda = r(1-p)/p.

The change from \lambda = r(1-p) to \lambda = r(1-p)/p is small because p \approx 1. However, this small change gives a much better approximation, especially for larger values of r(1-p). The plot below shows that both approximations have errors on the order of 1/r, but the constant for the second approximation is much smaller.
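
To make this concrete, here is a short R sketch that computes the total variation distance between the negative binomial distribution and the two Poisson approximations. The parameter values in the example call are illustrative, not the ones used for the plots above.

# Total variation distance between NB(r, p) and the two Poisson approximations,
# with p chosen so that r*(1 - p) equals a fixed lambda0.
tv_nb_pois <- function(r, lambda0) {
  p <- 1 - lambda0 / r
  x <- 0:1000                                  # truncation; the tails beyond are negligible here
  nb <- dnbinom(x, size = r, prob = p)
  pois1 <- dpois(x, lambda = r * (1 - p))      # first approximation
  pois2 <- dpois(x, lambda = r * (1 - p) / p)  # second approximation
  c(tv1 = sum(abs(nb - pois1)) / 2, tv2 = sum(abs(nb - pois2)) / 2)
}
tv_nb_pois(r = 100, lambda0 = 5)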

Second order accurate approximation

It is possible to further improve the Poisson approximation by using a Gram–Charlier expansion. A Gram–Charlier approximation for the Poisson distribution is given in this paper.1 The approximation is

\displaystyle{f_{GC}(x) = P_\lambda(x) - \frac{1}{2}(1-p)\left((x-\lambda)P_\lambda(x)-(x-1-\lambda)P_\lambda(x-1)\right)},

where \lambda = \frac{r(1-p)}{p} as in the second Poisson approximation and P_\lambda(x) is the Poisson pmf evaluated at x.

The Gram–Charlier expansion is considerably more accurate than either Poisson approximation. The errors are on the order of 1/r^2. This higher accuracy means that the error curves for the Gram–Charlier expansion have a steeper slope.
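
Here is a minimal R sketch of the Gram–Charlier approximation, assuming the formula above; it compares f_{GC} with the exact negative binomial pmf for illustrative parameter values.

# Gram-Charlier approximation to the negative binomial pmf.
gc_pmf <- function(x, r, p) {
  lambda <- r * (1 - p) / p
  P0 <- dpois(x, lambda)
  P1 <- dpois(x - 1, lambda)  # dpois returns 0 when x - 1 < 0
  P0 - 0.5 * (1 - p) * ((x - lambda) * P0 - (x - 1 - lambda) * P1)
}
r <- 50; p <- 1 - 5 / r; x <- 0:40  # illustrative values with r*(1 - p) = 5
max(abs(gc_pmf(x, r, p) - dnbinom(x, size = r, prob = p)))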

  1. The approximation is given in equation (4) of the paper and is stated in terms of the CDF instead of the PMF. The equation also contains a small typo: it should say \frac{1}{2}q instead of \frac{1}{2}p. ↩︎

“Uniformly random”

The term “uniformly random” sounds like a contradiction. How can the word “uniform” be used to describe anything that’s random? Uniformly random actually has a precise meaning, and, in a sense, means “as random as possible.” I’ll explain this with an example about shuffling cards.

Shuffling cards

Suppose I have a deck of ten cards labeled 1 through 10. Initially, the cards are face down and in perfect order. The card labeled 10 is on top of the deck. The card labeled 9 is second from the top, and so on down to the card labeled 1. The cards are definitely not random.

Next, I generate a random number between 1 and 10. I then find the card with the corresponding label and put it face down on top of the deck. The cards are now somewhat random. The number on top could be anything, but the rest of the cards are still in order. The cards are random, but they are not uniformly random.

Now suppose that I keep generating random numbers and moving cards to the top of the deck. Each time I do this, the cards get more random. Eventually (after about 30 moves1) the cards will be really jumbled up. Even if you knew the first few cards, it would be hard to predict the order of the remaining ones. Once the cards are really shuffled, they are uniformly random.
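
This shuffle is easy to simulate. Below is a small R sketch of the “move a random card to the top” process described above, using the deck size and number of moves from this post.

# Repeatedly move a uniformly chosen card to the top of the deck.
move_to_top <- function(deck) {
  card <- sample(deck, 1)      # generate a random label
  c(card, deck[deck != card])  # place that card on top
}
deck <- 10:1                   # card 10 on top, down to card 1
for (move in 1:30) deck <- move_to_top(deck)
deck                           # one realization of a well-shuffled deck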

Uniformly random

A deck of cards is uniformly random if each of the possible arrangements of the cards is equally likely. After only moving one card, the deck of cards is not uniformly random. This is because there are only 10 possible arrangements of the deck. Once the deck is well-shuffled, all of the 3,628,800 possible arrangements are equally likely.

In general, something is uniformly random if each possibility is equally likely. So the outcome of rolling a fair 6-sided die is uniformly random, but rolling a loaded die is not. The word “uniform” refers to the chance of each possibility (1/6 for each side of the die). These chances are all the same and “uniform”.

This is why uniformly random can mean “as random as possible.” If you are using a fair die or a well-shuffled deck, there are no biases in the outcome. Mathematically, no outcome is more likely than any other, so you cannot do better than guessing.

Communicating research

The inspiration for this post came from a conversation I had earlier in the week. I was telling someone about my research. As an example, I talked about how long it takes for a deck of cards to become uniformly random. They quickly stopped me and asked how the two words could ever go together. It was a good point! I use the words uniformly random all the time and had never realized this contradiction.2 It was a good reminder about the challenge of clear communication.

Footnotes

  1. The number of moves it takes for the deck to be well-shuffled is actually random. But on average it takes around 30 moves. For the mathematical details, see Example 1 in Shuffling Cards and Stopping Times by David Aldous and Persi Diaconis. ↩︎
  2. Of the six posts I published last year, five contain the word uniform! ↩︎

Understanding the Ratio of Uniforms Distribution

The ratio of uniforms distribution is a useful distribution for rejection sampling. It gives a simple and fast way to sample from discrete distributions like the hypergeometric distribution1. To use the ratio of uniforms distribution in rejection sampling, we need to know the distribution’s density. This post summarizes some properties of the ratio of uniforms distribution and computes its density.

The ratio of uniforms distribution is the distribution of the ratio of two independent uniform random variables. Specifically, suppose U \in [-1,1] and V \in [0,1] are independent and uniformly distributed. Then R = U/V has the ratio of uniforms distribution. The plot below shows a histogram based on 10,000 samples from the ratio of uniforms distribution2.
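
For reference, here is a small R sketch of how such a histogram can be produced. The truncation to [-8, 8] follows footnote 2.

# Sample R = U/V and plot a histogram of the samples.
set.seed(1)
U <- runif(10000, min = -1, max = 1)
V <- runif(10000, min = 0, max = 1)
R <- U / V
hist(R[abs(R) <= 8], breaks = 100, freq = FALSE, xlab = "R", main = "Ratio of uniforms")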

The histogram has a flat section in the middle and then curves down on either side. This distinctive shape is called a “table mountain”. The density of R also has a table mountain shape.

And here is the density plotted on top of the histogram.

A formula for the density of R is

\displaystyle{h(R) = \begin{cases} \frac{1}{4} & \text{if } -1 \le R \le 1, \\\frac{1}{4R^2} & \text{if } R < -1 \text{ or } R > 1.\end{cases}}

The first case in the definition of h corresponds to the flat part of the table mountain. The second case corresponds to the sloping curves. The rest of this post uses geometry to derive the above formula for h(R).
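
Before the derivation, a quick numerical sanity check: the R snippet below overlays the claimed density on the histogram from the previous sketch.

# Overlay the claimed density h on the histogram above.
h <- function(R) ifelse(abs(R) <= 1, 1 / 4, 1 / (4 * R^2))
curve(h(x), from = -8, to = 8, n = 1000, add = TRUE, col = "red")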

Calculating the density

The point (U,V) is uniformly distributed in the box B=[-1,1] \times [0,1]. The image below shows an example of a point (U,V) inside the box B.

We can compute the ratio R = U/V geometrically. First we draw a straight line that starts at (0,0) and goes through (U,V). This line hits the horizontal line y=1, and the x coordinate of the intersection point is exactly R=U/V.

In the above picture, all of the points on the dashed line map to the same value of R. We can compute the density of R by computing an area. The probability that R is in a small interval [R,R+dR] is

\displaystyle{\frac{\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\})}{\text{Area}(B)} = \frac{1}{2}\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}).}

If we can compute the above area, then we will know the density of R because by definition

\displaystyle{h(R) = \lim_{dR \to 0} \frac{1}{2dR}\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\})}.

We will first work on the case when R is between -1 and 1. In this case, the set \{(u,v) \in B : u/v \in [R, R+dR]\} is a triangle. This triangle is drawn in blue below.

The horizontal edge of this triangle has length dR. The perpendicular height of the triangle from the horizontal edge is 1. This means that

\displaystyle{\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}) =\frac{1}{2}\times dR \times 1=\frac{dR}{2}}.

And so, when R \in [-1,1] we have

\displaystyle{h(R) = \lim_{dR\to 0} \frac{1}{2dR}\times \frac{dR}{2}=\frac{1}{4}}.

Now let’s work on the case when R is bigger than 1 or less than -1. In this case, the set \{(u,v) \in B : u/v \in [R, R+dR]\} is again a triangle. But now the triangle has a vertical edge and is much skinnier. Below the triangle is drawn in red. Note that only points inside the box B are coloured in.

The vertical edge of the triangle has length \frac{1}{R} - \frac{1}{R+dR}= \frac{dR}{R(R+dR)}. The perpendicular height of the triangle from the vertical edge is 1. Putting this together

\displaystyle{\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}) =\frac{1}{2}\times \frac{dR}{R(R+dR)} \times 1=\frac{dR}{2R(R+dR)}}.

And so

\displaystyle{h(R) = \lim_{dR \to 0} \frac{1}{2dR} \times \frac{dR}{2 R(R+dR)} = \frac{1}{4R^2}}.

Putting everything together, we get

\displaystyle{h(R) = \begin{cases} \frac{1}{4} & \text{if } -1 \le R \le 1, \\\frac{1}{4R^2} & \text{if } R < -1 \text{ or } R > 1.\end{cases}}

Footnotes and references

  1. https://ieeexplore.ieee.org/document/718718 ↩︎
  2. For visual purposes, I restricted the sample to values of R between -8 and 8. This is because the ratio of uniforms distribution has heavy tails, which meant that there were some very large values of R that made the plot hard to see. ↩︎

The discrete arcsine distribution

The discrete arcsine distribution is a probability distribution on \{0,1,\ldots,n\}. It is a U-shaped distribution: there are peaks at 0 and n and a dip in the middle. The figure below shows the probability distribution function for n=10, 15, 20.

The probability distribution function of the arcsine distribution is given by

\displaystyle{p_n(k) = \frac{1}{2^{2n}}\binom{2k}{k}\binom{2n-2k}{n-k}\text{ for } k \in \{0,1,\ldots,n\}}

The discrete arcsine distribution is related to simple random walks and to an interesting Markov chain called the Burnside process. The connection with simple random walks is explained in Chapter 3, Volume 1 of An Introduction to Probability Theory and Its Applications by William Feller. The connection to the Burnside process was discovered by Persi Diaconis in Analysis of a Bose-Einstein Markov Chain.

The discrete arcsine distribution gets its name from the continuous arcsine distribution. Suppose X_n is distributed according to the discrete arcsine distribution with parameter n. Then the normalized random variables X_n/n converge in distribution to the continuous arcsine distribution on [0,1]. The continuous arcsine distribution has the probability density function

\displaystyle{f(x) = \frac{1}{\pi\sqrt{x(1-x)}}  \text{ for } 0 \le x \le 1}

This means that the continuous arcsine distribution is a beta distribution with \alpha=\beta=1/2. It is called the arcsine distribution because the cumulative distribution function involves the arcsine function

\displaystyle{F(x) = \int_0^x f(y)dy = \frac{2}{\pi}\arcsin(\sqrt{x}) \text{ for } 0 \le x \le 1}

There is another connection between the discrete and continuous arcsine distributions. The continuous arcsine distribution can be used to sample the discrete arcsine distribution. The two step procedure below produces a sample from the discrete arcsine distribution with parameter n:

  1. Sample p from the continuous arcsine distribution.
  2. Sample X from the binomial distribution with parameters n and p.
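
Here is a minimal R sketch of this two-step procedure (the value n = 20 and the sample size are illustrative), together with a comparison to the pmf p_n(k) given above.

# Sample the discrete arcsine distribution via the two-step recipe.
set.seed(1)
n <- 20
p <- rbeta(100000, 1/2, 1/2)             # step 1: continuous arcsine = Beta(1/2, 1/2)
X <- rbinom(100000, size = n, prob = p)  # step 2: binomial given p
k <- 0:n
pmf <- choose(2 * k, k) * choose(2 * (n - k), n - k) / 4^n
round(rbind(empirical = tabulate(X + 1, nbins = n + 1) / 100000, exact = pmf), 4)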

This means that the discrete arcsine distribution is actually the beta-binomial distribution with parameters \alpha = \beta =1/2. I was surprised when I was told this, and couldn’t find a reference. The rest of this blog post proves that the discrete arcsine distribution is an instance of the beta-binomial distribution.

As I showed in this post, the beta-binomial distribution has probability distribution function:

\displaystyle{q_{\alpha,\beta,n}(k) = \binom{n}{k}\frac{B(k+\alpha, n-k+\beta)}{B(\alpha,\beta)}},

where B(x,y)=\frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)} is the beta function. To show that the discrete arcsine distribution is an instance of the beta-binomial distribution we need that p_n(k)=q_{1/2,1/2,n}(k). That is,

\displaystyle{ \binom{n}{k}\frac{B(k+1/2, n-k+1/2)}{B(1/2,1/2)} = \frac{1}{2^{2n}}\binom{2k}{k}\binom{2n-2k}{n-k}},

for all k = 0,1,\ldots,n. To prove the above equation, we can first simplify q_{1/2,1/2,n}(k). By definition

\displaystyle{\frac{B(k+1/2, n-k+1/2)}{B(1/2,1/2)} = \frac{\frac{\Gamma(k+1/2)\Gamma(n-k+1/2)}{\Gamma(n+1)}}{\frac{\Gamma(1/2)\Gamma(1/2)}{\Gamma(1)}} = \frac{1}{n!}\frac{\Gamma(k+1/2)}{\Gamma(1/2)}\frac{\Gamma(n-k+1/2)}{\Gamma(1/2)}},

where I have used that \Gamma(m)=(m-1)! if m is a natural number. The Gamma function \Gamma(x) also satisfies the property \Gamma(x+1)=x\Gamma(x). Using this repeatedly gives

\displaystyle{\Gamma(k+1/2) = (k-1/2) \times (k-3/2) \times \cdots \times \frac{3}{2}\times\frac{1}{2}\times\Gamma(1/2) }.

This means that

\displaystyle{\frac{\Gamma(k+1/2)}{\Gamma(1/2)} = (k-1/2) \times (k-3/2) \times \cdots \times \frac{3}{2}\times\frac{1}{2} = \frac{(2k-1)\times(2k-3)\times \cdots \times 3 \times 1}{2^k}=\frac{(2k-1)!!}{2^k}},

where (2k-1)!!=(2k-1)\times (2k-3)\times\cdots \times 3 \times 1 is the double factorial. The same reasoning gives

\displaystyle{\frac{\Gamma(n-k+1/2)}{\Gamma(1/2)} =\frac{(2n-2k-1)!!}{2^{n-k}}}.

And so

\displaystyle{q_{1/2,1/2,n}(k) =\frac{1}{2^nk!(n-k)!}(2k-1)!!(2n-2k-1)!!}.

We’ll now show that p_n(k) is also equal to the above final expression. Recall

\displaystyle{p_n(k) = \frac{1}{2^{2n}} \binom{2k}{k}\binom{2(n-k)}{n-k} = \frac{1}{2^{2n}}\frac{(2k)!(2(n-k))!}{k!k!(n-k)!(n-k)!} = \frac{1}{2^nk!(n-k)!}\frac{(2k)!}{k!2^k}\frac{(2n-2k)!}{(n-k)!2^{n-k}}}.

And so it suffices to show \frac{(2k)!}{k!2^k} = (2k-1)!! (and hence \frac{(2n-2k)!}{(n-k)!2^{n-k}}=(2n-2k-1)!!). To see why this last claim holds, note that

\displaystyle{\frac{(2k)!}{k!2^k} = \frac{(2k)\times (2k-1)\times(2k-2)\times\cdots\times 3 \times 2 \times 1}{(2k)\times (2k-2)\times \cdots \times 2} = (2k-1)!!}

This shows that p_{n}(k)=q_{1/2,1/2,n}(k), as claimed.
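
The identity can also be checked numerically. Here is a short R sketch with an illustrative value of n:

# Check p_n(k) = q_{1/2,1/2,n}(k) numerically.
n <- 25; k <- 0:n
p_arcsine <- choose(2 * k, k) * choose(2 * (n - k), n - k) / 4^n
q_betabin <- choose(n, k) * beta(k + 1/2, n - k + 1/2) / beta(1/2, 1/2)
max(abs(p_arcsine - q_betabin))  # agrees up to floating point error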

The sample size required for importance sampling

My last post was about using importance sampling to estimate the volume of a high-dimensional ball. The two figures below compare plain Monte Carlo to importance sampling with a Gaussian proposal. Both plots use M=1,000 samples to estimate v_n, the volume of an n-dimensional ball.

A friend of mine pointed out that the relative error does not seem to increase with the dimension n. He thought it was too good to be true. It turns out he was right: the relative error does increase with dimension, but it increases very slowly. To estimate v_n, the number of samples needs to grow on the order of \sqrt{n}.

To prove this, we will use the paper The sample size required in importance sampling by Chatterjee and Diaconis [1]. This paper shows that the sample size for importance sampling is determined by the Kullback–Leibler divergence. The relevant result from their paper is Theorem 1.3. This theorem is about the relative error in using importance sampling to estimate a probability.

In our setting the proposal distribution is Q=\mathcal{N}(0,\frac{1}{n}I_n). That is, Q is an n-dimensional Gaussian vector with mean 0 and covariance \frac{1}{n}I_n. The target distribution is P, the uniform distribution on the n-dimensional ball. Theorem 1.3 in [1] tells us how many samples are needed to estimate v_n. Informally, the required sample size is M = O(\exp(D(P \Vert Q))), where D(P\Vert Q) is the Kullback–Leibler divergence between P and Q.
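
As a concrete reference point, here is a minimal R sketch of the importance sampling estimator itself; the dimension and sample size are illustrative assumptions.

# Importance sampling estimate of v_n with proposal Q = N(0, I_n / n).
set.seed(1)
n <- 20; M <- 1000
X <- matrix(rnorm(n * M, sd = 1 / sqrt(n)), nrow = M)      # M draws from Q
log_q <- rowSums(dnorm(X, sd = 1 / sqrt(n), log = TRUE))   # log density of Q at each draw
inside <- rowSums(X^2) <= 1                                # indicator of the n-ball
est <- mean(inside * exp(-log_q))                          # average of 1{X in ball} / q(X)
c(estimate = est, exact = pi^(n / 2) / gamma(n / 2 + 1))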

To use this theorem we need to compute D(P \Vert Q). The Kullback–Leibler divergence is defined as an integral. Specifically,

\displaystyle{D(P\Vert Q) = \int_{\mathbb{R}^n} \log\frac{P(x)}{Q(x)}P(x)dx}

Computing the high-dimensional integral above looks challenging. Fortunately, it can be reduced to a one-dimensional integral. This is because both of the distributions P and Q are rotationally symmetric. To use this, define P_R and Q_R to be the distributions of the norm squared under P and Q. That is, if X \sim P, then \Vert X \Vert_2^2 \sim P_R, and likewise for Q_R. By the rotational symmetry of P and Q we have

D(P\Vert Q) = D(P_R \Vert Q_R).

We can work out both P_R and Q_R. The distribution P is the uniform distribution on the n-dimensional ball. And so for X \sim P and any r \in [0,1]

\mathbb{P}(\Vert X \Vert_2^2 \le r) = \mathbb{P}(\Vert X \Vert_2 \le \sqrt{r}) = \frac{v_n (\sqrt{r})^n}{v_n} = r^{n/2}.

Which implies that P_R has density P_R(r)=\frac{n}{2}r^{n/2-1}. This means that P_R is a Beta distribution with parameters \alpha = n/2, \beta = 1. The distribution Q is a multivariate Gaussian distribution with mean 0 and covariance \frac{1}{n}I_n. This means that if X \sim Q, then \Vert X \Vert_2^2 = \sum_{i=1}^n X_i^2 is a scaled chi-squared variable with n degrees of freedom and scale parameter 1/n. The density of Q_R is therefore

Q_R(r) = \frac{n^{n/2}}{2^{n/2}\Gamma(n/2)}r^{n/2-1}e^{-nr/2}

The Kullback–Leibler divergence between P and Q is therefore

\displaystyle{D(P\Vert Q)=D(P_R\Vert Q_R) = \int_0^1 \log \frac{P_R(r)}{Q_R(r)} P_R(r)dr}

Evaluating the above integral gives

D(P \Vert Q) = \log\frac{n}{2} - \frac{n}{2}\log\frac{n}{2} + \log \Gamma\left(\frac{n}{2}\right) + \frac{n^2}{2(n+2)}.

Using the expansion \log \Gamma(z) = \left(z-\frac{1}{2}\right)\log z - z + O(1), we get that for large n

D(P \Vert Q) = \frac{1}{2}\log n + O(1).

And so the required number of samples is O(\exp(D(P \Vert Q))) = O(\sqrt{n}).

[1] Chatterjee, Sourav, and Persi Diaconis. “The sample size required in importance sampling.” The Annals of Applied Probability 28, no. 2 (2018): 1099–1135. https://www.jstor.org/stable/26542331. (Public preprint: https://arxiv.org/abs/1511.01437)

Auxiliary variables sampler

The auxiliary variables sampler is a general Markov chain Monte Carlo (MCMC) technique for sampling from probability distributions with unknown normalizing constants [1, Section 3.1]. Specifically, suppose we have n functions f_i : \mathcal{X} \to (0,\infty) and we want to sample from the probability distribution P(x) \propto \prod_{i=1}^n f_i(x). That is

\displaystyle{ P(x) = \frac{1}{Z}\prod_{i=1}^n f_i(x)},

where Z = \sum_{x \in \mathcal{X}} \prod_{i=1}^n f_i(x) is a normalizing constant. If the set \mathcal{X} is very large, then it may be difficult to compute Z or sample from P(x). To approximately sample from P(x) we can run an ergodic Markov chain with P(x) as a stationary distribution. Adding auxiliary variables is one way to create such a Markov chain. For each i \in \{1,\ldots,n\}, we add an auxiliary variable U_i \ge 0 such that

\displaystyle{P(U \mid X) = \prod_{i=1}^n \mathrm{Unif}[0,f_i(X)]}.

That is, conditional on X, the auxiliary variables U_1,\ldots,U_n are independent and U_i is uniformly distributed on the interval [0,f_i(X)]. If X is distributed according to P and U_1,\ldots,U_n have the above auxiliary variable distribution, then

\displaystyle{P(X,U) =P(X)P(U\mid X)\propto  \prod_{i=1}^n f_i(X) \frac{1}{f_i(X)} I[U_i \le f_i(X)] = \prod_{i=1}^n I[U_i \le f_i(X)]}.

This means that the joint distribution of (X,U) is uniform on the set

\widetilde{\mathcal{X}} = \{(x,u) \in \mathcal{X} \times (0,\infty)^n : u_i \le f_i(x) \text{ for all } i\}.

Put another way, suppose we could jointly sample (X,U) from the uniform distribution on \widetilde{\mathcal{X}}. Then, the above calculation shows that if we discard U and only keep X, then X will be sampled from our target distribution P.

The auxiliary variables sampler approximately samples from the uniform distribution on \widetilde{\mathcal{X}} by using a Gibbs sampler. The Gibbs sampler alternates between sampling U' from P(U \mid X) and then sampling X' from P(X \mid U'). Since the joint distribution of (X,U) is uniform on \widetilde{\mathcal{X}}, the conditional distributions P(X \mid U) and P(U \mid X) are also uniform. The auxiliary variables sampler thus transitions from X to X' according to the two-step process

  1. Independently sample U_i uniformly from [0,f_i(X)].
  2. Sample X' uniformly from the set \{x \in \mathcal{X} : f_i(x) \ge U_i \text{ for all } i\}, as in the sketch below.
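
For a concrete, if simple, illustration, here is an R sketch of the sampler when n = 1, \mathcal{X} = \mathbb{R} and f_1(x) = \exp(-x^2/2), so that the target P is the standard normal. This example is my own illustrative choice, not one from the references; its appeal is that the set in step 2 is an explicit interval.

# Auxiliary variables sampler targeting the standard normal via f(x) = exp(-x^2 / 2).
set.seed(1)
steps <- 10000
x <- numeric(steps)
for (t in 2:steps) {
  u <- runif(1, min = 0, max = exp(-x[t - 1]^2 / 2))  # step 1: U | X
  w <- sqrt(-2 * log(u))                              # the set {x : f(x) >= u} is [-w, w]
  x[t] <- runif(1, min = -w, max = w)                 # step 2: X' | U
}
c(mean = mean(x), sd = sd(x))                         # should be close to 0 and 1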

Since the auxiliary variables sampler is based on a Gibbs sampler, we know that the joint distribution of (X,U) will converge to the uniform distribution on \widetilde{\mathcal{X}}. So when we discard U, the distribution of X will converge to the target distribution P!

Auxiliary variables in practice

To perform step 2 of the auxiliary variables sampler we have to be able to sample uniformly from the sets

\mathcal{X}_u = \{x \in \mathcal{X}:f_i(x) \ge u_i \text{ for all } i\}.

Depending on the nature of the set \mathcal{X} and the functions f_i, it might be difficult to do this. Fortunately, there are some notable examples where this step has been worked out. The very first example of auxiliary variables is the Swendsen-Wang algorithm for sampling from the Ising model [2]. In this model it is possible to sample uniformly from \mathcal{X}_u. Another setting where we can sample exactly is when \mathcal{X} is the real numbers \mathbb{R} and each f_i is an increasing function of x. This is explored in [3], where auxiliary variables are applied to sampling from Bayesian posteriors.

There is an alternative to sampling exactly from the uniform distribution on \mathcal{X}_u. Instead of sampling X' uniformly from \mathcal{X}_u, we can run a Markov chain from the old X that has the uniform distribution as a stationary distribution. This approach leads to another special case of auxiliary variables which is called “slice sampling” [4].

References

[1] Andersen HC, Diaconis P. Hit and run as a unifying device. Journal de la societe francaise de statistique & revue de statistique appliquee. 2007;148(4):5-28. http://www.numdam.org/item/JSFS_2007__148_4_5_0/
[2] Swendsen RH, Wang JS. Nonuniversal critical dynamics in Monte Carlo simulations. Physical review letters. 1987 Jan 12;58(2):86. https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.58.86
[3] Damlen P, Wakefield J, Walker S. Gibbs sampling for Bayesian non‐conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1999 Apr;61(2):331-44. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00179
[4] Neal RM. Slice sampling. The annals of statistics. 2003 Jun;31(3):705-67. https://projecteuclid.org/journals/annals-of-statistics/volume-31/issue-3/Slice-sampling/10.1214/aos/1056562461.full

Log-sum-exp trick with negative numbers

Suppose you want to calculate an expression of the form

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right)},

where \sum_{i=1}^n \exp(a_i) > \sum_{j=1}^m \exp(b_j). Such expressions can be difficult to evaluate directly since the exponentials can easily cause overflow errors. In this post, I’ll talk about a clever way to avoid such errors.

If there were no terms in the second sum we could use the log-sum-exp trick. That is, to calculate

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) \right)},

we set a=\max\{a_i : 1 \le i \le n\} and use the identity

\displaystyle{a + \log\left(\sum_{i=1}^n \exp(a_i-a)\right) = \log\left(\sum_{i=1}^n \exp(a_i)\right)}.

Since a_i \le a for all i =1,\ldots,n, the left hand side of the above equation can be computed without the risk of overflow. To calculate,

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right)},

we can use the above method to separately calculate

\displaystyle{A = \log\left(\sum_{i=1}^n \exp(a_i) \right)} and \displaystyle{B = \log\left(\sum_{j=1}^m \exp(b_j)\right)}.

The final result we want is

\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j) \right) = \log(\exp(A) - \exp(B)) = A + \log(1-\exp(B-A))}.

Since A > B, the right hand side of the above expression can be evaluated safely and we will have our final answer.

R code

The R code below defines a function that performs the above procedure

# Safely compute log(sum(exp(pos)) - sum(exp(neg)))
# The default value for neg is an empty vector.

logSumExp <- function(pos, neg = c()){
  max_pos <- max(pos)
  A <- max_pos + log(sum(exp(pos - max_pos)))
  
  # If neg is empty, the calculation is done
  if (length(neg) == 0){
    return(A)
  }
  
  # If neg is non-empty, calculate B.
  max_neg <- max(neg)
  B <- max_neg + log(sum(exp(neg - max_neg)))
  
  # Check that A is bigger than B
  if (A <= B) {
    stop("sum(exp(pos)) must be larger than sum(exp(neg))")
  }
  
  # log1p() is a built in function that accurately calculates log(1+x) for |x| << 1
  return(A + log1p(-exp(B - A)))
}

An example

The above procedure can be used to evaluate

\displaystyle{\log\left(\sum_{i=1}^{200} i! - \sum_{i=1}^{500} \binom{500}{i}^2\right)}.

Evaluating this directly would quickly lead to errors since R (and most other programming languages) cannot compute 200!. However, R has the functions lfactorial() and lchoose() which can compute \log(i!) and \log\binom{n}{i} for large values of i and n. We can thus put this expression in the general form at the start of this post

\displaystyle{\log\left(\sum_{i=1}^{200} \exp(\log(i!)) - \sum_{i=1}^{500} \exp\left(2\log\binom{500}{i}\right)\right)}.

The following R code thus gives us exactly what we want:

pos <- sapply(1:200, lfactorial)
neg <- sapply(1:500, function(i){2*lchoose(500, i)})

logSumExp(pos, neg)
# 863.237

The beta-binomial distribution

The beta-binomial model is a Bayesian model used to analyze rates. For a great derivation and explanation of this model, I highly recommend watching the second lecture from Richard McElreath’s course Statistical Rethinking. In this model, the data X is assumed to be binomially distributed with a fixed number of trials N but an unknown rate \rho \in [0,1]. The rate \rho is given a \text{Beta}(a,b) prior. That is, the prior distribution of \rho has the density

p(\rho) = \frac{1}{B(a,b)} \rho^{a-1}(1-\rho)^{b-1},

where B(a,b) =\int_0^1 \rho^{a-1}(1-\rho)^{b-1}d\rho is a normalizing constant. The model can thus be written as

\rho \sim \text{Beta}(a,b),
X | \rho \sim \text{Binom}(N,\rho).

This is a conjugate model, meaning that the posterior distribution of \rho is again a beta distribution. This can be seen by using Bayes’ rule:

p(\rho | X) \propto p(X| \rho)p(\rho) \propto \rho^X(1-\rho)^{N-X}\rho^{a-1}(1-\rho)^{b-1}=\rho^{X+a-1}(1-\rho)^{(N-X)+b-1}.

The last expression is proportional to a beta density; specifically, \rho | X \sim \text{Beta}(X+a, N-X+b).
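
Since the posterior is available in closed form, the update is a one-liner. Here is a small R sketch with illustrative numbers:

# Conjugate update: prior Beta(a, b), X successes out of N trials.
a <- 2; b <- 2; N <- 20; X <- 14
curve(dbeta(x, X + a, N - X + b), from = 0, to = 1,
      xlab = "rho", ylab = "posterior density")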

The marginal distribution of X

In the above model we are given the distribution of \rho and the conditional distribution of X|\rho. To calculate the distribution of X, we thus need to marginalize over \rho. Specifically,

\displaystyle{p(X) = \int_0^1 p(X,\rho)d\rho = \int_0^1 p(X| \rho)p(\rho)d\rho.}

The term inside the above integral is

\displaystyle{p(X| \rho)p(\rho) = \binom{N}{X}\rho^X(1-\rho)^{N-X}\frac{1}{B(a,b)}\rho^{a-1}(1-\rho)^{b-1} = \frac{\binom{N}{X}}{B(a,b)}\rho^{X+a-1}(1-\rho)^{N-X+b-1} }.

Thus,

\displaystyle{p(X) = \frac{\binom{N}{X}}{B(a,b)} \int_0^1 \rho^{X+a-1}(1-\rho)^{N-X+b-1}d\rho = \binom{N}{X}\frac{B(X+a, N-X+b)}{B(a,b)}}.
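
Here is a short R sketch of this pmf, using lchoose() and lbeta() to stay on the log scale; the parameter values in the example call are illustrative.

# The beta-binomial pmf derived above.
dbetabinom <- function(k, N, a, b) {
  exp(lchoose(N, k) + lbeta(k + a, N - k + b) - lbeta(a, b))
}
dbetabinom(0:10, N = 10, a = 0.5, b = 0.5)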

This distribution is called the beta-binomial distribution. Below is an image from Wikipedia showing a graph of p(X) for N=10 and a number of different values of a and b. You can see that, especially for small values of a and b, the distribution is a lot more spread out than the binomial distribution. This is because there is randomness coming from both \rho and the binomial conditional distribution.

A plot of the beta-binomial distribution for different values of the parameters a and b. For small values of a and b, the distribution is very spread out.

Two sample tests as correlation tests

Suppose we have two samples Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.

The equal variance t-test

In the equal variance t-test, we assume that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim}  \mathcal{N}(\mu_1,\sigma^2), where \sigma^2 is unknown. Our hypothesis that Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)} and Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)} are from the same distribution becomes the hypothesis \mu_0 = \mu_1. The test statistic is

t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

where \overline{Y}^{(0)} and \overline{Y}^{(1)} are the two sample means. The variable \hat{\sigma} is the pooled estimate of the standard deviation and is given by

\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right).

Under the null hypothesis, t follows the T-distribution with n_0+n_1-2 degrees of freedom. We thus reject the null \mu_0=\mu_1 when |t| exceeds the 1-\alpha/2 quantile of the T-distribution.

Pooling the data

We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1} be the pooled data and for i = 1,\ldots, n_0+n_1, define x_i \in \{0,1\} by

x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}

The assumptions that Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2) and Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2) can be rewritten as

Y_i = \beta_0+\beta_1x_i + \varepsilon_i,

where \varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2). That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis \mu_0 = \mu_1 is equivalent to \beta_1 = 0. To test \beta_1 = 0 we can use the standard t-test for a coefficient in a linear model. The test statistic in this case is

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},

where \hat{\beta}_1 is the ordinary least squares estimate of \beta_1, X \in \mathbb{R}^{(n_0+n_1)\times 2} is the design matrix and \hat{\sigma}_{OLS} is an estimate of \sigma given by

\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,

where \hat{Y}_i = \hat{\beta}_0+\hat{\beta}_1x_i is the fitted value of Y_i.

It turns out that t' is exactly equal to t. We can see this by writing out the design matrix and calculating everything above. The design matrix has rows [1,x_i] and is thus equal to

X = \begin{bmatrix} 1&x_1\\ 1&x_2\\  \vdots&\vdots\\ 1&x_{n_0}\\  1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\  \vdots&\vdots\\ 1&0\\  1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.

This implies that

X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},

And therefore,

(X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} = \frac{1}{n_0n_1}\begin{bmatrix} n_1&-n_1\\-n_1&n_0+n_1\end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .

Thus, (X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}. So,

t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},

which is starting to look like t from the two-sample test. Now

X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}.

And so

\hat{\beta} = (X^TX)^{-1}X^TY = \begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix}\begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\  n_1\overline{Y}^{(1)} \end{bmatrix}=\begin{bmatrix} \overline{Y}^{(0)}\\  \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.

Thus, \hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)} and

t' = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.

This means to show that t' = t, we only need to show that \hat{\sigma}_{OLS}^2=\hat{\sigma}^2. To do this, note that the fitted values \hat{Y} are equal to

\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.

Thus,

\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2=\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),

which is exactly \hat{\sigma}^2. Therefore, t'=t and the two sample t-test is equivalent to a correlation test.
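
The equivalence is easy to see numerically. Here is a minimal R sketch with simulated data; the sample sizes and means are illustrative.

# The two-sample t statistic equals the regression t statistic.
set.seed(1)
n0 <- 30; n1 <- 40
y0 <- rnorm(n0, mean = 0); y1 <- rnorm(n1, mean = 0.5)
t_two_sample <- t.test(y1, y0, var.equal = TRUE)$statistic
fit <- lm(y ~ x, data = data.frame(y = c(y0, y1), x = rep(c(0, 1), c(n0, n1))))
t_regression <- summary(fit)$coefficients["x", "t value"]
c(t_two_sample, t_regression)  # the two values agree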

The Friedman-Rafsky test

In the above example, we saw that the two sample t-test was a special case of the t-test for regressions. This is neat but both tests make very strong assumptions about the data. However, the same thing happens in a more interesting non-parametric setting.

In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two-sample test that makes no assumptions about the distribution of the data. The two samples do not even have to be real-valued and can instead be from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.

I plan to write a follow-up post explaining these procedures, but you can also read about them in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.

References

Persi Diaconis, Group Representations in Probability and Statistics, pp. 104-106, Hayward, CA: Institute of Mathematical Statistics (1988).

Jerome H. Friedman, Lawrence C. Rafsky, “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics 7(4), 697-717 (July 1979).

Jerome H. Friedman, Lawrence C. Rafsky, “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics 11(2), 377-391 (June 1983).

MCMC for hypothesis testing

A Monte Carlo significance test of the null hypothesis X_0 \sim H requires creating independent samples X_1,\ldots,X_B \sim H. The idea is that if X_0 \sim H and independently X_1,\ldots,X_B are i.i.d. from H, then for any test statistic T, the rank of T(X_0) among T(X_0), T(X_1),\ldots, T(X_B) is uniformly distributed on \{1,\ldots, B+1\}. This means that if T(X_0) is one of the k largest values of T(X_b), then we can reject the hypothesis X_0 \sim H at the significance level k/(B+1).

The advantage of Monte Carlo significance tests is that we do not need an analytic expression for the distribution of T(X_0) under X_0 \sim H. By generating the i.i.d. samples X_1,\ldots,X_B, we are making an empirical distribution that approximates the theoretical distribution. However, sometimes sampling X_1,\ldots, X_B is just as intractable as theoretically studying the distribution of T(X_0). Often approximate samples based on Markov chain Monte Carlo (MCMC) are used instead. However, these samples are not independent and may not be sampling from the true distribution. This means that a test using MCMC may not be statistically valid.

In the 1989 paper Generalized Monte Carlo significance tests, Besag and Clifford propose two methods that solve this exact problem. Their two methods can be used in the same settings where MCMC is used but they are statistically valid and correctly control the Type 1 error. In this post, I will describe just one of the methods – the serial test.

Background on Markov chains

To describe the serial test we will need to introduce some notation. Let P denote a transition matrix for a Markov chain on a discrete state space \mathcal{X}. A Markov chain with transition matrix P thus satisfies,

\mathbb{P}(X_n = x_n \mid X_0 = x_0,\ldots, X_{n-1} = x_{n-1}) = P(x_{n-1}, x_n)

Suppose that the transition matrix P has a stationary distribution \pi. This means that if (X_0, X_1,\ldots) is a Markov chain with transition matrix P and X_0 is distributed according to \pi, then X_1 is also distributed according to \pi. This implies that all X_i are distributed according to \pi.

We can construct a new transition matrix Q from P and \pi by Q(y,x) = P(x,y) \frac{\pi(x)}{\pi(y)}. The transition matrix Q is called the reversal of P. This is because for all x and y in \mathcal{X}, \pi(x)P(x,y) = \pi(y)Q(y,x). That is, the chance of drawing x from \pi and then transitioning to y according to P is equal to the chance of drawing y from \pi and then transitioning to x according to Q.

The new transition matrix Q also allows us to reverse longer runs of the Markov chain. Fix n \ge 0 and let (X_0,X_1,\ldots,X_n) be a Markov chain with transition matrix P and initial distribution \pi. Also let (Y_0,Y_1,\ldots,Y_n) be a Markov chain with transition matrix Q and initial distribution \pi, then

(X_0,X_1,\ldots,X_{n-1}, X_n) \stackrel{dist}{=} (Y_n,Y_{n-1},\ldots, Y_1,Y_0) ,

where \stackrel{dist}{=} means equal in distribution.

The serial test

Suppose we want to test the hypothesis x \sim \pi where x \in \mathcal{X} is our observed data and \pi is some distribution on \mathcal{X}. To conduct the serial test, we need to construct a Markov chain P for which \pi is a stationary distribution. We then also need to construct the reversal Q described above. There are many possible ways to construct P such as the Metropolis-Hastings algorithm.

We also need a test statistic T. This is a function T:\mathcal{X} \to \mathbb{R} which we will use to detect outliers. This function is the same function we would use in a regular Monte Carlo test. Namely, we want to reject the null hypothesis when T(x) is much larger than we would expect under x \sim \pi.

The serial test then proceeds as follows. First we pick \xi uniformly in \{0,1,\ldots,B\} and set X_\xi = x. We then generate (X_\xi, X_{\xi+1},\ldots,X_B) as a Markov chain with transition matrix P that starts at X_\xi = x. Likewise, we generate (X_\xi, X_{\xi-1},\ldots, X_0) as a Markov chain with transition matrix Q that starts at X_\xi = x.

We then apply T to each of (X_0,X_1,\ldots,X_\xi,\ldots,X_B) and count the number of b \in \{0,1,\ldots,B\} such that T(X_b) \ge T(X_\xi) = T(x). If there are k such b, then the reported p-value of our test is k/(B+1).
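
Here is a minimal R sketch of the serial test. The choices below are my own illustrative assumptions: the target \pi is the standard normal, the chain is a random-walk Metropolis chain (which is reversible, so Q = P), and the test statistic is T(x) = |x|.

# Besag-Clifford serial test with a reversible chain, so that Q = P.
serial_test <- function(x, B = 99) {
  mh_step <- function(cur) {  # one random-walk Metropolis step targeting N(0, 1)
    prop <- cur + rnorm(1)
    if (log(runif(1)) < dnorm(prop, log = TRUE) - dnorm(cur, log = TRUE)) prop else cur
  }
  xi <- sample(0:B, 1)        # uniform position of the observed data in the chain
  chain <- numeric(B + 1)
  chain[xi + 1] <- x
  if (xi < B) for (i in (xi + 1):B) chain[i + 1] <- mh_step(chain[i])      # forward run with P
  if (xi > 0) for (i in (xi - 1):0) chain[i + 1] <- mh_step(chain[i + 2])  # backward run with Q = P
  T_vals <- abs(chain)                     # test statistic T(x) = |x|
  sum(T_vals >= T_vals[xi + 1]) / (B + 1)  # reported p-value
}
serial_test(x = 3.5)  # a large |x| should give a small p-value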

We will now show that this test produces a valid p-value. That is, when x \sim \pi, the probability that k/(B+1) is less than \alpha is at most \alpha. In symbols,

\mathbb{P}\left( \frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) \le \alpha

Under the null hypothesis x \sim \pi, (X_\xi, X_{\xi-1},\ldots, X_0) is equal in distribution to generating X_0 from \pi and using the transition matrix P to go from X_i to X_{i+1}.

Thus, under the null hypothesis, the distribution of (X_0,X_1,\ldots,X_B) does not depend on \xi. The entire procedure is equivalent to generating a Markov chain \mathbf{X} = (X_0,X_1,\ldots,X_B) with initial distribution \pi and transition matrix P, and then choosing \xi \in \{0,1,\ldots,B\} independently of \mathbf{X}. This is enough to show that the serial method produces valid p-values. The idea is that since the distribution of \mathbf{X} does not depend on \xi and \xi is uniformly distributed on \{0,1,\ldots,B\}, the probability that T(X_\xi) is in the top \alpha proportion of T(X_b) should be at most \alpha. This is proved more formally below.

For each i \in \{0,1,\ldots,B\}, let A_i be the event that T(X_i) is in the top \alpha proportion of T(X_0),\ldots,T(X_B). That is,

A_i = \left\{ \frac{\# \{b : T(X_b) \ge T(X_i)\} }{B+1} \le \alpha \right\}.

Let I_{A_i} be the indicator function for the event A_i. Since at most \alpha(B+1) values of i can be in the top \alpha fraction of T(X_0),\ldots,T(X_B), we have that

\sum_{i=0}^B I_{A_i} \le \alpha(B+1).

Therefore, by linearity of expectation,

\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha(B+1)

By the law of total probability we have,

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \sum_{i=0}^B \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right)\mathbb{P}(\xi = i).

Since \xi is uniform on \{0,\ldots,B\}, \mathbb{P}(\xi = i) is \frac{1}{B+1} for all i. Furthermore, by independence of \xi and \mathbf{X}, we have

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_i)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}(A_i).

Thus, by our previous bound,

\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \frac{1}{B+1}\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha.

Applications

In the original paper by Besag and Clifford, the authors discuss how this procedure can be used to perform goodness-of-fit tests. They construct Markov chains that can test the Rasch model or the Ising model. More broadly, the method can be used to perform goodness-of-fit tests for any exponential family by using the Markov chains developed by Diaconis and Sturmfels.

A similar method has also been applied more recently to detect gerrymandering. In this setting, the null hypothesis is the uniform distribution on all valid redistrictings of a state and the test statistic T measures the political advantage of a given districting.

References

  1. Besag, Julian, and Peter Clifford. “Generalized Monte Carlo Significance Tests.” Biometrika 76, no. 4 (1989)
  2. Persi Diaconis, Bernd Sturmfels “Algebraic algorithms for sampling from conditional distributions,” The Annals of Statistics, Ann. Statist. 26(1), 363-397, (1998)
  3. Chikina, Maria et al. “Assessing significance in a Markov chain without mixing.” Proceedings of the National Academy of Sciences of the United States of America vol. 114,11 (2017)