Poisson approximations to the negative binomial distribution

This post is an introduction to the negative binomial distribution and a discussion of different ways of approximating the negative binomial distribution.

The negative binomial distribution describes the number of times a coin lands on tails before a certain number of heads are recorded. The distribution depends on two parameters p and r. The parameter p is the probability that the coin lands on heads and r is the number of heads. If X has the negative binomial distribution, then X = x means that in the first x+r-1 tosses of the coin there were r-1 heads (and hence x tails), and that toss number x+r is a head. This means that the probability that X=x is given by

\displaystyle{f(x) = \binom{x+r-1}{r-1}p^{r}\left(1-p\right)^x}

Here is a plot of the function f(x) for different values of r and p.
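
The formula for f(x) uses the same "number of failures" convention as scipy.stats.nbinom, so it can be checked numerically (a quick sketch, assuming SciPy is installed):

import numpy as np
from scipy.special import comb
from scipy.stats import nbinom

r, p = 5, 0.8
x = np.arange(20)

f = comb(x + r - 1, r - 1) * p**r * (1 - p)**x  # the formula above
print(np.allclose(f, nbinom.pmf(x, r, p)))      # True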

Poisson approximations

When the parameter r is large and p is close to one, the negative binomial distribution can be approximated by a Poisson distribution. More formally, suppose that r(1-p)=\lambda for some positive real number \lambda. As r \to \infty with r(1-p)=\lambda held fixed, the negative binomial random variable with parameters p and r converges in distribution to a Poisson random variable with parameter \lambda. This is illustrated in the picture below where three negative binomial distributions with r(1-p)=5 approach the Poisson distribution with \lambda =5.

Total variation distance is a common way to measure the distance between two discrete probability distributions. The log-log plot below shows that the error from the Poisson approximation is on the order of 1/r and that the error is bigger if the limiting value of r(1-p) is larger.

It turns out that it is possible to get a more accurate approximation by using a different Poisson distribution. In the first approximation, we used a Poisson random variable with mean \lambda = r(1-p). However, the mean of the negative binomial distribution is r(1-p)/p. This suggests that we can get a better approximation by setting \lambda = r(1-p)/p.

The change from \lambda = r(1-p) to \lambda = r(1-p)/p is small because p \approx 1. However, this small change gives a much better approximation, especially for larger values of r(1-p). The below plot shows that both approximations have errors on the order of 1/r, but the constant for the second approximation is much better.
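
Here is how such a comparison can be computed (a sketch assuming SciPy; tv_distance is a hypothetical helper and the truncation point is an arbitrary but safe choice):

import numpy as np
from scipy.stats import nbinom, poisson

def tv_distance(r, p, lam, k_max=1_000):
    # total variation distance between NegBin(r, p) and Poisson(lam),
    # truncated at k_max (the tails beyond k_max are negligible here)
    k = np.arange(k_max)
    return 0.5 * np.sum(np.abs(nbinom.pmf(k, r, p) - poisson.pmf(k, lam)))

r = 100
p = 1 - 5 / r                               # so that r(1 - p) = 5
print(tv_distance(r, p, r * (1 - p)))       # first approximation
print(tv_distance(r, p, r * (1 - p) / p))   # second approximation, noticeably smaller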

Second order accurate approximation

It is possible to further improve the Poisson approximation by using a Gram–Charlier expansion. A Gram–Charlier approximation for the Poisson distribution is given in this paper.1 The approximation is

\displaystyle{f_{GC}(x) = P_\lambda(x) - \frac{1}{2}(1-p)\left((x-\lambda)P_\lambda(x)-(x-1-\lambda)P_\lambda(x-1)\right)},

where \lambda = \frac{r(1-p)}{p} as in the second Poisson approximation and P_\lambda(x) is the Poisson pmf evaluated at x.
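
Here is a direct implementation of the approximation (a sketch assuming SciPy; gram_charlier_pmf is my name for the helper):

import numpy as np
from scipy.stats import nbinom, poisson

def gram_charlier_pmf(x, r, p):
    lam = r * (1 - p) / p
    P0 = poisson.pmf(x, lam)
    P1 = poisson.pmf(x - 1, lam)  # scipy returns 0 at x - 1 = -1
    return P0 - 0.5 * (1 - p) * ((x - lam) * P0 - (x - 1 - lam) * P1)

r, p = 50, 1 - 5 / 50
x = np.arange(30)
# maximum error of the Gram–Charlier approximation over this range
print(np.max(np.abs(nbinom.pmf(x, r, p) - gram_charlier_pmf(x, r, p))))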

The Gram–Charlier expansion is considerably more accurate than either Poisson approximation. The errors are on the order of 1/r^2. This higher accuracy means that the error curves for the Gram–Charlier expansion have a steeper slope.

  1. The approximation is given in equation (4) of the paper and is stated in terms of the CDF instead of the PMF. The equation also contains a small typo: it should say \frac{1}{2}q instead of \frac{1}{2}p. ↩︎

Jigsaw puzzles and the isoperimetric inequality

The 2025 World Jigsaw Puzzle Championship recently took place in Spain. In the individual heats, competitors had to choose their puzzle. Each competitor could either complete a 500-piece rectangular puzzle or a 500-piece circular puzzle. For example, in Round F, puzzlers could choose one of the following two puzzles:

There are many things to consider when picking a puzzle. First, the puzzles had already been released. This meant a puzzler could get an advantage by choosing to do a puzzle they had already completed. The image on the puzzle is also important, as is the puzzle’s style. In the example above, most puzzlers picked the circular puzzle because it had a clear colour gradient and distinct sections.

The commentators mentioned that puzzlers might also consider the number of edge pieces. The circular puzzles have fewer edge pieces than the rectangular puzzles. This could make the rectangular puzzle easier because some puzzlers like to do the edge first and then build inwards.

The difference in edge pieces isn’t a coincidence. Circular puzzles have fewer edge pieces than any other 500-piece puzzle. This fact is a consequence of a piece of mathematics called the isoperimetric inequality.

The isoperimetric inequality

The isoperimetric inequality is a statement about 2D shapes. The word ‘isoperimetric’ literally translates to ‘having the same perimeter.’ The isoperimetric inequality says that if a shape has the same perimeter as a particular circle, then the circle has a larger (or equal) area. For example, the square and circle below have the same perimeter but the circle has a larger area:

The isoperimetric inequality can also be flipped around to give a statement about shapes with the same area. This ‘having the same area’ inequality states that if a shape has the same area as a circle, then the circle has a smaller (or equal) perimeter. For example, the square and circle below have the same area but the circle has a smaller perimeter:

The ‘having the same area’ inequality applies to the jigsaw puzzles above. In a jigsaw puzzle, the area is proportional to the number of pieces and the perimeter is proportional to the number of edge pieces (assuming the pieces are all roughly the same size and shape). This means that if one puzzle has the same number of pieces as a circular puzzle (for example 500), then the circular puzzle will have fewer edge pieces! This is exactly what the commentator said about the puzzles in the competition!
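
In symbols, the isoperimetric inequality says that any shape with area A and perimeter L satisfies

\displaystyle{4\pi A \le L^2},

with equality exactly when the shape is a circle. Fixing the area (the number of pieces) therefore forces the circle to have the smallest possible perimeter (the fewest edge pieces).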

I don’t think many of the puzzlers thought about the isoperimetric inequality when picking their puzzles. But if you ever do a circular puzzle, remember that you’re doing a puzzle with as few edge pieces as possible!

The Multivariate Hypergeometric Distribution

The hypergeometric distribution describes the number of white balls in a sample of n balls drawn without replacement from an urn with m_W white balls and m_B black balls. The probability of having x white balls in the sample is:

\displaystyle{f(x; m_W, m_B, n) = \frac{\binom{m_W}{x} \binom{m_B}{n-x}}{\binom{m_W+m_B}{n}}}

This is because there are \binom{m_W+m_B}{n} possible ways to sample n balls from the urn, which has m_W+m_B balls in total. Of these possibilities, there are \binom{m_W}{x} \binom{m_B}{n-x} ways of selecting exactly x white balls.

The multivariate hypergeometric distribution is a generalization of the hypergeometric distribution. In the multivariate hypergeometric distribution, there are k colours instead of 2. Suppose that the urn contains m_1 balls of the first colour, m_2 balls of the second colour and so on. The total number of balls in the urn is M = m_1+m_2+\cdots + m_k. Next we draw n balls from the urn. Let X_i be the number of balls of colour i in our sample. The vector X=(X_1,X_2,\ldots, X_k) follows the multivariate hypergeometric distribution. The vector X satisfies the constraints X_i \in \{0,1,\ldots,m_i\} and \sum_{i=1}^k X_i = n. The probability of observing X=x is

\displaystyle{f(x; m, n) = \frac{\binom{m_1}{x_1} \binom{m_2}{x_2}\cdots \binom{m_k}{x_k}}{\binom{M}{n} }, \quad x_1+x_2+\cdots+x_k = n}

This is because there are \binom{M}{n} possible ways of sampling n balls from the urn that has M balls in total. Likewise, the number of ways of choosing x_i balls of colour i is \binom{m_i}{x_i}.
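
The pmf above is easy to evaluate directly (a sketch, assuming SciPy is available; the helper name and the example counts are my own choices):

import numpy as np
from scipy.special import comb

def multivariate_hypergeometric_pmf(x, m):
    # probability of drawing x_i balls of colour i when the urn holds m_i of each colour
    x, m = np.asarray(x), np.asarray(m)
    n, M = x.sum(), m.sum()
    return np.prod(comb(m, x)) / comb(M, n)

print(multivariate_hypergeometric_pmf([1, 2, 1], [4, 5, 6]))  # urn with 15 balls, sample of 4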

Sampling

There are two common methods for sampling from the multivariate hypergeometric distribution. The first option is to sample n numbers from 1,\ldots,M without replacement. The numbers 1,\ldots,M represent the balls in the urn. We can let 1,\ldots,m_1 represent the balls with colour 1, let m_1+1,\ldots,m_1+m_2 represent the balls with colour 2 and so on. Suppose that Y_1,\ldots,Y_n is the sample of size n without replacement from 1,\ldots,M. Next define:

\displaystyle{X_i = \#\{j : m_1+m_2+\cdots+m_{i-1} < Y_j \le m_1+m_2+\cdots+m_{i-1}+m_i\}}

Then the vector X=(X_1,\ldots,X_k) will be distributed according to the multivariate hypergeometric distribution. This method of sampling has complexity on the order of n\log(M). This is because sampling a random number of size M has complexity \log(M) and we need to sample n of these numbers.
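
A sketch of this first method (assuming NumPy; the helper name and the use of 0-based labels are my own choices):

import numpy as np

def mvhg_sample_by_labels(m, n, rng=None):
    # label the balls 0, ..., M-1 so that colour i occupies a contiguous block,
    # draw n labels without replacement and count how many land in each block
    rng = np.random.default_rng() if rng is None else rng
    m = np.asarray(m)
    labels = rng.choice(m.sum(), size=n, replace=False)  # the sample Y_1, ..., Y_n
    boundaries = np.cumsum(m)                            # m_1, m_1 + m_2, ...
    colours = np.searchsorted(boundaries, labels, side='right')
    return np.bincount(colours, minlength=len(m))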

There is a second method of sampling that is more efficient when n is large and k is small. This is a sequential method that samples each X_i from a (univariate) hypergeometric distribution. The idea is that when sampling X_1 we can treat the balls of colour 1 as “white” and all the other colours as “black”. This means that marginally, X_1 is hypergeometric with parameters m_W = m_1 and m_B = m_2+\cdots + m_k and n = n. Furthermore, conditional on X_1, the second entry, X_2, is also hypergeometric. The parameters for X_2 are m_W = m_2, m_B = m_3+\cdots+m_k and n=n-X_1. In fact, all of the conditionals of X are hypergeometric random variables and so can be sampled sequentially. The python code at the end of this post implements the sequential sampler.

The complexity of the sequential method is bounded above by k times the complexity of sampling from the hypergeometric distribution. By using the ratio of uniforms distribution, this can be done with complexity on the order of \log(m_W+m_B). The overall complexity of the sequential method is therefore on the order of k\log(M), which can be much smaller than n\log(M).

import numpy as np

def multivariate_hypergeometric_sample(m, n):
    # m: the counts m_1, ..., m_k for each colour; n: the number of balls drawn
    x = np.zeros_like(m)
    k = len(m)
    for i in range(k-1):
        # conditional on the earlier entries, x[i] is hypergeometric with the
        # balls of colour i as "white" and the remaining colours as "black"
        x[i] = np.random.hypergeometric(m[i], np.sum(m[(i+1):]), n)
        n -= x[i]
    x[k-1] = n  # the last colour gets whatever is left of the sample
    return x
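
For example, a single draw of 12 balls from an urn with three colours (the counts below are arbitrary):

x = multivariate_hypergeometric_sample(np.array([10, 20, 30]), 12)
print(x, x.sum())  # the entries are random, but they always sum to 12

Recent versions of NumPy also provide a built-in sampler, numpy.random.Generator.multivariate_hypergeometric, which can be used as a cross-check.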

The maximum of geometric random variables

I was working on a problem involving the maximum of a collection of geometric random variables. To state the problem, let (X_i)_{i=1}^n be i.i.d. geometric random variables with success probability p=p(n). Next, define M_n = \max\{X_i : 1 \le i \le n\}. I wanted to know the limiting distribution of M_n as p \to 0 and n \to \infty. I was particularly interested in the case when p was on the order of 1/n.

It turns out that this problem has a very nice solution: If p\to 0 as n \to \infty, then for all real numbers c

\displaystyle{\lim_{n \to \infty}\mathbb{P}\left(M_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \exp(-e^{-c})}

That is, M_n is on the order of \frac{\log n}{p} and has fluctuations of order 1/p. Furthermore, these fluctuations are distributed according to the Gumbel distribution. Importantly, this result places no restrictions on how quickly p goes to 0.
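
As a quick numerical check (a sketch, assuming NumPy; the values of n, p and the number of repetitions are arbitrary choices), the rescaled maxima p M_n - \log n should be approximately Gumbel distributed:

import numpy as np

rng = np.random.default_rng(0)
n = 5_000
p = 1.0 / n   # success probability on the order of 1/n
reps = 1_000

# NumPy's geometric counts trials rather than failures; the shift by one
# is negligible compared to the 1/p scale of the fluctuations.
maxima = rng.geometric(p, size=(reps, n)).max(axis=1)
c_vals = p * maxima - np.log(n)

for c in (-1.0, 0.0, 1.0, 2.0):
    print(c, np.mean(c_vals <= c), np.exp(-np.exp(-c)))  # empirical vs Gumbel CDF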

In the rest of this post, I’ll derive the above identity. The first step relates M_n to the maximum of a collection of exponential random variables. The second step uses a coupling to prove that the exponential and geometric random variables are close to each other.

An easier version

When p goes to 0, the scaled geometric random variables p X_i converge to exponential random variables. Exponential random variables are nicer to work with because they have a continuous distribution.

This suggests that we could try replacing all the geometric random variables with exponential random variables. More precisely, we will replace X_i with Y_i/p where Y_i are exponentially distributed. Next, we will derive the main result for the exponential random variables. Then, in the next section, we will prove that X_i is sufficiently close to Y_i/p.

Let \widetilde{M}_n = \max\{ Y_i/p : 1 \le i \le n\} where Y_i are i.i.d and exponentially distributed. This means that for y \ge 0, \mathbb{P}(Y_i \le y)=1-e^{-y}. The condition \widetilde{M}_n \le y is equivalent to Y_i \le py for all i =1,\ldots,n. Since the Y_i's are independent, this implies that for y \ge 0

\displaystyle{\mathbb{P}(\widetilde{M}_n \le y) = \prod_{i=1}^n \mathbb{P}(Y_i \le py) = \left(1-e^{-py}\right)^n}

Now fix a real number c and let y = \frac{\log n}{p}+\frac{c}{p}. Since n \to \infty, we know that y \ge 0 for large enough n. Thus, for sufficiently large n

\displaystyle{\mathbb{P}\left(\widetilde{M}_n \le  \frac{\log n}{p}+\frac{c}{p}\right)  = \left(1-e^{-\log n - c}\right)^n = \left(1-\frac{e^{-c}}{n}\right)^n\to \exp(-e^{-c})}

And so the main result holds for \widetilde{M}_n = \max\{Y_i/p : 1 \le i \le n\}. We want to show that it holds for M_n = \max\{X_i : 1 \le i \le n\}. The next section proves a close relationship between X_i and Y_i.

Exponential and geometric coupling

It is true that as p \to 0, pX_i converges in distribution to an exponential random variable. This is a result that holds for a single random variable. For an approximation like M_n \approx \widetilde{M}_n, we need a result that works for every X_i at once. This seems tricky since it involves the convergence of an infinite collection of random variables. Fortunately, there is a simple approach based on coupling.

For each geometric random variable X_i it is possible to produce an exponential random variable Y_i with

\displaystyle{\frac{1}{\lambda} Y_i \le X_i \le \frac{1}{\lambda} Y_i + 1},

where \lambda = -\log(1-p). This is because \mathbb{P}(X_i \le x) = 1-(1-p)^{\lfloor x\rfloor} and \mathbb{P}(Y_i/\lambda \le x) = 1-e^{-\lambda x} = 1-(1-p)^{x}. This means that

\displaystyle{\mathbb{P}(Y_i/\lambda \le x-1) \le \mathbb{P}(X_i \le x)\le \mathbb{P}(Y_i/\lambda \le x) }.

If we sample both X_i and Y_i with inverse transform sampling, and use the same uniform random variables, then we will have

\displaystyle{ \frac{1}{\lambda} Y_i \le X_i \le \frac{1}{\lambda} Y_i + 1},

as claimed. If we use this inequality for every single pair (X_i,Y_i) we will get

\displaystyle{\max\{Y_i/ \lambda\} \le M_n \le \max\{Y_i/\lambda\} +1}

As p \to 0, we have that \lambda \to 0 and \frac{p}{\lambda}=\frac{p}{-\log(1-p)} \to 1. This means that the random variable \max\{Y_i/\lambda\} has the same limit as \widetilde{M}_n = \max\{ Y_i/p : 1 \le i \le n\}. Furthermore, \max\{Y_i/\lambda\} +1 also has the same limit as \widetilde{M}_n. Since M_n is “sandwiched” between \max\{Y_i/\lambda\} and \max\{Y_i/\lambda\} +1, the result must also hold for M_n. And so

\displaystyle{\lim_{n \to \infty}\mathbb{P}\left(M_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \exp(-e^{-c})}

“Uniformly random”

The term “uniformly random” sounds like a contradiction. How can the word “uniform” be used to describe anything that’s random? Uniformly random actually has a precise meaning, and, in a sense, means “as random as possible.” I’ll explain this with an example about shuffling cards.

Shuffling cards

Suppose I have a deck of ten cards labeled 1 through 10. Initially, the cards are face down and in perfect order. The card labeled 10 is on top of the deck. The card labeled 9 is second from the top, and so on down to the card labeled 1. The cards are definitely not random.

Next, I generate a random number between 1 and 10. I then find the card with the corresponding label and put it face down on top of the deck. The cards are now somewhat random. The number on top could be anything, but the rest of the cards are still in order. The cards are random but they are not uniformly random.

Now suppose that I keep generating random numbers and moving cards to the top of the deck. Each time I do this, the cards get more random. Eventually (after about 30 moves1) the cards will be really jumbled up. Even if you knew the first few cards, it would be hard to predict the order of the remaining ones. Once the cards are really shuffled, they are uniformly random.

Uniformly random

A deck of cards is uniformly random if every possible arrangement of the cards is equally likely. After only moving one card, the deck of cards is not uniformly random. This is because only 10 arrangements of the deck are possible. Once the deck is well-shuffled, all of the 3,628,800 possible arrangements are equally likely.

In general, something is uniformly random if each possibility is equally likely. So the outcome of rolling a fair 6-sided die is uniformly random, but rolling a loaded die is not. The word “uniform” refers to the chance of each possibility (1/6 for each side of the die). These chances are all the same and “uniform”.

This is why uniformly random can mean “as random as possible.” If you are using a fair die or a well-shuffled deck, there are no biases in the outcome. Mathematically, this means you can’t predict the outcome.

Communicating research

The inspiration for this post came from a conversation I had earlier in the week. I was telling someone about my research. As an example, I talked about how long it takes for a deck of cards to become uniformly random. They quickly stopped me and asked how the two words could ever go together. It was a good point! I use the words uniformly random all the time and had never realized this contradiction.2 It was a good reminder about the challenge of clear communication.

Footnotes

  1. The number of moves it takes for the deck to be well-shuffled is actually random. But on average it takes around 30 moves. For the mathematical details, see Example 1 in Shuffling Cards and Stopping Times by David Aldous and Persi Diaconis. ↩︎
  2. Of the six posts I published last year, five contain the word uniform! ↩︎

Understanding the Ratio of Uniforms Distribution

The ratio of uniforms distribution is a useful distribution for rejection sampling. It gives a simple and fast way to sample from discrete distributions like the hypergeometric distribution1. To use the ratio of uniforms distribution in rejection sampling, we need to know the distribution’s density. This post summarizes some properties of the ratio of uniforms distribution and computes its density.

The ratio of uniforms distribution is the distribution of the ratio of two independent uniform random variables. Specifically, suppose U \in [-1,1] and V \in [0,1] are independent and uniformly distributed. Then R = U/V has the ratio of uniforms distribution. The plot below shows a histogram based on 10,000 samples from the ratio of uniforms distribution2.

The histogram has a flat section in the middle and then curves down on either side. This distinctive shape is called a “table mountain”. The density of R also has a table mountain shape.

And here is the density plotted on top of the histogram.

A formula for the density of R is

\displaystyle{h(R) = \begin{cases} \frac{1}{4} & \text{if } -1 \le R \le 1, \\\frac{1}{4R^2} & \text{if } R < -1 \text{ or } R > 1.\end{cases}}

The first case in the definition of h corresponds to the flat part of the table mountain. The second case corresponds to the sloping curves. The rest of this post uses geometry to derive the above formula for h(R).
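
Before the derivation, here is a quick numerical sanity check of the formula (a sketch, assuming NumPy is available): under the density above, the probability that R lies in [-1,1] is 1/2, and the probability that R > 2 is 1/8.

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, size=1_000_000)
v = rng.uniform(0, 1, size=1_000_000)
r = u / v

print(np.mean(np.abs(r) <= 1))  # close to 1/2, the integral of 1/4 over [-1, 1]
print(np.mean(r > 2))           # close to 1/8, the integral of 1/(4R^2) from 2 to infinity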

Calculating the density

The point (U,V) is uniformly distributed in the box B=[-1,1] \times [0,1]. The image below shows an example of a point (U,V) inside the box B.

We can compute the ratio R = U/V geometrically. First we draw a straight line that starts at (0,0) and goes through (U,V). This line will hit the horizontal line y=1. The x coordinate at this point is exactly R=U/V.

In the above picture, all of the points on the dashed line map to the same value of R. We can compute the density of R by computing an area. The probability that R is in a small interval [R,R+dR] is

\displaystyle{\frac{\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\})}{\text{Area}(B)} = \frac{1}{2}\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}).}

If we can compute the above area, then we will know the density of R because by definition

\displaystyle{h(R) = \lim_{dR \to 0} \frac{1}{2dR}\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\})}.

We will first work on the case when R is between -1 and 1. In this case, the set \{(u,v) \in B : u/v \in [R, R+dR]\} is a triangle. This triangle is drawn in blue below.

The horizontal edge of this triangle has length dR. The perpendicular height of the triangle from the horizontal edge is 1. This means that

\displaystyle{\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}) =\frac{1}{2}\times dR \times 1=\frac{dR}{2}}.

And so, when R \in [-1,1] we have

\displaystyle{h(R) = \lim_{dR\to 0} \frac{1}{2dR}\times \frac{dR}{2}=\frac{1}{4}}.

Now let’s work on the case when R is bigger than 1 or less than -1. In this case, the set \{(u,v) \in B : u/v \in [R, R+dR]\} is again a triangle. But now the triangle has a vertical edge and is much skinnier. Below the triangle is drawn in red. Note that only points inside the box B are coloured in.

The vertical edge of the triangle has length \frac{1}{R} - \frac{1}{R+dR}= \frac{dR}{R(R+dR)}. The perpendicular height of the triangle from the vertical edge is 1. Putting this together

\displaystyle{\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}) =\frac{1}{2}\times \frac{dR}{R(R+dR)} \times 1=\frac{dR}{2R(R+dR)}}.

And so

\displaystyle{h(R) = \lim_{dR \to 0} \frac{1}{2dR} \times \frac{dR}{2 R(R+dR)} = \frac{1}{4R^2}}.

And so putting everything together

\displaystyle{h(R) = \begin{cases} \frac{1}{4} & \text{if } -1 \le R \le 1, \\\frac{1}{4R^2} & \text{if } R < -1 \text{ or } R > 1.\end{cases}}

Footnotes and references

  1. https://ieeexplore.ieee.org/document/718718 ↩︎
  2. For visual purposes, I restricted the sample to values of R between -8 and 8. This is because the ratio of uniform distribution has heavy tails. This meant that there were some very large values of R that made the plot hard to see. ↩︎

The discrete arcsine distribution

The discrete arcsine distribution is a probability distribution on \{0,1,\ldots,n\}. It is a u-shaped distribution. There are peaks at 0 and n and a dip in the middle. The figure below shows the probability distribution function for n=10,15, 20.

The probability distribution function of the arcsine distribution is given by

\displaystyle{p_n(k) = \frac{1}{2^{2n}}\binom{2k}{k}\binom{2n-2k}{n-k}\text{ for } k \in \{0,1,\ldots,n\}}

The discrete arcsine distribution is related to simple random walks and to an interesting Markov chain called the Burnside process. The connection with simple random walks is explained in Chapter 3, Volume 1 of An Introduction to Probability Theory and Its Applications by William Feller. The connection to the Burnside process was discovered by Persi Diaconis in Analysis of a Bose-Einstein Markov Chain.

The discrete arcsine distribution gets its name from the continuous arcsine distribution. Suppose X_n is distributed according to the discrete arcsine distribution with parameter n. Then the normalized random variable X_n/n converges in distribution to the continuous arcsine distribution on [0,1]. The continuous arcsine distribution has the probability density function

\displaystyle{f(x) = \frac{1}{\pi\sqrt{x(1-x)}}  \text{ for } 0 \le x \le 1}

This means that the continuous arcsine distribution is a beta distribution with \alpha=\beta=1/2. It is called the arcsine distribution because the cumulative distribution function involves the arcsine function

\displaystyle{F(x) = \int_0^x f(y)dy = \frac{2}{\pi}\arcsin(\sqrt{x}) \text{ for } 0 \le x \le 1}

There is another connection between the discrete and continuous arcsine distributions. The continuous arcsine distribution can be used to sample the discrete arcsine distribution. The two step procedure below produces a sample from the discrete arcsine distribution with parameter n:

  1. Sample p from the continuous arcsine distribution.
  2. Sample X from the binomial distribution with parameters n and p.
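
Here is a short simulation sketch of the two-step procedure above (assuming NumPy and SciPy are available; the helper name discrete_arcsine_pmf is mine), comparing the empirical frequencies with the formula for p_n(k):

import numpy as np
from scipy.special import comb

def discrete_arcsine_pmf(k, n):
    # p_n(k) = 2^{-2n} C(2k, k) C(2n - 2k, n - k)
    return comb(2 * k, k) * comb(2 * (n - k), n - k) / 4.0 ** n

rng = np.random.default_rng(0)
n, reps = 10, 200_000

p = rng.beta(0.5, 0.5, size=reps)  # step 1: continuous arcsine = Beta(1/2, 1/2)
x = rng.binomial(n, p)             # step 2: binomial with the random success probability

empirical = np.bincount(x, minlength=n + 1) / reps
exact = discrete_arcsine_pmf(np.arange(n + 1), n)
print(np.round(empirical, 3))
print(np.round(exact, 3))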

This means that the discrete arcsine distribution is actually the beta-binomial distribution with parameters \alpha = \beta =1/2. I was surprised when I was told this, and couldn’t find a reference. The rest of this blog post proves that the discrete arcsine distribution is an instance of the beta-binomial distribution.

As I showed in this post, the beta-binomial distribution has probability distribution function:

\displaystyle{q_{\alpha,\beta,n}(k) = \binom{n}{k}\frac{B(k+\alpha, n-k+\beta)}{B(\alpha,\beta)}},

where B(x,y)=\frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)} is the Beta function. To show that the discrete arcsine distribution is an instance of the beta-binomial distribution we need that p_n(k)=q_{1/2,1/2,n}(k). That is

\displaystyle{ \binom{n}{k}\frac{B(k+1/2, n-k+1/2)}{B(1/2,1/2)} = \frac{1}{2^{2n}}\binom{2k}{k}\binom{2n-2k}{n-k}},

for all k = 0,1,\ldots,n. To prove the above equation, we first simplify q_{1/2,1/2,n}(k). By definition

\displaystyle{\frac{B(k+1/2, n-k+1/2)}{B(1/2,1/2)} = \frac{\frac{\Gamma(k+1/2)\Gamma(n-k+1/2)}{\Gamma(n+1)}}{\frac{\Gamma(1/2)\Gamma(1/2)}{\Gamma(1)}} = \frac{1}{n!}\frac{\Gamma(k+1/2)}{\Gamma(1/2)}\frac{\Gamma(n-k+1/2)}{\Gamma(1/2)}},

where I have used that \Gamma(m)=(m-1)! if m is a natural number. The Gamma function \Gamma(x) also satisfies the property \Gamma(x+1)=x\Gamma(x). Using this repeatedly gives

\displaystyle{\Gamma(k+1/2) = (k-1/2) \times (k-3/2) \times \cdots \times \frac{3}{2}\times\frac{1}{2}\times\Gamma(1/2) }.

This means that

\displaystyle{\frac{\Gamma(k+1/2)}{\Gamma(1/2)} = (k-1/2) \times (k-3/2) \times \cdots \times \frac{3}{2}\times\frac{1}{2} = \frac{(2k-1)\times(2k-3)\times \cdots \times 3 \times 1}{2^k}=\frac{(2k-1)!!}{2^k}},

where (2k-1)!!=(2k-1)\times (2k-3)\times\cdots \times 3 \times 1 is the double factorial. The same reasoning gives

\displaystyle{\frac{\Gamma(n-k+1/2)}{\Gamma(1/2)} =\frac{(2n-2k-1)!!}{2^{n-k}}}.

And so

\displaystyle{q_{1/2,1/2,n}(k) =\frac{1}{2^nk!(n-k)!}(2k-1)!!(2n-2k-1)!!}.

We’ll now show that p_n(k) is also equal to the above final expression. Recall

\displaystyle{p_n(k) = \frac{1}{2^{2n}} \binom{2k}{k}\binom{2(n-k)}{n-k} = \frac{1}{2^{2n}}\frac{(2k)!(2(n-k))!}{k!k!(n-k)!(n-k)!} = \frac{1}{2^nk!(n-k)!}\frac{(2k)!}{k!2^k}\frac{(2n-2k)!}{(n-k)!2^{n-k}}}.

And so it suffices to show \frac{(2k)!}{k!2^k} = (2k-1)!! (and hence \frac{(2n-2k)!}{(n-k)!2^{n-k}}=(2n-2k-1)!!). To see why this last claim holds, note that

\displaystyle{\frac{(2k)!}{k!2^k} = \frac{(2k)\times (2k-1)\times(2k-2)\times\cdots\times 3 \times 2 \times 1}{(2k)\times (2k-2)\times \cdots \times 2} = (2k-1)!!}

This shows that p_{n}(k)=q_{1/2,1/2,n}(k), as claimed.

The sample size required for importance sampling

My last post was about using importance sampling to estimate the volume of a high-dimensional ball. The two figures below compare plain Monte Carlo to using importance sampling with a Gaussian proposal. Both plots use M=1,000 samples to estimate v_n, the volume of an n-dimensional ball.

A friend of mine pointed out that the relative error does not seem to increase with the dimension n. He thought it was too good to be true. It turns out he was right and the relative error does increase with dimension but it increases very slowly. To estimate v_n the number of samples needs to grow on the order of \sqrt{n}.

To prove this, we will use the paper The sample size required in importance sampling by Chatterjee and Diaconis [1]. This paper shows that the sample size for importance sampling is determined by the Kullback-Leibler divergence. The relevant result from their paper is Theorem 1.3. This theorem is about the relative error in using importance sampling to estimate a probability.

In our setting the proposal distribution is Q=\mathcal{N}(0,\frac{1}{n}I_n). That is, the distribution Q is an n-dimensional Gaussian vector with mean 0 and covariance \frac{1}{n}I_n. The conditional target distribution is P, the uniform distribution on the n-dimensional ball (that is, the uniform distribution on the cube conditioned to lie in the ball). Theorem 1.3 in [1] tells us how many samples are needed to estimate v_n. Informally, the required sample size is M = O(\exp(D(P \Vert Q))). Here D(P\Vert Q) is the Kullback-Leibler divergence between P and Q.

To use this theorem we need to compute D(P \Vert Q). The Kullback-Leibler divergence is defined as an integral. Specifically

\displaystyle{D(P\Vert Q) = \int_{\mathbb{R}^n} \log\frac{P(x)}{Q(x)}P(x)dx}

Computing the high-dimensional integral above looks challenging. Fortunately, it can be reduced to a one-dimensional integral. This is because both the distributions P and Q are rotationally symmetric. To use this, define P_R, Q_R to be the distributions of the norm squared under P and Q. That is, if X \sim P, then \Vert X \Vert_2^2 \sim P_R, and likewise for Q_R. By the rotational symmetry of P and Q we have

D(P\Vert Q) = D(P_R \Vert Q_R).

We can work out both P_R and Q_R. The distribution P is the uniform distribution on the n-dimensional ball. And so for X \sim P and any r \in [0,1]

\mathbb{P}(\Vert X \Vert_2^2 \le r) = \mathbb{P}(\Vert X \Vert_2 \le \sqrt{r}) = \frac{v_n r^{n/2}}{v_n} = r^{n/2}.

This implies that P_R has density P_R(r)=\frac{n}{2}r^{n/2-1}, which means that P_R is a Beta distribution with parameters \alpha = n/2, \beta = 1. The distribution Q is a multivariate Gaussian distribution with mean 0 and covariance \frac{1}{n}I_n. This means that if X \sim Q, then \Vert X \Vert_2^2 = \sum_{i=1}^n X_i^2 is a chi-squared variable with n degrees of freedom scaled by 1/n. The density of Q_R is therefore

Q_R(r) = \frac{n^{n/2}}{2^{n/2}\Gamma(n/2)}r^{n/2-1}e^{-nr/2}

The Kullback-Leibler divergence between P and Q is therefore

\displaystyle{D(P\Vert Q)=D(P_R\Vert Q_R) = \int_0^1 \log \frac{P_R(r)}{Q_R(r)} P_R(r)dr}

Getting Mathematica to do the above integral gives

D(P \Vert Q) = \frac{n^2}{2(n+2)} + \frac{n}{2}\log 2 - \frac{n}{2}\log n + \log \Gamma\left(\frac{n}{2}+1\right).

Using the approximation \log \Gamma(z) \approx (z-\frac{1}{2})\log(z)-z+O(1) we get that for large n

D(P \Vert Q) = \frac{1}{2}\log n + O(1).

And so the required number of samples is O(\exp(D(P \Vert Q))) = O(\sqrt{n}).
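
As a sanity check of the closed form (a sketch assuming SciPy; kl_closed_form and kl_numerical are my own helper names), the formula can be compared with a direct numerical evaluation of the one-dimensional integral for D(P_R \Vert Q_R):

import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def kl_closed_form(n):
    # n^2/(2(n+2)) + (n/2) log 2 - (n/2) log n + log Gamma(n/2 + 1)
    return n**2 / (2 * (n + 2)) + 0.5 * n * np.log(2) - 0.5 * n * np.log(n) + gammaln(n / 2 + 1)

def kl_numerical(n):
    # log densities of the norm squared under P (uniform on the ball) and Q (N(0, I_n/n))
    log_p = lambda r: np.log(n / 2) + (n / 2 - 1) * np.log(r)
    log_q = lambda r: (n / 2) * np.log(n / 2) - gammaln(n / 2) + (n / 2 - 1) * np.log(r) - n * r / 2
    integrand = lambda r: np.exp(log_p(r)) * (log_p(r) - log_q(r))
    return quad(integrand, 0, 1)[0]

for n in (2, 5, 10, 50):
    print(n, round(kl_closed_form(n), 6), round(kl_numerical(n), 6))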

[1] Chatterjee, Sourav, and Persi Diaconis. “The sample size required in importance sampling.” The Annals of Applied Probability 28, no. 2 (2018): 1099–1135. https://www.jstor.org/stable/26542331. (Public preprint: https://arxiv.org/abs/1511.01437)

Monte Carlo integration in high dimensions

Monte Carlo integration

John Cook recently wrote a cautionary blog post about using Monte Carlo to estimate the volume of a high-dimensional ball. He points out that if \mathbf{X}=(X_1,\ldots,X_n) are independent and uniformly distributed on the interval [-1,1] then

\displaystyle{\mathbb{P}(X_1^2 + X_2^2 + \cdots + X_n^2 \le 1) = \frac{v_n}{2^n}},

where v_n is the volume of an n-dimensional ball with radius one. This observation means that we can use Monte Carlo to estimate v_n.

To do this we repeatedly sample vectors \mathbf{X}_m = (X_{m,1},\ldots,X_{m,n}) with X_{m,i} uniform on [-1,1] and m ranging from 1 to M. Next, we count the proportion of vectors \mathbf{X}_m with X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1. Specifically, if Z_m is equal to one when X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1 and equal to zero otherwise, then by the law of large numbers

\displaystyle{\frac{1}{M}\sum_{m=1}^M Z_m \approx \frac{v_n}{2^n}}

Which implies

v_n \approx 2^n \frac{1}{M}\sum_{m=1}^M Z_m

This method of approximating a volume or integral by sampling and counting is called Monte Carlo integration and is a powerful general tool in scientific computing.
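
In code, the plain Monte Carlo estimate looks roughly like this (a sketch assuming NumPy; ball_volume_plain_mc is a hypothetical helper name):

import numpy as np

def ball_volume_plain_mc(n, M=1_000, rng=None):
    # sample M points uniformly from [-1, 1]^n and count the fraction inside the ball
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(-1.0, 1.0, size=(M, n))
    z = np.sum(x**2, axis=1) <= 1.0
    return 2.0**n * np.mean(z)

print(ball_volume_plain_mc(3, M=100_000))  # roughly 4*pi/3, the volume of the unit ball in 3D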

The problem with Monte Carlo integration

As pointed out by John, Monte Carlo integration does not work very well in this example. The plot below shows a large difference between the true value of v_n with n ranging from 1 to 20 and the Monte Carlo approximation with M = 1,000.

The problem is that v_n is very small and the probability v_n/2^n is even smaller! For example when n = 10, v_n/2^n \approx 0.0025. This means that in our one thousand samples we only expect two or three occurrences of the event X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1. As a result our estimate has a high variance.

The results get even worse as n increases. The probability v_n/2^n goes to zero faster than exponentially. Even with a large value of M, our estimate will typically be zero. Since v_n \neq 0, the relative error in the approximation is 100%.

Importance sampling

Monte Carlo can still be used to approximate v_n. Instead of using plain Monte Carlo, we can use a variance reduction technique called importance sampling (IS). Instead of sampling the points X_{m,1},\ldots, X_{m,n} from the uniform distribution on [-1,1], we can sample them from some other distribution called a proposal distribution. The proposal distribution should be chosen so that the event X_{m,1}^2 +\cdots +X_{m,n}^2 \le 1 becomes more likely.

In importance sampling, we need to correct for the fact that we are using a new distribution instead of the uniform distribution. Instead of counting the number of times X_{m,1}^2+\cdots+X_{m,n}^2 \le 1, we give a weight to each sample and then add up the weights.

If p is the density of the uniform distribution on [-1,1] (the target distribution) and q is the density of the proposal distribution, then the IS Monte Carlo estimate of v_n is

\displaystyle{\frac{2^n}{M}\sum_{m=1}^M Z_m \prod_{i=1}^n \frac{p(X_{m,i})}{q(X_{m,i})}},

where, as before, Z_m is one if X_{m,1}^2+\cdots +X_{m,n}^2 \le 1 and Z_m is zero otherwise. The factor 2^n plays the same role as in the plain Monte Carlo estimate, because the weighted average of the Z_m on its own estimates v_n/2^n. As long as q(x)=0 implies p(x)=0, the IS Monte Carlo estimate will be an unbiased estimate of v_n. More importantly, a good choice of the proposal distribution q can drastically reduce the variance of the IS estimate compared to the plain Monte Carlo estimate.

In this example, a good choice of proposal distribution is the normal distribution with mean 0 and variance 1/n. Under this proposal distribution, the expected value of X_{m,1}^2 +\cdots+ X_{m,n}^2 is one and so the event X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1 is much more likely.
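
Here is a sketch of the importance sampling estimator with this proposal (assuming NumPy and SciPy are available; ball_volume_is is a hypothetical helper name and the exact volumes are only used for comparison):

import numpy as np
from scipy.special import gamma

def ball_volume_is(n, M=1_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, 1.0 / np.sqrt(n), size=(M, n))  # samples from the proposal q
    z = np.sum(x**2, axis=1) <= 1.0                     # indicator of the ball
    # log density ratio per coordinate: p is uniform on [-1, 1], q is N(0, 1/n).
    # Points outside [-1, 1]^n are automatically outside the ball, so z is zero there.
    log_q = 0.5 * np.log(n / (2 * np.pi)) - 0.5 * n * x**2
    log_w = np.sum(np.log(0.5) - log_q, axis=1)
    return 2.0**n * np.mean(z * np.exp(log_w))

for n in (5, 10, 20):
    exact = np.pi ** (n / 2) / gamma(n / 2 + 1)         # closed form, for comparison
    print(n, round(exact, 5), round(ball_volume_is(n), 5))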

Here are the IS Monte Carlo estimates with again M = 1,000 and n ranging from 1 to 20. The results speak for themselves.

The relative error is typically less than 10%. A big improvement over the 100% relative error of plain Monte Carlo.

The next plot shows a close agreement between v_n and the IS Monte Carlo approximation on the log scale with n going all the way up to 100.

Notes

  1. There are exact formulas for v_n (available on Wikipedia). I used these to compare the approximations and compute relative errors. There are related problems where no formulas exist and Monte Carlo integration is one of the only ways to get an approximate answer.
  2. The post by John Cook also talks about why the central limit theorem can’t be used to approximate v_n. I initially thought a technique called large deviations could be used to approximate v_n but again this does not directly apply. I was happy to discover that importance sampling worked so well!


Uniformly sampling orthogonal matrices

An n \times n matrix M \in \mathbb{R}^{n \times n} is orthogonal if M^\top M = I. The set of all n \times n orthogonal matrices is a compact group written as O_n. The uniform distribution on O_n is called Haar measure. There are many ways to generate random matrices distributed according to Haar measure. One of these is based on the Gram-Schmidt algorithm.

Proposition. Let X be an n \times n matrix such that each entry X_{ij} is an independent standard normal random variable. Let x_1,x_2,\ldots,x_n \in \mathbb{R}^n be the columns of X and let q_1,q_2,\ldots,q_n \in \mathbb{R}^n be the result of applying Gram-Schmidt to x_1,x_2,\ldots,x_n. Then the matrix M=[q_1,q_2,\ldots,q_n] \in O_n is distributed according to Haar measure.

Another way of stating the above proposition is that if X has i.i.d. standard normal entries, and Q,R is a QR-factorization of X calculated using Gram-Schmidt, then Q \in O_n is distributed according to Haar measure. However, most numerical libraries do not use Gram-Schmidt to calculate the QR-factorization of a matrix. This means that if you generate a random X and then use your computer’s QR-factorization function, then the result might not be Haar distributed as the plot below shows.

The plot shows an estimate of the distribution of \sqrt{n}M_{11}, the top-left entry of a matrix M. When M is correctly sampled, the distribution is symmetric about 0. When M is incorrectly sampled the distribution is clearly biased towards negative numbers.

Fortunately there is a fix described in [1]! We can first use any QR-factorization algorithm to produce Q,R with X=QR. We then compute a diagonal matrix L with L_{ii} = \mathrm{Sign}(R_{ii}). Then, the matrix M=QL is Haar distributed. The following python code thus samples a Haar distributed matrix in O_n.

import numpy as np

def sample_M(n):
    X = np.random.randn(n, n)    # i.i.d. standard normal entries
    Q, R = np.linalg.qr(X)
    L = np.sign(np.diag(R))      # signs of the diagonal of R
    return Q * L[None, :]        # multiply column j of Q by sign(R_jj)
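
For example, a quick check that the output is indeed orthogonal:

M = sample_M(4)
print(np.allclose(M.T @ M, np.eye(4)))  # True: the columns of M are orthonormal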

References

[1] Mezzadri, Francesco. “How to generate random matrices from the classical compact groups.” arXiv preprint math-ph/0609050 (2006). https://arxiv.org/abs/math-ph/0609050

I recently gave a talk at the Stanford student probability seminar on Haar distributed random matrices. My notes and many further references are available here.