The Multivariate Hypergeometric Distribution

The hypergeometric distribution describes the number of white balls in a sample of n balls drawn without replacement from an urn containing m_W white balls and m_B black balls. The probability of having x white balls in the sample is:

\displaystyle{f(x; m_W, m_B, n) = \frac{\binom{m_W}{x} \binom{m_B}{n-x}}{\binom{m_W+m_B}{n}}}

This is because there are \binom{m_W+m_B}{n} possible ways to sample n balls from the urn, which has m_W+m_B balls in total. Of these possibilities, there are \binom{m_W}{x} \binom{m_B}{n-x} ways of selecting exactly x white balls.
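The formula can be checked numerically with Python's math.comb; a minimal sketch (the helper name hypergeom_pmf is mine):

```python
from math import comb

def hypergeom_pmf(x, m_w, m_b, n):
    # f(x; m_W, m_B, n) = C(m_W, x) C(m_B, n - x) / C(m_W + m_B, n)
    return comb(m_w, x) * comb(m_b, n - x) / comb(m_w + m_b, n)

# Sanity check: the probabilities over all feasible x sum to 1,
# which is Vandermonde's identity in disguise.
m_w, m_b, n = 5, 7, 4
total = sum(hypergeom_pmf(x, m_w, m_b, n) for x in range(n + 1))
```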

The multivariate hypergeometric distribution is a generalization of the hypergeometric distribution. In the multivariate hypergeometric distribution, there are k colours instead of 2. Suppose that the urn contains m_1 balls of the first colour, m_2 balls of the second colour and so on. The total number of balls in the urn is M = m_1+m_2+\cdots + m_k. Next we draw n balls from the urn. Let X_i be the number of balls of colour i in our sample. The vector X=(X_1,X_2,\ldots, X_k) follows the multivariate hypergeometric distribution. The vector X satisfies the constraints X_i \in \{0,1,\ldots,m_i\} and \sum_{i=1}^k X_i = n. The probability of observing X=x is

\displaystyle{f(x; m, n) = \frac{\binom{m_1}{x_1} \binom{m_2}{x_2}\cdots \binom{m_k}{x_k}}{\binom{M}{n} }, \quad x_1+x_2+\cdots+x_k = n}

This is because there are \binom{M}{n} possible ways of sampling n balls from the urn, which has M balls in total. Likewise, the number of ways of choosing x_i balls of colour i is \binom{m_i}{x_i}.
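The multivariate probability mass function can be checked the same way; a minimal sketch (the helper name mv_hypergeom_pmf is mine):

```python
from math import comb, prod
from itertools import product

def mv_hypergeom_pmf(x, m):
    # f(x; m, n) = [prod_i C(m_i, x_i)] / C(M, n), with n = sum(x), M = sum(m)
    n, M = sum(x), sum(m)
    return prod(comb(mi, xi) for mi, xi in zip(m, x)) / comb(M, n)

# Sanity check: probabilities over all feasible x sum to 1.
m, n = [3, 4, 5], 6
total = sum(
    mv_hypergeom_pmf(x, m)
    for x in product(*(range(mi + 1) for mi in m))
    if sum(x) == n
)
```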

Sampling

There are two common methods for sampling from the multivariate hypergeometric distribution. The first option is to sample n numbers from 1,\ldots,M without replacement. The numbers 1,\ldots,M represent the balls in the urn. We can let 1,\ldots,m_1 represent the balls of colour 1, let m_1+1,\ldots,m_1+m_2 represent the balls of colour 2 and so on. Suppose that Y_1,\ldots,Y_n is a sample of size n drawn without replacement from 1,\ldots,M. Next define:

\displaystyle{X_i = \#\{j : m_1+m_2+\cdots+m_{i-1} < Y_j \le m_1+m_2+\cdots+m_{i-1}+m_i\}}

Then the vector X=(X_1,\ldots,X_k) will be distributed according to the multivariate hypergeometric distribution. This method of sampling has complexity on the order of n\log(M): sampling a single random integer from 1,\ldots,M has complexity on the order of \log(M), and we need n such integers.
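This label-based sampler can be sketched in NumPy as follows (the function name mv_hypergeom_by_labels and the use of searchsorted/bincount for the counting step are my own choices):

```python
import numpy as np

def mv_hypergeom_by_labels(m, n, rng=None):
    # Sample n labels from 1..M without replacement, then count by colour.
    rng = np.random.default_rng() if rng is None else rng
    M = int(np.sum(m))
    y = rng.choice(M, size=n, replace=False) + 1  # labels 1..M
    # Labels in (m_1+...+m_{i-1}, m_1+...+m_i] belong to colour i.
    bounds = np.cumsum(m)
    idx = np.searchsorted(bounds, y, side="left")
    return np.bincount(idx, minlength=len(m))

x = mv_hypergeom_by_labels([10, 20, 30], 15)
```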

There is a second method of sampling that is more efficient when n is large and k is small. This is a sequential method that samples each X_i from a (univariate) hypergeometric distribution. The idea is that when sampling X_1 we can treat the balls of colour 1 as “white” and all the other colours as “black”. This means that marginally, X_1 is hypergeometric with parameters m_W = m_1, m_B = m_2+\cdots + m_k and n = n. Furthermore, conditional on X_1, the second entry, X_2, is also hypergeometric. The parameters for X_2 are m_W = m_2, m_B = m_3+\cdots+m_k and n=n-X_1. In fact, all of the conditionals of X are hypergeometric random variables and so can be sampled sequentially. The Python code at the end of this post implements the sequential sampler.

The complexity of the sequential method is bounded above by k times the complexity of sampling from the univariate hypergeometric distribution. Using the ratio-of-uniforms method, this can be done in complexity on the order of \log(m_W+m_B). The overall complexity of the sequential method is therefore on the order of k\log(M), which can be much smaller than n\log(M).
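For reference, NumPy's Generator interface (NumPy ≥ 1.18) ships both strategies in multivariate_hypergeometric: method="count" is the label-sampling approach, and method="marginals" (the default) is the sequential one.

```python
import numpy as np

rng = np.random.default_rng(0)
m = [10, 20, 30]  # balls of each colour
n = 15            # number of balls drawn

# "count" labels every ball and tallies a without-replacement sample;
# "marginals" draws each colour in turn from a univariate hypergeometric.
x_count = rng.multivariate_hypergeometric(m, n, method="count")
x_marginals = rng.multivariate_hypergeometric(m, n, method="marginals")
```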

import numpy as np

def multivariate_hypergeometric_sample(m, n):
    """Sequentially sample X ~ multivariate hypergeometric(m, n)."""
    m = np.asarray(m)
    x = np.zeros_like(m)
    k = len(m)
    for i in range(k - 1):
        if n == 0:
            break  # np.random.hypergeometric requires nsample >= 1
        # Colour i is "white"; the remaining colours are pooled as "black".
        x[i] = np.random.hypergeometric(m[i], np.sum(m[(i + 1):]), n)
        n -= x[i]
    x[k - 1] = n  # whatever remains must be the last colour
    return x
