The Multivariate Hypergeometric Distribution

The hypergeometric distribution describes the number of white balls in a sample of n balls drawn without replacement from an urn containing m_W white balls and m_B black balls. The probability of having x white balls in the sample is:

\displaystyle{f(x; m_W, m_B, n) = \frac{\binom{m_W}{x} \binom{m_B}{n-x}}{\binom{m_W+m_B}{n}}}

This is because there are \binom{m_W+m_B}{n} possible ways to sample n balls from the urn, which has m_W+m_B balls in total. Of these possibilities, there are \binom{m_W}{x} \binom{m_B}{n-x} ways of selecting exactly x white balls.
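The formula can be checked numerically with Python's math.comb; a minimal sketch (the helper name hypergeom_pmf is mine):

```python
from math import comb

def hypergeom_pmf(x, m_w, m_b, n):
    # f(x; m_W, m_B, n) = C(m_W, x) C(m_B, n - x) / C(m_W + m_B, n)
    return comb(m_w, x) * comb(m_b, n - x) / comb(m_w + m_b, n)

# Sanity check: the probabilities over all feasible x sum to 1,
# which is Vandermonde's identity in disguise.
m_w, m_b, n = 5, 7, 4
total = sum(hypergeom_pmf(x, m_w, m_b, n) for x in range(n + 1))
```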

The multivariate hypergeometric distribution is a generalization of the hypergeometric distribution. In the multivariate hypergeometric distribution, there are k colours instead of 2. Suppose that the urn contains m_1 balls of the first colour, m_2 balls of the second colour and so on. The total number of balls in the urn is M = m_1+m_2+\cdots + m_k. Next we draw n balls from the urn. Let X_i be the number of balls of colour i in our sample. The vector X=(X_1,X_2,\ldots, X_k) follows the multivariate hypergeometric distribution. The vector X satisfies the constraints X_i \in \{0,1,\ldots,m_i\} and \sum_{i=1}^k X_i = n. The probability of observing X=x is

\displaystyle{f(x; m, n) = \frac{\binom{m_1}{x_1} \binom{m_2}{x_2}\cdots \binom{m_k}{x_k}}{\binom{M}{n} }, \quad x_1+x_2+\cdots+x_k = n}

This is because there are \binom{M}{n} possible ways of sampling n balls from the urn, which has M balls in total. Likewise, the number of ways of choosing x_i balls of colour i is \binom{m_i}{x_i}.
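The multivariate probability mass function can be checked the same way; a minimal sketch (the helper name mv_hypergeom_pmf is mine):

```python
from math import comb, prod
from itertools import product

def mv_hypergeom_pmf(x, m):
    # f(x; m, n) = [prod_i C(m_i, x_i)] / C(M, n), with n = sum(x), M = sum(m)
    n, M = sum(x), sum(m)
    return prod(comb(mi, xi) for mi, xi in zip(m, x)) / comb(M, n)

# Sanity check: probabilities over all feasible x sum to 1.
m, n = [3, 4, 5], 6
total = sum(
    mv_hypergeom_pmf(x, m)
    for x in product(*(range(mi + 1) for mi in m))
    if sum(x) == n
)
```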

Sampling

There are two common methods for sampling from the multivariate hypergeometric distribution. The first option is to sample n numbers from 1,\ldots,M without replacement. The numbers 1,\ldots,M represent the balls in the urn. We can let 1,\ldots,m_1 represent the balls of colour 1, let m_1+1,\ldots,m_1+m_2 represent the balls of colour 2 and so on. Suppose that Y_1,\ldots,Y_n is a sample of size n drawn without replacement from 1,\ldots,M. Next define:

\displaystyle{X_i = \#\{j : m_1+m_2+\cdots+m_{i-1} < Y_j \le m_1+m_2+\cdots+m_{i-1}+m_i\}}

Then the vector X=(X_1,\ldots,X_k) will be distributed according to the multivariate hypergeometric distribution. This method of sampling has complexity on the order of n\log(M): sampling a single random integer from 1,\ldots,M has complexity on the order of \log(M), and we need n such integers.
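This label-based sampler can be sketched in NumPy as follows (the function name mv_hypergeom_by_labels and the use of searchsorted/bincount for the counting step are my own choices):

```python
import numpy as np

def mv_hypergeom_by_labels(m, n, rng=None):
    # Sample n labels from 1..M without replacement, then count by colour.
    rng = np.random.default_rng() if rng is None else rng
    M = int(np.sum(m))
    y = rng.choice(M, size=n, replace=False) + 1  # labels 1..M
    # Labels in (m_1+...+m_{i-1}, m_1+...+m_i] belong to colour i.
    bounds = np.cumsum(m)
    idx = np.searchsorted(bounds, y, side="left")
    return np.bincount(idx, minlength=len(m))

x = mv_hypergeom_by_labels([10, 20, 30], 15)
```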

There is a second method of sampling that is more efficient when n is large and k is small. This is a sequential method that samples each X_i from a (univariate) hypergeometric distribution. The idea is that when sampling X_1 we can treat the balls of colour 1 as “white” and all the other colours as “black”. This means that marginally, X_1 is hypergeometric with parameters m_W = m_1, m_B = m_2+\cdots + m_k and n = n. Furthermore, conditional on X_1, the second entry, X_2, is also hypergeometric. The parameters for X_2 are m_W = m_2, m_B = m_3+\cdots+m_k and n=n-X_1. In fact, all of the conditionals of X are hypergeometric random variables and so can be sampled sequentially. The Python code at the end of this post implements the sequential sampler.

The complexity of the sequential method is bounded above by k times the complexity of sampling from the univariate hypergeometric distribution. Using the ratio-of-uniforms method, this can be done in complexity on the order of \log(m_W+m_B). The overall complexity of the sequential method is therefore on the order of k\log(M), which can be much smaller than n\log(M).
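For reference, NumPy's Generator interface (NumPy ≥ 1.18) ships both strategies in multivariate_hypergeometric: method="count" is the label-sampling approach, and method="marginals" (the default) is the sequential one.

```python
import numpy as np

rng = np.random.default_rng(0)
m = [10, 20, 30]  # balls of each colour
n = 15            # number of balls drawn

# "count" labels every ball and tallies a without-replacement sample;
# "marginals" draws each colour in turn from a univariate hypergeometric.
x_count = rng.multivariate_hypergeometric(m, n, method="count")
x_marginals = rng.multivariate_hypergeometric(m, n, method="marginals")
```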

import numpy as np

def multivariate_hypergeometric_sample(m, n):
    """Sequentially sample X ~ multivariate hypergeometric(m, n)."""
    m = np.asarray(m)
    x = np.zeros_like(m)
    k = len(m)
    for i in range(k - 1):
        if n == 0:
            break  # np.random.hypergeometric requires nsample >= 1
        # Colour i is "white"; the remaining colours are pooled as "black".
        x[i] = np.random.hypergeometric(m[i], np.sum(m[(i + 1):]), n)
        n -= x[i]
    x[k - 1] = n  # whatever remains must be the last colour
    return x
