Maximum likelihood estimation and the method of moments

Maximum likelihood and the method of moments are two ways to estimate parameters from data. In general, the two methods can produce different estimates, but for one-dimensional exponential families they coincide.

Suppose that \{P_\theta\}_{\theta \in \Omega} is a one-dimensional exponential family written in canonical form. That is, \Omega \subseteq \mathbb{R} and there exists a reference measure \mu such that each distribution P_\theta has a density p_\theta with respect to \mu, given by

p_\theta(x) = h(x)\exp\left(\theta T(x)-A(\theta)\right).

The random variable T(X) is a sufficient statistic for the model X \sim P_\theta. The function A(\theta) is the log-partition function for the family \{P_\theta\}_{\theta \in \Omega}. The condition, \int p_\theta(x)\mu(dx)=1 implies that

A(\theta) = \log\left(\int h(x)\exp(\theta T(x))\mu(dx) \right).

It turns out that the function A(\theta) is differentiable and that the order of differentiation and integration can be exchanged. This implies that

A'(\theta) = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\int h(x)\exp(\theta T(x))\mu(dx)} = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\exp(A(\theta))}.

Note that \int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx) = \int T(x)h(x)\exp(\theta T(x))\mu(dx). Thus,

A'(\theta) = \int T(x)h(x) \exp(\theta T(x)-A(\theta))\mu(dx) = \int T(x)p_\theta(x)\mu(dx).

This means that A'(\theta) = \mathbb{E}_\theta[T(X)], the expectation of T(X) under X \sim P_\theta.
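The identity A'(\theta) = \mathbb{E}_\theta[T(X)] can be checked numerically. As a worked illustration (not from the text), take the Poisson family in canonical form: with reference measure \mu counting measure on \{0,1,2,\ldots\}, h(x) = 1/x!, T(x) = x, the defining integral gives A(\theta) = \log\sum_x e^{\theta x}/x! = e^\theta. The sketch below computes A from its definition, differentiates it numerically, and compares with \mathbb{E}_\theta[T(X)]:

```python
import math

# Worked example (assumed, not from the text): Poisson family in canonical
# form, h(x) = 1/x!, T(x) = x.  The log-partition function is computed
# directly from its defining sum, truncated after `terms` terms.

def A(theta, terms=200):
    """Log-partition function A(theta) = log sum_x exp(theta*x)/x!."""
    return math.log(sum(math.exp(theta * x - math.lgamma(x + 1))
                        for x in range(terms)))

def mean_T(theta, terms=200):
    """E_theta[T(X)] computed directly as sum_x x * p_theta(x)."""
    a = A(theta)
    return sum(x * math.exp(theta * x - math.lgamma(x + 1) - a)
               for x in range(terms))

theta = 0.7
eps = 1e-6
A_prime = (A(theta + eps) - A(theta - eps)) / (2 * eps)  # central difference

# All three quantities agree: A'(theta), E_theta[T(X)], and the closed
# form e^theta for the Poisson family.
print(A_prime, mean_T(theta), math.exp(theta))
```

Here the closed form A(\theta) = e^\theta gives A'(\theta) = e^\theta, matching both the numerical derivative and the directly computed expectation.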

Now suppose that we have an i.i.d. sample X_1,\ldots, X_n \sim P_\theta and we want to use this sample to estimate \theta. One way to estimate \theta is by maximum likelihood. That is, we choose the value of \theta that maximises the likelihood,

L(\theta|X_1,\ldots,X_n) = \prod_{i=1}^n p_\theta(X_i).

When using the maximum likelihood estimator, it is often easier to work with the log-likelihood. The log-likelihood is,

\log L(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \log\left(p_\theta(X_i)\right) = \sum_{i=1}^n \left(\log(h(X_i))+\theta T(X_i) - A(\theta)\right).

Maximising the likelihood is equivalent to maximising the log-likelihood. For exponential families, the log-likelihood is a concave function of \theta, so the maximiser can be found by differentiating and solving the first-order equation. Note that,

\frac{d}{d\theta} \log L(\theta|X_1,\ldots,X_n) =\sum_{i=1}^n \left(T(X_i)-A'(\theta)\right) = -nA'(\theta) + \sum_{i=1}^n T(X_i).
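Differentiating once more makes the concavity claim explicit. Exchanging differentiation and integration again (a standard fact for exponential families, stated here as a brief aside) identifies the second derivative of A as a variance:

\frac{d^2}{d\theta^2} \log L(\theta|X_1,\ldots,X_n) = -nA''(\theta), \qquad A''(\theta) = \mathbb{E}_\theta[T(X)^2] - \mathbb{E}_\theta[T(X)]^2 = \mathrm{Var}_\theta(T(X)) \ge 0.

Since the second derivative of the log-likelihood is non-positive, any stationary point is a global maximiser.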

Thus the maximum likelihood estimate (MLE) \widehat{\theta} solves the equation,

-nA'(\widehat{\theta}) + \sum_{i=1}^n T(X_i) = 0.

But recall that A'(\widehat{\theta}) = \mathbb{E}_{\widehat{\theta}}[T(X)]. Thus the MLE is the solution to the equation,

\mathbb{E}_{\widehat{\theta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(X_i).

Thus the MLE is the value of \theta for which the expectation of T(X) matches the empirical average from our sample. That is, the maximum likelihood estimator for an exponential family is a method of moments estimator. Specifically, the maximum likelihood estimator matches the first moment of the sufficient statistic T(X).
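This moment-matching recipe can be carried out concretely. As an assumed illustration, take the Bernoulli family in canonical form: T(x) = x and A(\theta) = \log(1+e^\theta), so \mathbb{E}_\theta[T(X)] = A'(\theta) = e^\theta/(1+e^\theta). The sketch below solves the moment equation by bisection and compares with the closed-form solution \widehat{\theta} = \mathrm{logit}(\bar{X}):

```python
import math
import random

# Assumed example (not from the text): Bernoulli family in canonical form,
# with E_theta[T(X)] = A'(theta) = e^theta / (1 + e^theta).

def mean_T(theta):
    """A'(theta) for the canonical Bernoulli family (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-theta))

random.seed(0)
true_theta = 1.0                  # success probability mean_T(1.0) ~ 0.731
sample = [1 if random.random() < mean_T(true_theta) else 0
          for _ in range(10_000)]
target = sum(sample) / len(sample)    # empirical average of the T(X_i)

# Solve E_theta[T(X)] = target by bisection; A' is strictly increasing.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_T(mid) < target:
        lo = mid
    else:
        hi = mid
theta_hat = (lo + hi) / 2

# Closed form: inverting the logistic function gives the logit of the mean.
closed_form = math.log(target / (1 - target))
print(theta_hat, closed_form)     # the two solutions agree
```

The bisection route is how one would solve the moment equation when A' has no convenient inverse; here the closed form confirms it.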

A counterexample

It is a special property of exponential families that the MLE is a method of moments estimator for the sufficient statistic. When we leave the nice world of exponential families, the two estimators may differ.

Suppose that we have data X_1,\ldots,X_n \sim P_\theta where P_\theta is the uniform distribution on [0,\theta]. A minimal sufficient statistic for this model is X_{(n)}, the maximum of X_1,\ldots, X_n. Given what we saw before, we might imagine that the MLE for this model would be a method of moments estimator for X_{(n)}, but this isn’t the case.

The likelihood for X_1,\ldots,X_n is,

L(\theta|X_1,\ldots,X_n) = \begin{cases} \frac{1}{\theta^n} & \text{if } X_{(n)} \le \theta,\\ 0 & \text{if } X_{(n)} > \theta. \end{cases}

Thus the MLE is \widehat{\theta} = X_{(n)}. However, under P_\theta, \frac{1}{\theta}X_{(n)} has a \text{Beta}(n,1) distribution. Thus, \mathbb{E}_\theta[X_{(n)}] = \theta\frac{n}{n+1}, so the method of moments estimator would be \widehat{\theta}' = \frac{n+1}{n}X_{(n)} \neq X_{(n)} = \widehat{\theta}.
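The gap between the two estimators is easy to see by simulation. The sketch below (an assumed illustration with arbitrary choices \theta = 2 and n = 10) averages both estimators over many samples: the MLE X_{(n)} is biased downwards towards \theta\frac{n}{n+1}, while the method of moments estimator \frac{n+1}{n}X_{(n)} is unbiased.

```python
import random

# Assumed illustration: Uniform[0, theta] with theta = 2 and n = 10.
# Compare the MLE X_(n) with the method of moments estimator
# ((n + 1)/n) * X_(n) by averaging each over many simulated samples.

random.seed(1)
theta, n, reps = 2.0, 10, 20_000

mle_sum = mom_sum = 0.0
for _ in range(reps):
    x_max = max(random.uniform(0, theta) for _ in range(n))
    mle_sum += x_max                      # MLE: the sample maximum
    mom_sum += (n + 1) / n * x_max        # method of moments estimator

print(mle_sum / reps)   # close to theta * n/(n+1) = 1.818..., not theta
print(mom_sum / reps)   # close to theta = 2.0
```

The simulated averages match the Beta(n,1) calculation above: the MLE concentrates below \theta, and multiplying by \frac{n+1}{n} removes the bias.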