## Maximum likelihood estimation and the method of moments

Maximum likelihood and the method of moments are two ways to estimate parameters from data. In general the two methods can give different estimates, but for one-dimensional exponential families they coincide.

Suppose that $\{P_\theta\}_{\theta \in \Omega}$ is a one-dimensional exponential family written in canonical form. That is, $\Omega \subseteq \mathbb{R}$ and there exists a reference measure $\mu$ such that each distribution $P_\theta$ has a density $p_\theta$ with respect to $\mu$ given by

$p_\theta(x) = h(x)\exp\left(\theta T(x)-A(\theta)\right).$
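For example, the Poisson distribution with mean $\lambda$ can be written in this form. Taking $\mu$ to be counting measure on $\{0,1,2,\ldots\}$, the density is

$p_\lambda(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}\exp\left(x\log(\lambda) - \lambda\right),$

which is canonical form with $h(x) = 1/x!$, $T(x) = x$, $\theta = \log(\lambda)$ and $A(\theta) = e^\theta$.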

The random variable $T(X)$ is a sufficient statistic for the model $X \sim P_\theta$. The function $A(\theta)$ is the log-partition function for the family $\{P_\theta\}_{\theta \in \Omega}$. The condition $\int p_\theta(x)\mu(dx)=1$ implies that

$A(\theta) = \log\left(\int h(x)\exp(\theta T(x))\mu(dx) \right).$

It turns out that the function $A(\theta)$ is differentiable and that differentiation and integration can be interchanged. This implies that

$A'(\theta) = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\int h(x)\exp(\theta T(x))\mu(dx)} = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\exp(A(\theta))}.$

Note that $\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx) = \int T(x)h(x)\exp(\theta T(x))\mu(dx)$. Thus,

$A'(\theta) = \int T(x)h(x) \exp(\theta T(x)-A(\theta))\mu(dx) = \int T(x)p_\theta(x)\mu(dx).$

This means that $A'(\theta) = \mathbb{E}_\theta[T(X)]$, the expectation of $T(X)$ under $X \sim P_\theta$.
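As a concrete check, here is a short numerical sketch for the Poisson family in canonical form, where $h(k)=1/k!$, $T(k)=k$ and $A(\theta)=e^\theta$; the value of $\theta$ and the truncation point of the sums are arbitrary choices for illustration:

```python
import math

# Poisson family in canonical form: h(k) = 1/k!, T(k) = k, A(theta) = exp(theta).
# Sums over k are truncated at K = 100, which is ample for theta = 0.7.
def A(theta, K=100):
    # log-partition function computed directly from its defining sum
    return math.log(sum(math.exp(theta * k) / math.factorial(k) for k in range(K)))

theta = 0.7
eps = 1e-6
A_prime = (A(theta + eps) - A(theta - eps)) / (2 * eps)  # numerical derivative

# E_theta[T(X)] computed directly from the density p_theta(k)
Z = math.exp(A(theta))
E_T = sum(k * math.exp(theta * k) / math.factorial(k) for k in range(100)) / Z

print(A_prime, E_T, math.exp(theta))  # all three agree: A'(theta) = E[T(X)] = e^theta
```

The three printed values agree, matching the identity $A'(\theta) = \mathbb{E}_\theta[T(X)]$ (and the known Poisson mean $e^\theta$).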

Now suppose that we have an i.i.d. sample $X_1,\ldots, X_n \sim P_\theta$ and we want to use this sample to estimate $\theta$. One way to estimate $\theta$ is by maximum likelihood. That is, we choose the value of $\theta$ that maximises the likelihood,

$L(\theta|X_1,\ldots,X_n) = \prod_{i=1}^n p_\theta(X_i).$

When using the maximum likelihood estimator, it is often easier to work with the log-likelihood. The log-likelihood is,

$\log L(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \log\left(p_\theta(X_i)\right) = \sum_{i=1}^n \left(\log(h(X_i))+\theta T(X_i)\right) - nA(\theta)$.

Maximising the likelihood is equivalent to maximising the log-likelihood. For exponential families, the log-likelihood is a concave function of $\theta$, so the maximiser can be found by differentiating and solving the first-order condition. Note that,

$\frac{d}{d\theta} \log L(\theta|X_1,\ldots,X_n) =\sum_{i=1}^n \left(T(X_i)-A'(\theta)\right) = -nA'(\theta) + \sum_{i=1}^n T(X_i).$

Thus the maximum likelihood estimate (MLE) $\widehat{\theta}$ solves the equation,

$-nA'(\widehat{\theta}) + \sum_{i=1}^n T(X_i) = 0.$

But recall that $A'(\widehat{\theta}) = \mathbb{E}_{\widehat{\theta}}[T(X)]$. Thus the MLE is the solution to the equation,

$\mathbb{E}_{\widehat{\theta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(X_i)$.

Thus the MLE is the value of $\theta$ for which the expectation of $T(X)$ matches the empirical average from our sample. That is, for an exponential family the maximum likelihood estimator is a method of moments estimator: it matches the first moment of the sufficient statistic $T(X)$.
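We can see this numerically for the Poisson family, where $T(x) = x$ and $\mathbb{E}_\theta[T(X)] = e^\theta$, so the moment equation gives $\widehat{\theta} = \log(\bar{X})$. The data below are hypothetical counts chosen for illustration:

```python
import math

# For the Poisson family in canonical form (T(x) = x, mean = exp(theta)),
# the MLE solves E_theta_hat[T(X)] = sample mean, i.e. exp(theta_hat) = mean(data).
data = [2, 0, 3, 1, 4, 2, 2, 1]  # hypothetical counts

moment_match = math.log(sum(data) / len(data))  # theta_hat from the moment equation

# Check by maximising the log-likelihood directly over a fine grid of theta values
def log_lik(theta):
    return sum(theta * x - math.exp(theta) for x in data)  # log(h(x)) term dropped

grid = [i / 10000 for i in range(-20000, 20000)]
grid_mle = max(grid, key=log_lik)

print(moment_match, grid_mle)  # both approximately log(1.875) = 0.6286...
```

The grid maximiser agrees with the moment-matching solution up to the grid resolution.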

## A counterexample

It is a special property of exponential families that the MLE is a method of moments estimator for the sufficient statistic. When we leave the nice world of exponential families, the two estimators may differ.

Suppose that we have data $X_1,\ldots,X_n \sim P_\theta$ where $P_\theta$ is the uniform distribution on $[0,\theta]$. A minimal sufficient statistic for this model is $X_{(n)}$, the maximum of $X_1,\ldots, X_n$. Given what we saw before, we might imagine that the MLE for this model would be a method of moments estimator for $X_{(n)}$, but this isn't the case.

The likelihood for $X_1,\ldots,X_n$ is,

$L(\theta|X_1,\ldots,X_n) = \begin{cases} \frac{1}{\theta^n} & \text{if } X_{(n)} \le \theta,\\ 0 & \text{if } X_{(n)} > \theta. \end{cases}$

Thus the MLE is $\widehat{\theta} = X_{(n)}$. However, under $P_\theta$, $\frac{1}{\theta}X_{(n)}$ has a $\text{Beta}(n,1)$ distribution, so $\mathbb{E}_\theta[X_{(n)}] = \frac{n}{n+1}\theta$. The method of moments estimator for $X_{(n)}$ is therefore $\widehat{\theta}' = \frac{n+1}{n}X_{(n)} \neq X_{(n)} = \widehat{\theta}$.
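A small simulation makes the gap between the two estimators visible; the values of $\theta$, $n$, the number of repetitions and the seed below are arbitrary choices for illustration:

```python
import random

# Monte Carlo comparison of the MLE and the method of moments estimator
# for Uniform[0, theta]; parameter values here are illustrative choices.
random.seed(0)
theta, n, reps = 2.0, 10, 20000

mle_vals, mom_vals = [], []
for _ in range(reps):
    x_max = max(random.uniform(0, theta) for _ in range(n))
    mle_vals.append(x_max)                # MLE: X_(n)
    mom_vals.append((n + 1) / n * x_max)  # method of moments: (n+1)/n * X_(n)

print(sum(mle_vals) / reps)  # approx theta * n/(n+1) = 1.818..., the MLE is biased low
print(sum(mom_vals) / reps)  # approx theta = 2.0, the moment estimator is unbiased
```

The simulated averages match the theory: the MLE concentrates below $\theta$ at $\frac{n}{n+1}\theta$, while the rescaled estimator $\widehat{\theta}'$ is unbiased for $\theta$.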