## How to Bake Pi, Sherman-Morrison and log-sum-exp

A few months ago, I had the pleasure of reading Eugenia Cheng’s book How to Bake Pi. Each chapter starts with a recipe which Cheng links to the mathematical concepts contained in the chapter. The book is full of interesting connections between mathematics and the rest of the world.

One of my favourite ideas in the book is something Cheng writes about equations and the humble equals sign: $=$. She explains that when an equation says two things are equal we very rarely mean that they are exactly the same thing. What we really mean is that the two things are the same in some ways even though they may be different in others.

One example that Cheng gives is the equation $a + b = b+a$. This is such a familiar statement that you might really think that $a+b$ and $b+a$ are the same thing. Indeed, if $a$ and $b$ are any numbers, then the number you get when you calculate $a + b$ is the same as the number you get when you calculate $b + a$. But calculating $a+b$ could be very different from calculating $b+a$. A young child might calculate $a+b$ by starting with $a$ and then counting one-by-one from $a$ to $a + b$. If $a$ is $1$ and $b$ is $20$, then calculating $a + b$ requires counting from $1$ to $21$ but calculating $b+a$ simply amounts to counting from $20$ to $21$. The first process takes way longer than the second and the child might disagree that $1 + 20$ is the same as $20 + 1$.

In How to Bake Pi, Cheng explains that a crucial idea behind equality is context. When we say that two things are equal, we really mean that they are equal in the context we care about. Cheng talks about how context is crucial throughout mathematics and introduces a little bit of category theory as a tool for moving between different contexts. I think that this idea of context is really illuminating and I wanted to share some examples where “$=$” doesn’t mean “exactly the same as”.

### The Sherman-Morrison formula

The Sherman-Morrison formula is a result from linear algebra that says for any invertible matrix $A \in \mathbb{R}^{n\times n}$ and any pair of vectors $u,v \in \mathbb{R}^n$, if $v^TA^{-1}u \neq -1$, then $A + uv^T$ is invertible and

$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}$.

Here “$=$” means the following:

1. You can take any natural number $n$, any invertible matrix $A$ of size $n$ by $n$, and any vectors $u$ and $v$ of length $n$ that satisfy the above condition.
2. If you take all those things and carry out all the matrix multiplications, additions and inversions on the left and all the matrix multiplications, additions and inversions on the right, then you will end up with exactly the same matrix in both cases.

But depending on the context, the expression on one side of “$=$” may be much easier to compute than the other. Although the right hand side looks a lot more complicated, it is much easier to compute in one important context. This context is when we have already calculated the matrix $A^{-1}$ and now want the inverse of $A + uv^T$. The left hand side naively forms $A + uv^T$ and then inverts it, which takes $O(n^3)$ operations since we have to invert an $n \times n$ matrix. On the right hand side, we only need to compute a couple of matrix-vector products and one outer product, and then combine two matrices. This brings the computational cost down to $O(n^2)$.
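To make the two sides concrete, here is a minimal NumPy sketch. The function name `sherman_morrison_update` and the random test matrices are illustrative choices of mine, not part of any library, and the diagonal shift is only there to keep the example matrix comfortably invertible. Note that the rank-one correction is subtracted from $A^{-1}$.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return the inverse of A + u v^T, given A_inv = A^{-1}, using O(n^2) work."""
    A_inv_u = A_inv @ u          # A^{-1} u, a matrix-vector product
    v_A_inv = v @ A_inv          # v^T A^{-1}, a vector-matrix product
    denom = 1.0 + v @ A_inv_u    # the scalar 1 + v^T A^{-1} u
    if np.isclose(denom, 0.0):
        raise ValueError("1 + v^T A^{-1} u is numerically zero, so A + u v^T is singular")
    # Subtract the rank-one correction from A^{-1}.
    return A_inv - np.outer(A_inv_u, v_A_inv) / denom

# Hypothetical example data (assumption: a small, well-conditioned A).
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A_inv = np.linalg.inv(A)                    # assumed to be already available
fast = sherman_morrison_update(A_inv, u, v) # right hand side, O(n^2)
slow = np.linalg.inv(A + np.outer(u, v))    # left hand side, O(n^3)
print(np.allclose(fast, slow))
```

Both routes should agree up to floating point error. The difference is that `fast` reuses the inverse we already had, while `slow` starts the inversion from scratch.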

These cost-saving measures come up a lot when studying linear regression. The Sherman-Morrison formula can be used to update regression coefficients when a new data point is added. Similarly, the Sherman-Morrison formula can be used to quickly calculate the fitted values in leave-one-out cross validation.

### log-sum-exp

This second example also has connections to statistics. In a mixture model, we assume that each data point $Y$ comes from a distribution of the form:

$p(y|\pi,\theta) = \sum_{k=1}^K \pi_k p(y | \theta_k)$,

where $\pi$ is a vector and $\pi_k$ is equal to the probability that $Y$ came from class $k$. The parameters $\theta_k \in\mathbb{R}^p$ are the parameters for the $k^{th}$ class. The log-likelihood of a single observation $y$ is thus

$\log\left(\sum_{k=1}^K \pi_k p(y | \theta_k)\right) = \log\left(\sum_{k=1}^K \exp(\eta_{k})\right)$,

where $\eta_{k} = \log(\pi_k p(y| \theta_k))$. We can see that the log-likelihood is of the form log-sum-exp. Calculating a log-sum-exp can cause issues with numerical stability. For instance, if $K = 3$ and $\eta_k = 1000$ for all $k=1,2,3$, then the final answer is simply $\log(3)+1000$. However, as soon as we try to calculate $\exp(1000)$ on a computer, we’ll be in trouble: it is far larger than the largest double-precision number (about $1.8 \times 10^{308}$), so it overflows to infinity.

The solution is to use the following equality, for any $\beta \in \mathbb{R}$,

$\log\left(\sum_{k=1}^K \exp(\eta_k) \right) = \beta + \log\left(\sum_{k=1}^K \exp(\eta_k - \beta)\right)$.

Proving the above identity is a nice exercise in the laws of logarithms and exponentials. More importantly, with a clever choice of $\beta$ we can compute the log-sum-exp expression more safely. For instance, in the documentation for PyTorch’s implementation of logsumexp() they take $\beta$ to be the maximum of $\eta_k$. This makes every term $\eta_k - \beta$ at most zero, so no exponential can overflow, and at least one term is exactly zero, so the sum is at least $1$ and the logarithm is well behaved.
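Here is a small NumPy sketch of the trick, applied to the $K = 3$, $\eta_k = 1000$ example from earlier. The function names `logsumexp_naive` and `logsumexp_stable` are my own, and this is only an illustration of subtracting the maximum from each $\eta_k$ before exponentiating, not a copy of any library's implementation.

```python
import numpy as np

def logsumexp_naive(eta):
    """Direct translation of log(sum(exp(eta))); overflows for large eta."""
    return np.log(np.sum(np.exp(eta)))

def logsumexp_stable(eta):
    """log(sum(exp(eta))) computed after shifting by beta = max(eta)."""
    eta = np.asarray(eta, dtype=float)
    beta = np.max(eta)
    if not np.isfinite(beta):
        # The maximum is +inf, -inf or nan: return it unchanged.
        return beta
    # Every eta - beta is <= 0, so each exponential lies in (0, 1].
    return beta + np.log(np.sum(np.exp(eta - beta)))

eta = np.array([1000.0, 1000.0, 1000.0])

with np.errstate(over="ignore"):      # silence the overflow warning
    print(logsumexp_naive(eta))       # exp(1000) overflows, so this is inf
print(logsumexp_stable(eta))          # finite
print(1000 + np.log(3))               # the exact answer, for comparison
```

For moderate inputs such as `[1.0, 2.0, 3.0]` the two functions agree, which is the sense in which the two sides of the identity are "equal". They only part ways once the exponentials stop fitting in a floating point number.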

Again, the left and right hand sides of the above equation might be the same number, but in the context of having to use computers with limited precision, they represent very different calculations.

### Beyond How to Bake Pi

Eugenia Cheng has recently published a new book called The Joy of Abstraction. I’m just over halfway through and it’s been a really engaging and interesting introduction to category theory. I’m looking forward to reading the rest of it and getting more insight from Eugenia Cheng’s great mathematical writing.