Growing up, my siblings and I would read a lot of Bill Watterson's Calvin and Hobbes. I have fond memories of spending hours reading and re-reading our grandparents' collection during the school holidays. The comic strip is witty, heartwarming and beautifully drawn.
Seeing Calvin take this test took me back to those school holidays. I remember grabbing paper and a pencil to work out the answer (I also remember getting teased by those previously mentioned siblings). Reading the strip years later, I realized there’s a neat way to work out the answer without having to write much down. Here’s the question again,
Jack and Joe leave their homes at the same time and drive toward each other. Jack drives at 60 mph, while Joe drives at 30 mph. They pass each other in 10 minutes.
How far apart were Jack and Joe when they started?
It can be hard to think about Jack and Joe traveling at different speeds. But we can simplify the problem by thinking from Joe's perspective or frame of reference. From Joe's frame of reference, he is staying perfectly still but Jack is zooming towards him at $60 + 30 = 90$ mph. Jack reaches Joe in 10 minutes, which is one sixth of an hour. This means that the initial distance between them must have been $90 \times \frac{1}{6} = 15$ miles.
It’s not as creative as Calvin’s private eye approach but at least Susie and I got the same answer.
Suppose we have $n$ independent events $A_1, \ldots, A_n$, each of which occurs with probability $1 - p$. The event that all of the $A_i$ occur is $A = \bigcap_{i=1}^n A_i$. By using independence we can calculate the probability of $A$,

$\mathbb{P}(A) = \prod_{i=1}^n \mathbb{P}(A_i) = (1-p)^n.$

We could also get a lower bound on $\mathbb{P}(A)$ by using the union bound. This gives,

$\mathbb{P}(A) = 1 - \mathbb{P}\left(\bigcup_{i=1}^n A_i^c\right) \ge 1 - \sum_{i=1}^n \mathbb{P}(A_i^c) = 1 - np.$

We can therefore conclude that $(1-p)^n \ge 1 - np$. This is an instance of Bernoulli's inequality. More generally, Bernoulli's inequality states that

$(1+x)^r \ge 1 + rx,$

for all $x \ge -1$ and $r \ge 1$. This inequality states that the red line $y = 1 + rx$ is always underneath the black curve $y = (1+x)^r$ in the below picture. For an interactive version of this graph where you can change the value of $r$, click here.
Our probabilistic proof only applies to the case when $x$ is between $-1$ and $0$ and $r$ is a positive integer. The more general result can be proved by using convexity. For $r \ge 1$, the function $f(x) = (1+x)^r$ is convex on the set $[-1, \infty)$. The function $g(x) = 1 + rx$ is the tangent line of this function at the point $x = 0$. Convexity of $f$ means that the graph of $f$ is always above the tangent line $g$. This tells us that $(1+x)^r \ge 1 + rx$.
For $r$ between $0$ and $1$, the function $f(x) = (1+x)^r$ is no longer convex but actually concave and the inequality reverses. For $r \le 0$, $f$ becomes convex again. These two cases are visualized below. In the first picture $0 \le r \le 1$ and the red line is above the black one. In the second picture $r \le 0$ and the black line is back on top.
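If you'd like to check the three regimes numerically, here is a quick R snippet. The exponents $3$, $0.5$ and $-1$ are just arbitrary choices of $r$ from each regime:

# Spot check Bernoulli's inequality in its three regimes on a grid of x values
x <- seq(-0.9, 2, by = 0.1)
all((1 + x)^3 >= 1 + 3 * x)      # r = 3 >= 1: the curve lies above the line
all((1 + x)^0.5 <= 1 + 0.5 * x)  # 0 <= r <= 1: the inequality reverses
all((1 + x)^(-1) >= 1 - x)       # r = -1 <= 0: the curve is above again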
I used to have trouble understanding why inverting an $n \times n$ matrix required the same order of operations as solving an $n \times n$ system of linear equations. Specifically, if $A$ is an invertible $n \times n$ matrix and $b$ is a length $n$ vector, then computing $A^{-1}$ and solving the equation $Ax = b$ can both be done in $O(n^3)$ floating point operations (flops). This surprised me because naively computing the columns of $A^{-1}$ requires solving the $n$ systems of equations

$Ax_1 = e_1, \quad Ax_2 = e_2, \quad \ldots, \quad Ax_n = e_n,$

where $e_1, \ldots, e_n$ are the standard basis vectors. I thought this would mean that calculating $A^{-1}$ would require roughly $n$ times as many flops as solving a single system of equations. It was only in a convex optimization lecture that I realized what was going on.
To solve a single system of equations $Ax = b$, we first compute a factorization of $A$ such as the LU factorization. Computing this factorization takes $O(n^3)$ flops but once we have it, we can use it to solve any new system with $O(n^2)$ flops. Now to solve the $n$ systems $Ax_i = e_i$, we can simply factor $A$ once and then perform $n$ solves using the factorization, each of which requires an additional $O(n^2)$ flops. The total flop count is then $O(n^3) + n \cdot O(n^2) = O(n^3)$. Inverting the matrix requires the same order of flops as a single solve!
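Here is a small R sketch of the factor-once, solve-many idea. I've used a Cholesky factorization rather than LU, which assumes a positive definite matrix (constructed for the example):

# Factor once (O(n^3)), then reuse the factorization for many O(n^2) solves.
set.seed(1)
n <- 500
A <- crossprod(matrix(rnorm(n^2), n, n)) + n * diag(n)  # positive definite
R <- chol(A)  # A = t(R) %*% R with R upper triangular

solve_with_factor <- function(R, b) {
  # Two triangular solves, each O(n^2) flops
  backsolve(R, forwardsolve(t(R), b))
}

b <- rnorm(n)
x <- solve_with_factor(R, b)
max(abs(A %*% x - b))  # close to zero, so x solves the system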
Of course, as John D Cook points out: you shouldn't ever invert a matrix. Even if inverting and solving take the same order of flops, inverting is still more computationally expensive and requires more memory.
I recently gave a talk on the Yang-Baxter equation. The focus of the talk was to state the connection between the Yang-Baxter equation and the braid relation. This connection comes from a system of interacting particles. In this post, I’ll go over part of my talk. You can access the full set of notes here.
Interacting particles
Imagine two particles on a line, each with a state that can be any element of a set $X$. Suppose that the only way particles can change their states is by interacting with each other. An interaction occurs when two particles pass by each other. We could define a function $F : X \times X \to X \times X$ that describes how the states change after an interaction. Specifically, if the first particle is in state $x$ and the second particle is in state $y$, then their states after interacting will be

$F(x, y) = (F_1(x, y), F_2(x, y)),$

where $F_1, F_2 : X \times X \to X$ are the components of $F$. Recall that the particles move past each other when they interact. Thus, to keep track of the whole system we need an element of $X \times X$ to keep track of the states and a permutation to keep track of the positions.
Three particles
Now suppose that we have three particles labelled $1, 2, 3$. As before, each particle has a state in $X$. We can thus keep track of the states of the particles with an element of $X^3$. The particles also have positions which are described by a permutation $\sigma \in S_3$. The order of the entries of the state vector corresponds to the labels of the particles, not their positions. A possible configuration is shown below:
A possible configuration of the three particles, described by the states of the particles and a permutation recording their positions.
As before, the particles can interact with each other. However, we'll now add the restriction that the particles can only interact two at a time and interacting particles must have adjacent positions. When two particles interact, they swap positions and their states change according to $F$. The state and position of the remaining particle are unchanged. For example, in the above picture we could interact a pair of adjacent particles. This will produce the below configuration:
The new configuration after the interaction in the first diagram. The states and the permutation have been updated accordingly.
To keep track of how the states of the particles change over time we will introduce three functions from $X^3$ to $X^3$. These functions are $F_{12}$, $F_{13}$ and $F_{23}$. The function $F_{ij}$ is given by applying $F$ to the $i$ and $j$ coordinates of $X^3$ and acting by the identity on the remaining coordinate. In symbols,

$F_{12}(x_1, x_2, x_3) = (F_1(x_1, x_2), F_2(x_1, x_2), x_3),$
$F_{13}(x_1, x_2, x_3) = (F_1(x_1, x_3), x_2, F_2(x_1, x_3)),$
$F_{23}(x_1, x_2, x_3) = (x_1, F_1(x_2, x_3), F_2(x_2, x_3)).$

The function $F_{ij}$ exactly describes how the states of the three particles change when particles $i$ and $j$ interact. Now suppose that three particles begin in position $(1, 2, 3)$ with states $(x_1, x_2, x_3)$. We cannot directly interact particles $1$ and $3$ since they are not adjacent. We first have to pass one of these particles through particle $2$. This means that there are two ways we can interact particles $1$ and $3$. These are illustrated below.
The two ways of passing through particle 2 to interact particles 1 and 3.
In the top chain of interactions, we first interact particles $1$ and $2$. In this chain of interactions, the states evolve as follows:

$(x_1, x_2, x_3) \mapsto F_{12}(x_1, x_2, x_3) \mapsto (F_{13} \circ F_{12})(x_1, x_2, x_3) \mapsto (F_{23} \circ F_{13} \circ F_{12})(x_1, x_2, x_3).$

In the bottom chain of interactions, we first interact particles $2$ and $3$. On this chain, the states evolve in a different way:

$(x_1, x_2, x_3) \mapsto F_{23}(x_1, x_2, x_3) \mapsto (F_{13} \circ F_{23})(x_1, x_2, x_3) \mapsto (F_{12} \circ F_{13} \circ F_{23})(x_1, x_2, x_3).$

Note that after both of these chains of interactions the particles are in position $(3, 2, 1)$. The function $F$ is said to solve the Yang–Baxter equation if the two chains of interactions also result in the same states.
Definition: A function $F : X \times X \to X \times X$ is a solution to the set theoretic Yang–Baxter equation if,

$F_{23} \circ F_{13} \circ F_{12} = F_{12} \circ F_{13} \circ F_{23}.$
This equation can be visualized as the “braid relation” shown below. Here the strings represent the three particles and interacting two particles corresponds to crossing one string over the other.
The braid relation.
Here are some examples of solutions to the set theoretic Yang-Baxter equation,
The identity on $X \times X$.
The swap map, $F(x, y) = (y, x)$.
If $f, g : X \to X$ commute, then the function $F(x, y) = (f(x), g(y))$ is a solution to the Yang-Baxter equation.
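As a quick sanity check, the R snippet below verifies the equation at a single state vector for the last two examples. The helper apply_F and the particular test values are my own:

# Check F23 . F13 . F12 == F12 . F13 . F23 for two example solutions
apply_F <- function(F, s, i, j) {
  # Apply F to coordinates i and j of the state vector s, identity elsewhere
  out <- F(s[i], s[j])
  s[i] <- out[1]; s[j] <- out[2]
  s
}
lhs <- function(F, s) apply_F(F, apply_F(F, apply_F(F, s, 1, 2), 1, 3), 2, 3)
rhs <- function(F, s) apply_F(F, apply_F(F, apply_F(F, s, 2, 3), 1, 3), 1, 2)

swap <- function(x, y) c(y, x)          # the swap map
fg   <- function(x, y) c(2 * x, 3 * y)  # f(x) = 2x and g(y) = 3y commute

s <- c(1, 2, 3)
identical(lhs(swap, s), rhs(swap, s))  # TRUE
identical(lhs(fg, s), rhs(fg, s))      # TRUE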
In the full set of notes I talk about a number of extensions and variations of the Yang-Baxter equation. These include having more than three particles, allowing for the particle states to be entangled and the parametric Yang-Baxter equation.
If $p \ge 1$, then the $p$-norm $\|\cdot\|_p$ is convex. When $0 < p < 1$, this function is not convex and is actually concave when all the entries of $x$ are non-negative. On a recent exam for the course Convex Optimization, we were asked to prove this when $p = 1/2$. In this special case, the mathematics simplifies nicely.
When $p = 1/2$, the $p$-norm can be described as the squared sum of square roots. Specifically,

$\|x\|_{1/2} = \left(\sum_{i=1}^n \sqrt{|x_i|}\right)^2.$

Note that we can expand the square and rewrite the function as follows

$\|x\|_{1/2} = \sum_{i=1}^n \sum_{j=1}^n \sqrt{|x_i|}\sqrt{|x_j|} = \sum_{i,j=1}^n \sqrt{|x_i x_j|}.$

If we restrict to $x \in \mathbb{R}^n$ with $x_i \ge 0$ for all $i$, then this function simplifies to

$\|x\|_{1/2} = \sum_{i,j=1}^n \sqrt{x_i x_j},$
which is a sum of geometric means. The geometric mean function $g(u, v) = \sqrt{uv}$ is concave when $u, v \ge 0$. This can be proved by calculating the Hessian of this function and verifying that it is negative semi-definite.
From this, we can conclude that each function $x \mapsto \sqrt{x_i x_j}$ is also concave. This is because it is a linear function $x \mapsto (x_i, x_j)$ followed by a concave function. Finally, any sum of concave functions is also concave and thus $\|x\|_{1/2}$ is concave on the non-negative orthant.
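A quick numerical sanity check of this concavity claim in R (the random points and the weight $t = 0.3$ are arbitrary choices):

# Check f(t*x + (1-t)*y) >= t*f(x) + (1-t)*f(y) for f(x) = (sum(sqrt(x)))^2
f <- function(x) sum(sqrt(x))^2
set.seed(42)
x <- runif(5); y <- runif(5); t <- 0.3
f(t * x + (1 - t) * y) >= t * f(x) + (1 - t) * f(y)  # TRUE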
Hellinger distance
A similar argument can be used to show that the squared Hellinger distance is a convex function. The Hellinger distance $H(P, Q)$ is defined on pairs of probability distributions $P$ and $Q$ on a common set $\mathcal{X}$. For such a pair,

$H(P, Q)^2 = \frac{1}{2}\sum_{x \in \mathcal{X}} \left(\sqrt{P(x)} - \sqrt{Q(x)}\right)^2,$

which certainly doesn't look convex. However, we can expand the square and use the fact that $P$ and $Q$ are probability distributions. This shows us that the Hellinger distance can be written as

$H(P, Q)^2 = 1 - \sum_{x \in \mathcal{X}} \sqrt{P(x)Q(x)}.$

Again, each function $(P, Q) \mapsto \sqrt{P(x)Q(x)}$ is concave and so the negative of a sum of such functions is convex. Thus, $H(P, Q)^2$ is convex.
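In R, the expanded form gives a one-line way to compute the squared Hellinger distance (the function name and the example distributions are my own):

# Squared Hellinger distance via the expanded form 1 - sum(sqrt(P * Q))
hellinger_sq <- function(p, q) {
  1 - sum(sqrt(p * q))
}
p <- c(0.2, 0.3, 0.5)
q <- c(0.1, 0.4, 0.5)
hellinger_sq(p, q)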
The course
As a final comment, I’d just like to say how much I am enjoying the class. Prof. Stephen Boyd is a great lecturer and we’ve seen a wide variety of applications in the class. I’ve recently been reading a bit of John D Cook’s blog and agree with all he says about the course here.
Suppose you want to calculate an expression of the form

$\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right),$

where $\sum_{i=1}^n \exp(a_i) > \sum_{j=1}^m \exp(b_j)$. Such expressions can be difficult to evaluate directly since the exponentials can easily cause overflow errors. In this post, I'll talk about a clever way to avoid such errors.
If there were no terms in the second sum we could use the log-sum-exp trick. That is, to calculate

$\log\left(\sum_{i=1}^n \exp(a_i)\right),$

we set $a = \max_i a_i$ and use the identity

$\log\left(\sum_{i=1}^n \exp(a_i)\right) = a + \log\left(\sum_{i=1}^n \exp(a_i - a)\right).$

Since $a_i - a \le 0$ for all $i$, the right hand side of the above equation can be computed without the risk of overflow. To calculate,

$\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right),$

we can use the above method to separately calculate

$A = \log\left(\sum_{i=1}^n \exp(a_i)\right)$

and

$B = \log\left(\sum_{j=1}^m \exp(b_j)\right).$

The final result we want is

$\log\left(\exp(A) - \exp(B)\right) = A + \log\left(1 - \exp(B - A)\right).$

Since $A > B$, the right hand side of the above expression can be evaluated safely and we will have our final answer.
R code
The R code below defines a function that performs the above procedure
# Safely compute log(sum(exp(pos)) - sum(exp(neg)))
# The default value for neg is an empty vector.
logSumExp <- function(pos, neg = c()) {
  max_pos <- max(pos)
  A <- max_pos + log(sum(exp(pos - max_pos)))
  # If neg is empty, the calculation is done
  if (length(neg) == 0) {
    return(A)
  }
  # If neg is non-empty, calculate B.
  max_neg <- max(neg)
  B <- max_neg + log(sum(exp(neg - max_neg)))
  # Check that A is bigger than B
  if (A <= B) {
    stop("sum(exp(pos)) must be larger than sum(exp(neg))")
  }
  # log1p() is a built-in function that accurately calculates log(1+x) for |x| << 1
  return(A + log1p(-exp(B - A)))
}
An example
The above procedure can be used to evaluate expressions involving very large factorials and binomial coefficients. Evaluating such an expression directly would quickly lead to errors since R (and most other programming languages) cannot compute such large numbers in floating point arithmetic. However, R has the functions lfactorial() and lchoose() which can compute $\log(n!)$ and $\log\binom{n}{k}$ for large values of $n$ and $k$. We can thus put the expression in the general form at the start of this post.
The following R code thus gives us exactly what we want:
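For instance, a call of the following shape computes $\log\left(1000! - 2\binom{1000}{500}\right)$. This particular expression is my own illustrative choice:

# Compute log(1000! - 2 * choose(1000, 500)) without overflow
pos <- lfactorial(1000)             # log(1000!)
neg <- log(2) + lchoose(1000, 500)  # log(2 * choose(1000, 500))
logSumExp(pos, neg)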
The beta-binomial model is a Bayesian model used to analyze rates. For a great derivation and explanation of this model, I highly recommend watching the second lecture from Richard McElreath's course Statistical Rethinking. In this model, the data, $X$, is assumed to be binomially distributed with a fixed number of trials $n$ but an unknown rate $p$. The rate $p$ is given a $\text{Beta}(a, b)$ prior. That is, the prior distribution of $p$ has a density

$\pi(p) = \frac{1}{B(a, b)} p^{a-1}(1-p)^{b-1}, \quad 0 \le p \le 1,$

where $B(a, b)$ is a normalizing constant. The model can thus be written as

$p \sim \text{Beta}(a, b),$
$X \mid p \sim \text{Binomial}(n, p).$

This is a conjugate model, meaning that the posterior distribution of $p$ is again a beta distribution. This can be seen by using Bayes rule

$\pi(p \mid X = x) \propto \mathbb{P}(X = x \mid p)\pi(p) \propto p^x(1-p)^{n-x} \cdot p^{a-1}(1-p)^{b-1} = p^{a+x-1}(1-p)^{b+n-x-1}.$

The last expression is proportional to a beta density, specifically $\text{Beta}(a + x, b + n - x)$.
The marginal distribution of $X$
In the above model we are given the distribution of $p$ and the conditional distribution of $X$ given $p$. To calculate the distribution of $X$, we thus need to marginalize over $p$. Specifically,

$\mathbb{P}(X = x) = \int_0^1 \mathbb{P}(X = x \mid p)\pi(p)\,dp.$

The term inside the above integral is

$\binom{n}{x}p^x(1-p)^{n-x} \cdot \frac{1}{B(a, b)}p^{a-1}(1-p)^{b-1} = \binom{n}{x}\frac{1}{B(a, b)}p^{a+x-1}(1-p)^{b+n-x-1}.$

Thus,

$\mathbb{P}(X = x) = \binom{n}{x}\frac{1}{B(a, b)}\int_0^1 p^{a+x-1}(1-p)^{b+n-x-1}\,dp = \binom{n}{x}\frac{B(a+x, b+n-x)}{B(a, b)}.$

This distribution is called the beta-binomial distribution. Below is an image from Wikipedia showing a graph of $\mathbb{P}(X = x)$ for a fixed value of $n$ and a number of different values of $a$ and $b$. You can see that, especially for small values of $a$ and $b$, the distribution is a lot more spread out than the binomial distribution. This is because there is randomness coming from both $p$ and the binomial conditional distribution.
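One way to see the extra spread is to simulate the two-stage process in R (the parameter values below are arbitrary choices of mine):

# Sample from the beta-binomial by drawing p ~ Beta(a, b), then X | p ~ Bin(n, p)
set.seed(1)
n <- 10; a <- 2; b <- 2
p <- rbeta(1e5, a, b)
x <- rbinom(1e5, size = n, prob = p)
# Compare with a plain binomial whose success probability is the prior mean
x_bin <- rbinom(1e5, size = n, prob = a / (a + b))
var(x)      # noticeably larger...
var(x_bin)  # ...than the binomial variance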
Suppose we have two samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.
The equal variance t-test
In the equal variance t-test, we assume that $X_1, \ldots, X_n \sim \mathcal{N}(\mu_X, \sigma^2)$ and $Y_1, \ldots, Y_m \sim \mathcal{N}(\mu_Y, \sigma^2)$, where $\sigma^2$ is unknown. Our hypothesis that the two samples are from the same distribution becomes the hypothesis $\mu_X = \mu_Y$. The test statistic is

$T = \frac{\bar{X} - \bar{Y}}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{1}{m}}},$

where $\bar{X}$ and $\bar{Y}$ are the two sample means. The variable $\hat{\sigma}$ is the pooled estimate of the standard deviation and is given by

$\hat{\sigma}^2 = \frac{1}{n + m - 2}\left(\sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{j=1}^m (Y_j - \bar{Y})^2\right).$

Under the null hypothesis, $T$ follows a T-distribution with $n + m - 2$ degrees of freedom. We thus reject the null when $|T|$ exceeds the $1 - \alpha/2$ quantile of this T-distribution.
Pooling the data
We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let $Z_1, \ldots, Z_{n+m} = X_1, \ldots, X_n, Y_1, \ldots, Y_m$ be the pooled data and for $i = 1, \ldots, n+m$, define $D_i$ by

$D_i = \begin{cases} 1 & \text{if } i \le n, \\ 0 & \text{if } i > n. \end{cases}$

The assumptions that $X_i \sim \mathcal{N}(\mu_X, \sigma^2)$ and $Y_j \sim \mathcal{N}(\mu_Y, \sigma^2)$ can be rewritten as

$Z_i = \beta_0 + \beta_1 D_i + \varepsilon_i,$

where $\beta_0 = \mu_Y$, $\beta_1 = \mu_X - \mu_Y$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis $\mu_X = \mu_Y$ is equivalent to $\beta_1 = 0$. To test $\beta_1 = 0$ we can use the standard t-test for a coefficient in a linear model. The test statistic in this case is

$T' = \frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\left[(X^TX)^{-1}\right]_{22}}},$

where $\hat{\beta} = (X^TX)^{-1}X^TZ$ is the ordinary least squares estimate of $\beta = (\beta_0, \beta_1)$, $X$ is the design matrix and $\hat{\sigma}_{OLS}$ is an estimate of $\sigma$ given by

$\hat{\sigma}_{OLS}^2 = \frac{1}{n + m - 2}\sum_{i=1}^{n+m}\left(Z_i - \hat{Z}_i\right)^2,$

where $\hat{Z}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i$ is the fitted value of $Z_i$.
It turns out that $T'$ is exactly equal to $T$. We can see this by writing out the design matrix and calculating everything above. The design matrix has rows $(1, D_i)$ and is thus equal to

$X = \begin{pmatrix} 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{pmatrix},$

where the first $n$ rows are $(1, 1)$ and the remaining $m$ rows are $(1, 0)$. This implies that

$X^TX = \begin{pmatrix} n + m & n \\ n & n \end{pmatrix}.$

And therefore,

$(X^TX)^{-1} = \frac{1}{nm}\begin{pmatrix} n & -n \\ -n & n + m \end{pmatrix}.$

Thus, $\left[(X^TX)^{-1}\right]_{22} = \frac{n + m}{nm} = \frac{1}{n} + \frac{1}{m}$. So,

$T' = \frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n} + \frac{1}{m}}},$

which is starting to look like $T$ from the two-sample test. Now

$X^TZ = \begin{pmatrix} \sum_{i=1}^{n+m} Z_i \\ \sum_{i=1}^{n} Z_i \end{pmatrix} = \begin{pmatrix} n\bar{X} + m\bar{Y} \\ n\bar{X} \end{pmatrix}.$

And so

$\hat{\beta} = (X^TX)^{-1}X^TZ = \frac{1}{nm}\begin{pmatrix} n & -n \\ -n & n + m \end{pmatrix}\begin{pmatrix} n\bar{X} + m\bar{Y} \\ n\bar{X} \end{pmatrix} = \begin{pmatrix} \bar{Y} \\ \bar{X} - \bar{Y} \end{pmatrix}.$

Thus, $\hat{\beta}_1 = \bar{X} - \bar{Y}$ and

$T' = \frac{\bar{X} - \bar{Y}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n} + \frac{1}{m}}}.$

This means that to show $T' = T$, we only need to show that $\hat{\sigma}_{OLS} = \hat{\sigma}$. To do this, note that the fitted values are equal to

$\hat{Z}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i = \begin{cases} \bar{X} & \text{if } i \le n, \\ \bar{Y} & \text{if } i > n. \end{cases}$

Thus,

$\hat{\sigma}_{OLS}^2 = \frac{1}{n + m - 2}\sum_{i=1}^{n+m}\left(Z_i - \hat{Z}_i\right)^2 = \frac{1}{n + m - 2}\left(\sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{j=1}^m (Y_j - \bar{Y})^2\right),$

which is exactly $\hat{\sigma}^2$. Therefore, $T' = T$ and the two sample t-test is equivalent to a correlation test.
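This equivalence is easy to check in R (the sample sizes and means below are arbitrary):

# The equal-variance two-sample t-test agrees with the regression t-test
set.seed(1)
x <- rnorm(20, mean = 1)
y <- rnorm(30)
z <- c(x, y)                    # pooled data
d <- c(rep(1, 20), rep(0, 30))  # dummy variable
t.test(x, y, var.equal = TRUE)$statistic
summary(lm(z ~ d))$coefficients["d", "t value"]  # the same value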
The Friedman-Rafsky test
In the above example, we saw that the two sample t-test was a special case of the t-test for regression coefficients. This is neat but both tests make very strong assumptions about the data. However, the same thing happens in a more interesting non-parametric setting.
In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two sample test that makes no assumptions about the distribution of the data. The two samples do not even have to be real-valued and can instead be from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.
A Monte Carlo significance test of the null hypothesis $H_0 : X \sim \pi$ requires creating independent samples $Y_1, \ldots, Y_N \sim \pi$. The idea is that if $X \sim \pi$ and, independently, $Y_1, \ldots, Y_N$ are i.i.d. from $\pi$, then for any test statistic $t$, the rank of $t(X)$ among the values $t(X), t(Y_1), \ldots, t(Y_N)$ is uniformly distributed on $\{1, \ldots, N+1\}$. This means that if $t(X)$ is one of the $k$ largest values of $t$, then we can reject the hypothesis $H_0$ at the significance level $k/(N+1)$.
The advantage of Monte Carlo significance tests is that we do not need an analytic expression for the distribution of $t(X)$ under $H_0$. By generating the i.i.d. samples $Y_1, \ldots, Y_N$, we are making an empirical distribution that approximates the theoretical distribution. However, sometimes sampling from $\pi$ is just as intractable as theoretically studying the distribution of $t(X)$. Often, approximate samples based on Markov chain Monte Carlo (MCMC) are used instead. However, these samples are not independent and may not be sampled from the true distribution. This means that a test using MCMC may not be statistically valid.
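As a concrete illustration, here is a plain Monte Carlo significance test in R for a toy null hypothesis; the null distribution, observed value and statistic are my own choices:

# Monte Carlo test of H0: X ~ Exp(1) with test statistic t(x) = x
set.seed(1)
x <- 3.2        # the observed data
y <- rexp(999)  # N = 999 independent samples from the null
# p-value: proportion of the N + 1 values t(X), t(Y_1), ..., t(Y_N)
# that are at least t(X)
sum(c(x, y) >= x) / 1000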
In the 1989 paper Generalized Monte Carlo significance tests, Besag and Clifford propose two methods that solve this exact problem. Their two methods can be used in the same settings where MCMC is used but they are statistically valid and correctly control the Type 1 error. In this post, I will describe just one of the methods – the serial test.
Background on Markov chains
To describe the serial test we will need to introduce some notation. Let $P$ denote a transition matrix for a Markov chain on a discrete state space $\mathcal{X}$. A Markov chain with transition matrix $P$ thus satisfies,

$\mathbb{P}(X_{t+1} = x_{t+1} \mid X_t = x_t, \ldots, X_0 = x_0) = P(x_t, x_{t+1}).$

Suppose that the transition matrix $P$ has a stationary distribution $\pi$. This means that if $(X_t)_{t \ge 0}$ is a Markov chain with transition matrix $P$ and $X_0$ is distributed according to $\pi$, then $X_1$ is also distributed according to $\pi$. This implies that all $X_t$ are distributed according to $\pi$.
We can construct a new transition matrix $Q$ from $P$ and $\pi$ by

$Q(y, x) = \frac{\pi(x)P(x, y)}{\pi(y)}.$

The transition matrix $Q$ is called the reversal of $P$. This is because for all $x$ and $y$ in $\mathcal{X}$, $\pi(x)P(x, y) = \pi(y)Q(y, x)$. That is, the chance of drawing $x$ from $\pi$ and then transitioning to $y$ according to $P$ is equal to the chance of drawing $y$ from $\pi$ and then transitioning to $x$ according to $Q$.
The new transition matrix $Q$ also allows us to reverse longer runs of the Markov chain. Fix $M \ge 1$ and let $(X_0, X_1, \ldots, X_M)$ be a Markov chain with transition matrix $P$ and initial distribution $\pi$. Also let $(Y_0, Y_1, \ldots, Y_M)$ be a Markov chain with transition matrix $Q$ and initial distribution $\pi$, then

$(X_0, X_1, \ldots, X_M) \overset{d}{=} (Y_M, Y_{M-1}, \ldots, Y_0),$

where $\overset{d}{=}$ means equal in distribution.
The serial test
Suppose we want to test the hypothesis $H_0 : X \sim \pi$ where $X$ is our observed data and $\pi$ is some distribution on $\mathcal{X}$. To conduct the serial test, we need to construct a transition matrix $P$ for which $\pi$ is a stationary distribution. We then also need to construct the reversal $Q$ described above. There are many possible ways to construct $P$, such as the Metropolis-Hastings algorithm.
We also need a test statistic. This is a function $t : \mathcal{X} \to \mathbb{R}$ which we will use to detect outliers. This function is the same function we would use in a regular Monte Carlo test. Namely, we want to reject the null hypothesis when $t(X)$ is much larger than we would expect under $\pi$.
The serial test then proceeds as follows. First we pick $\xi$ uniformly in $\{1, \ldots, N+1\}$ and set $Y_\xi = X$. We then generate $Y_{\xi+1}, \ldots, Y_{N+1}$ as a Markov chain with transition matrix $P$ that starts at $Y_\xi$. Likewise we generate $Y_{\xi-1}, \ldots, Y_1$ as a Markov chain with transition matrix $Q$ that starts from $Y_\xi$.
We then apply $t$ to each of $Y_1, \ldots, Y_{N+1}$ and count the number of indices $j \ne \xi$ such that $t(Y_j) \ge t(Y_\xi)$. If there are $k$ such indices, then the reported p-value of our test is $(k+1)/(N+1)$.
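Here is a toy R sketch of the serial test (my own example, not one from the paper). The null distribution is uniform on a cycle of $K$ states and $P$ is a lazy random walk on the cycle. Since this walk is reversible with respect to the uniform distribution, the reversal satisfies $Q = P$:

set.seed(1)
K <- 20  # states 1, ..., K arranged in a cycle
N <- 99  # so the test uses N + 1 = 100 values
step <- function(x) {
  # One move of the lazy random walk on the cycle
  ((x - 1 + sample(c(-1, 0, 1), 1)) %% K) + 1
}
run_chain <- function(x0, n_steps) {
  out <- numeric(n_steps)
  for (i in seq_len(n_steps)) { x0 <- step(x0); out[i] <- x0 }
  out
}
serial_test <- function(x, t_stat) {
  xi <- sample(N + 1, 1)                 # uniform position for the data
  forward <- run_chain(x, N + 1 - xi)    # run forwards with P
  backward <- run_chain(x, xi - 1)       # run backwards with Q (= P here)
  ys <- c(rev(backward), x, forward)
  k <- sum(t_stat(ys) >= t_stat(x)) - 1  # exclude Y_xi = x itself
  (k + 1) / (N + 1)                      # the reported p-value
}
t_stat <- function(y) abs(y - (K + 1) / 2)  # an arbitrary test statistic
serial_test(5, t_stat)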
We will now show that this test produces a valid p-value. That is, when $X \sim \pi$, the probability that the reported p-value is less than $\alpha$ is at most $\alpha$. In symbols,

$\mathbb{P}\left(\frac{k+1}{N+1} \le \alpha\right) \le \alpha.$

Under the null hypothesis $X \sim \pi$, the sequence $(Y_1, \ldots, Y_{N+1})$ is equal in distribution to generating $Y_1$ from $\pi$ and using the transition matrix $P$ to go from $Y_i$ to $Y_{i+1}$.
Thus, under the null hypothesis, the distribution of $(Y_1, \ldots, Y_{N+1})$ does not depend on $\xi$. The entire procedure is equivalent to generating a Markov chain $(Y_1, \ldots, Y_{N+1})$ with initial distribution $\pi$ and transition matrix $P$, and then choosing $\xi$ independently of $(Y_1, \ldots, Y_{N+1})$. This is enough to show that the serial method produces valid p-values. The idea is that since the distribution of $(Y_1, \ldots, Y_{N+1})$ does not depend on $\xi$ and $\xi$ is uniformly distributed on $\{1, \ldots, N+1\}$, the probability that $t(Y_\xi)$ is in the top $\alpha$ proportion of $t(Y_1), \ldots, t(Y_{N+1})$ should be at most $\alpha$. This is proved more formally below.
For each $j \in \{1, \ldots, N+1\}$, let $A_j$ be the event that $t(Y_j)$ is in the top $\alpha$ proportion of $t(Y_1), \ldots, t(Y_{N+1})$. That is,

$A_j = \left\{\#\{i : t(Y_i) \ge t(Y_j)\} \le \alpha(N+1)\right\}.$

Let $I_j$ be the indicator function for the event $A_j$. Since at most $\alpha(N+1)$ values of $j$ can be in the top $\alpha$ fraction of $t(Y_1), \ldots, t(Y_{N+1})$, we have that

$\sum_{j=1}^{N+1} I_j \le \alpha(N+1).$

Therefore, by linearity of expectation,

$\sum_{j=1}^{N+1} \mathbb{P}(A_j) \le \alpha(N+1).$

By the law of total probability we have,

$\mathbb{P}(A_\xi) = \sum_{j=1}^{N+1} \mathbb{P}(A_\xi \mid \xi = j)\mathbb{P}(\xi = j).$

Since $\xi$ is uniform on $\{1, \ldots, N+1\}$, $\mathbb{P}(\xi = j)$ is $\frac{1}{N+1}$ for all $j$. Furthermore, by independence of $\xi$ and $(Y_1, \ldots, Y_{N+1})$, we have

$\mathbb{P}(A_\xi \mid \xi = j) = \mathbb{P}(A_j).$

Thus, by our previous bound,

$\mathbb{P}(A_\xi) = \frac{1}{N+1}\sum_{j=1}^{N+1}\mathbb{P}(A_j) \le \alpha.$
Applications
In the original paper by Besag and Clifford, the authors discuss how this procedure can be used to perform goodness-of-fit tests. They construct Markov chains that can test the Rasch model or the Ising model. More broadly, the method can be used to perform goodness-of-fit tests for any exponential family by using the Markov chains developed by Diaconis and Sturmfels.
A similar method has also been applied more recently to detect gerrymandering. In this setting, the null hypothesis is the uniform distribution on all valid redistrictings of a state and the test statistics measure the political advantage of a given districting.
A few months ago, I had the pleasure of reading Eugenia Cheng's book How to Bake Pi. Each chapter starts with a recipe which Cheng links to the mathematical concepts contained in the chapter. The book is full of interesting connections between mathematics and the rest of the world.
One of my favourite ideas in the book is something Cheng writes about equations and the humble equals sign "$=$". She explains that when an equation says two things are equal we very rarely mean that they are exactly the same thing. What we really mean is that the two things are the same in some ways even though they may be different in others.
One example that Cheng gives is the equation $x + y = y + x$. This is such a familiar statement that you might really think that $x + y$ and $y + x$ are the same thing. Indeed, if $x$ and $y$ are any numbers, then the number you get when you calculate $x + y$ is the same as the number you get when you calculate $y + x$. But calculating $x + y$ could be very different from calculating $y + x$. A young child might calculate $x + y$ by starting with $x$ and then counting up one-by-one $y$ times. If $x$ is $2$ and $y$ is $20$, then calculating $x + y$ requires counting from $2$ all the way up to $22$, but calculating $y + x$ simply amounts to counting from $20$ up to $22$. The first process takes way longer than the second and the child might disagree that $x + y$ is the same as $y + x$.
In How to Bake Pi, Cheng explains that a crucial idea behind equality is context. When someone says that two things are equal we really mean that they are equal in the context we care about. Cheng talks about how context is crucial throughout mathematics and introduces a little bit of category theory as a tool for moving between different contexts. I think that this idea of context is really illuminating and I wanted to share some examples where "$=$" doesn't mean "exactly the same as".
The Sherman-Morrison formula
The Sherman-Morrison formula is a result from linear algebra that says for any invertible matrix $A \in \mathbb{R}^{n \times n}$ and any pair of vectors $u, v \in \mathbb{R}^n$, if $1 + v^TA^{-1}u \ne 0$, then $A + uv^T$ is invertible and

$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}.$

Here "$=$" means the following:
You can take any natural number $n$, any matrix $A$ of size $n$ by $n$, and any length $n$ vectors $u$ and $v$ that satisfy the above condition.
If you take all those things and carry out all the matrix multiplications, additions and inversions on the left and all the matrix multiplications, additions and inversions on the right, then you will end up with exactly the same matrix in both cases.
But depending on the context, the expression on one side of "$=$" may be much easier to work with than the other. Although the right hand side looks a lot more complicated, it is much easier to compute in one important context. This context is when we have already calculated the matrix $A^{-1}$ and now want the inverse of $A + uv^T$. The left hand side naively computes $(A + uv^T)^{-1}$ which takes $O(n^3)$ operations since we have to invert an $n \times n$ matrix. On the right hand side, we only need to compute a small number of matrix-vector products and then add two matrices together. This brings the computational cost down to $O(n^2)$.
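A quick numerical check of the formula in R on a random instance:

# Verify the Sherman-Morrison formula on a random 5 x 5 example
set.seed(1)
n <- 5
A <- matrix(rnorm(n^2), n, n)
u <- rnorm(n); v <- rnorm(n)
A_inv <- solve(A)
lhs <- solve(A + u %*% t(v))
rhs <- A_inv - (A_inv %*% u %*% t(v) %*% A_inv) / c(1 + t(v) %*% A_inv %*% u)
max(abs(lhs - rhs))  # close to zero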
These cost saving measures come up a lot when studying linear regression. The Sherman-Morrison formula can be used to update regression coefficients when a new data point is added. Similarly, the Sherman-Morrison formula can be used to quickly calculate the fitted values in leave-one-out cross validation.
log-sum-exp
This second example also has connections to statistics. In a mixture model, we assume that each data point $y_i$ comes from a distribution of the form:

$p(y \mid \pi, \theta) = \sum_{k=1}^K \pi_k p(y \mid \theta_k),$

where $\pi$ is a vector of class proportions and $\pi_k$ is equal to the probability that $y$ came from class $k$. The parameters $\theta_k$ are the parameters for the $k$th group. The log-likelihood is thus,

$\ell(\pi, \theta) = \sum_{i=1}^n \log\left(\sum_{k=1}^K \exp(b_{ik})\right),$

where $b_{ik} = \log\left(\pi_k p(y_i \mid \theta_k)\right)$. We can see that the log-likelihood is of the form log-sum-exp. Calculating a log-sum-exp can cause issues with numerical stability. For instance if $K = 2$ and $b_{ik} = -1000$ for all $i$ and $k$, then the final answer is simply $n(\log(2) - 1000)$. However, as soon as we try to calculate $\exp(-1000)$ on a computer, we'll be in trouble.
The solution is to use the following equality: for any $a \in \mathbb{R}$,

$\log\left(\sum_{k=1}^K \exp(b_k)\right) = a + \log\left(\sum_{k=1}^K \exp(b_k - a)\right).$
Proving the above identity is a nice exercise in the laws of logarithms and exponentials, and with a clever choice of $a$ we can more safely compute the log-sum-exp expression. For instance, in the documentation for pytorch's implementation of logsumexp() they take $a$ to be the maximum of $b_1, \ldots, b_K$. This (hopefully) makes each of the terms $\exp(b_k - a)$ a reasonable size and avoids any numerical issues.
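For example, in R:

b <- rep(-1000, 2)
log(sum(exp(b)))          # -Inf: exp(-1000) underflows to 0
a <- max(b)
a + log(sum(exp(b - a)))  # -999.3069, the correct value -1000 + log(2)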
Again, the left and right hand sides of the above equation might be the same number, but in the context of having to use computers with limited precision, they represent very different calculations.
Beyond How to Bake Pi
Eugenia Cheng has recently published a new book called The Joy of Abstraction. I’m just over half way through and it’s been a really engaging and interesting introduction to category theory. I’m looking forward to reading the rest of it and getting more insight from Eugenia Cheng’s great mathematical writing.