A not so simple conditional expectation

It is winter 2022 and my PhD cohort has moved on to the second quarter of our first year statistics courses. This means we’ll be learning about generalised linear models in our applied course, asymptotic statistics in our theory course and conditional expectations and martingales in our probability course.

In the first week of our probability course we’ve been busy defining and proving the existence of the conditional expectation. Our approach has been similar to how we constructed the Lebesgue integral last quarter. There, we first defined the Lebesgue integral for simple functions, then used a limiting argument to define the Lebesgue integral for non-negative functions and finally defined the Lebesgue integral for arbitrary functions by considering their positive and negative parts.

Our construction of the conditional expectation followed the same outline but the journey was different. We again started with simple random variables, then progressed to non-negative random variables and then proved the existence of the conditional expectation of any integrable random variable. Unlike with the Lebesgue integral, the hardest step was proving the existence of the conditional expectation of a simple random variable. Progressing from simple random variables to arbitrary random variables was a straightforward application of the monotone convergence theorem and linearity of expectation. But to prove the existence of the conditional expectation of a simple random variable we needed to work with projections in the Hilbert space L^2(\Omega, \mathcal{F}, \mathbb{P}).

Unlike the Lebesgue integral, defining the conditional expectation of a simple random variable is not straightforward. One reason for this is that the conditional expectation of a simple random variable need not itself be simple. This comment was made offhand by our Professor and sparked my curiosity. The following example is what I came up with. Below I first go over some definitions and then we dive into the example.

A simple random variable with a conditional expectation that is not simple

Let (\Omega, \mathcal{F}, \mathbb{P}) be a probability space and let \mathcal{G} \subseteq \mathcal{F} be a sub-\sigma-algebra. The conditional expectation of an integrable random variable X is a random variable \mathbb{E}(X|\mathcal{G}) that satisfies the following two conditions:

  1. The random variable \mathbb{E}(X|\mathcal{G}) is \mathcal{G}-measurable.
  2. For all B \in \mathcal{G}, \mathbb{E}[X1_B] = \mathbb{E}[\mathbb{E}(X|\mathcal{G})1_B], where 1_B is the indicator function of B.

The conditional expectation of an integrable random variable always exists and is unique up to almost sure equality. One can think of \mathbb{E}(X|\mathcal{G}) as the expected value of X given the information in \mathcal{G}.

A simple random variable is a random variable X that takes only finitely many values. Simple random variables are always integrable and so \mathbb{E}(X|\mathcal{G}) always exists, but we will see that \mathbb{E}(X|\mathcal{G}) need not be simple.

Consider a random vector (U,V) uniformly distributed on the square [-1,1]^2 \subseteq \mathbb{R}^2. Let D be the unit disc D = \{(u,v) \in \mathbb{R}^2 : u^2+v^2 \le 1\}. The random variable X = 1_D(U,V) is a simple random variable since X equals 1 if (U,V) \in D and X equals 0 otherwise. Let \mathcal{G} = \sigma(U), the \sigma-algebra generated by U. It turns out that

\mathbb{E}(X|\mathcal{G}) = \sqrt{1-U^2}.

Thus \mathbb{E}(X|\mathcal{G}) is not a simple random variable. Let Y = \sqrt{1-U^2}. Since Y is a continuous function of U, the random variable Y is \mathcal{G}-measurable. Thus Y satisfies condition 1. Furthermore, if B \in \mathcal{G}, then B = \{U \in A\} for some measurable set A \subseteq [-1,1]. Thus X1_B equals 1 if and only if U \in A and V \in [-\sqrt{1-U^2}, \sqrt{1-U^2}]. Since (U,V) is uniformly distributed we thus have

\mathbb{E}[X1_B] = \int_A \int_{-\sqrt{1-u^2}}^{\sqrt{1-u^2}} \frac{1}{4}dvdu = \int_A \frac{1}{2}\sqrt{1-u^2}du.

The random variable U is uniformly distributed on [-1,1] and thus has density \frac{1}{2}1_{[-1,1]}. Therefore,

\mathbb{E}[Y1_B] = \mathbb{E}[\sqrt{1-U^2}1_{\{U \in A\}}] = \int_A \frac{1}{2}\sqrt{1-u^2}du.

Thus \mathbb{E}[X1_B] = \mathbb{E}[Y1_B] and therefore Y = \sqrt{1-U^2} equals \mathbb{E}(X|\mathcal{G}). Intuitively we can see this because given U=u, we know that X is 1 when V \in [-\sqrt{1-u^2},\sqrt{1-u^2}] and that X is 0 otherwise. Since V is uniformly distributed on [-1,1], the probability that V is in [-\sqrt{1-u^2},\sqrt{1-u^2}] is \sqrt{1-u^2}. Thus given U=u, the expected value of X is \sqrt{1-u^2}.
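
As a sanity check, here is a small Monte Carlo simulation, a sketch of my own rather than part of the proof (the sample size and bin width are arbitrary choices). It estimates \mathbb{E}(X|U=u) by averaging X over samples with U near u and compares the estimate with \sqrt{1-u^2}.

```python
# Estimate E(X | U = u) empirically and compare with sqrt(1 - u^2).
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
U = rng.uniform(-1, 1, size=N)
V = rng.uniform(-1, 1, size=N)
X = (U**2 + V**2 <= 1).astype(float)   # X = 1_D(U, V)

for u in [0.0, 0.5, 0.9]:
    mask = np.abs(U - u) < 0.01        # samples with U approximately equal to u
    print(f"u = {u}: estimate = {X[mask].mean():.3f}, "
          f"sqrt(1 - u^2) = {np.sqrt(1 - u**2):.3f}")
```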

An extension

The previous example suggests an extension that shows just how “complicated” the conditional expectation of a simple random variable can be. I’ll state the extension as an exercise:

Let f:[-1,1]\to \mathbb{R} be any continuous function with f(x) \in [0,1]. With (U,V) and \mathcal{G} as above, show that there exists a measurable set A \subseteq [-1,1]^2 such that \mathbb{E}(1_A(U,V)|\mathcal{G}) = f(U).

Extremal couplings

This post is inspired by an assignment question I had to answer for STATS 310A – a probability course at Stanford for first year students in the statistics PhD program. In the question we had to derive a few results about couplings. I found myself thinking and talking about the question long after submitting the assignment and decided to put my thoughts on paper. I would like to thank our lecturer Prof. Diaconis for answering my questions and pointing me in the right direction.

What are couplings?

Given two distribution functions F and G on \mathbb{R}, a coupling of F and G is a distribution function H on \mathbb{R}^2 such that the marginals of H are F and G. Couplings can be used to give probabilistic proofs of analytic statements about F and G. Couplings are also studied in their own right in the theory of optimal transport.

We can think of F and G as being the cumulative distribution functions of some random variables X and Y. A coupling H of F and G thus corresponds to a random vector (\widetilde{X},\widetilde{Y}) where \widetilde{X} has the same distribution as X, \widetilde{Y} has the same distribution as Y and (\widetilde{X},\widetilde{Y})  \sim H.

The independent coupling

For two given distribution functions F and G there exist many possible couplings. For example we could take H = H_I where H_I(x,y) = F(x)G(y). This coupling corresponds to a random vector (\widetilde{X}_I,\widetilde{Y}_I) where \widetilde{X}_I and \widetilde{Y}_I are independent and (as is required for all couplings) \widetilde{X}_I \stackrel{\text{dist}}{=} X, \widetilde{Y}_I \stackrel{\text{dist}}{=} Y.

In some sense the coupling H_I is in the “middle” of all couplings. This is because \widetilde{X}_I and \widetilde{Y}_I are independent and so \widetilde{X}_I doesn’t carry any information about \widetilde{Y}_I. As the title of the post suggests, there are couplings where this isn’t the case and \widetilde{X} carries “as much information as possible” about \widetilde{Y}.

The two extremal couplings

Define two functions H_L, H_U : \mathbb{R}^2 \to [0,1] by

H_U(x,y) = \min\{F(x), G(y)\} and H_L(x,y) = \max\{F(x)+G(y) - 1, 0\}.

With some work, one can show that H_L and H_U are distribution functions on \mathbb{R}^2 and that they have the correct marginals. In this post I would like to talk about how to construct random vectors (\widetilde{X}_U, \widetilde{Y}_U) \sim H_U and (\widetilde{X}_L, \widetilde{Y}_L) \sim H_L.

Let F^{-1} and G^{-1} be the quantile functions of F and G. That is,

F^{-1}(c) = \inf\{ x \in \mathbb{R} : F(x) \ge c\} and G^{-1}(c) = \inf\{ x \in \mathbb{R} : G(x) \ge c\}.

Now let V be a random variable that is uniformly distributed on [0,1] and define

\widetilde{X}_U = F^{-1}(V) and \widetilde{Y}_U = G^{-1}(V).

Since F^{-1}(V) \le x if and only if V \le F(x), we have \widetilde{X}_U \stackrel{\text{dist}}{=} X and likewise \widetilde{Y}_U \stackrel{\text{dist}}{=} Y. Furthermore \widetilde{X}_U \le x, \widetilde{Y}_U \le y occurs if and only if V \le F(x), V \le G(y) which is equivalent to V \le \min\{F(x),G(y)\}. Thus

\mathbb{P}(\widetilde{X}_U \le x, \widetilde{Y}_U \le y) = \mathbb{P}(V \le \min\{F(x),G(y)\})= \min\{F(x),G(y)\}.

Thus (\widetilde{X}_U,\widetilde{Y}_U) is distributed according to H_U. We see that under the coupling H_U, \widetilde{X}_U and \widetilde{Y}_U are closely related as they are both increasing functions of a common random variable V.

We can follow a similar construction for H_L. Define

\widetilde{X}_L = F^{-1}(V) and \widetilde{Y}_L = G^{-1}(1-V).

Thus \widetilde{X}_L and \widetilde{Y}_L are again functions of a common random variable V but \widetilde{X}_L is an increasing function of V and \widetilde{Y}_L is a decreasing function of V. Note that 1-V is also uniformly distributed on [0,1]. Thus \widetilde{X}_L \stackrel{\text{dist}}{=} X and \widetilde{Y}_L \stackrel{\text{dist}}{=} Y.

Now \widetilde{X}_L \le x, \widetilde{Y}_L \le y occurs if and only if V \le F(x) and 1-V \le G(y) which occurs if and only if 1-G(y) \le V \le F(x). If 1-G(y) \le F(x), then F(x)+G(y)-1 \ge 0 and \mathbb{P}(1-G(y) \le V \le F(x)) =F(x)+G(y)-1. On the other hand, if 1 - G(y) > F(x), then F(x)+G(y)-1< 0 and \mathbb{P}(1-G(y) \le V \le F(x))=0. Thus

\mathbb{P}(\widetilde{X}_L \le x, \widetilde{Y}_L \le y) = \mathbb{P}(1-G(y) \le V \le F(x)) = \max\{F(x)+G(y)-1,0\},

and so (\widetilde{X}_L,\widetilde{Y}_L) is distributed according to H_L.
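
To see all three couplings side by side, here is a short simulation sketch of my own (the choice of F as the Exp(1) distribution and G as the standard normal is arbitrary). It builds the three couplings from uniform random variables via the quantile functions and prints their empirical correlations.

```python
# Build the H_U, H_I and H_L couplings of F = Exp(1) and G = N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
V = rng.uniform(size=100_000)
W = rng.uniform(size=100_000)          # an independent uniform for H_I

X_U, Y_U = stats.expon.ppf(V), stats.norm.ppf(V)      # H_U: common V
X_I, Y_I = stats.expon.ppf(V), stats.norm.ppf(W)      # H_I: independent uniforms
X_L, Y_L = stats.expon.ppf(V), stats.norm.ppf(1 - V)  # H_L: V and 1 - V

for name, x, y in [("H_U", X_U, Y_U), ("H_I", X_I, Y_I), ("H_L", X_L, Y_L)]:
    print(name, round(np.corrcoef(x, y)[0, 1], 3))
# Expect a positive correlation for H_U, roughly 0 for H_I and a negative
# correlation for H_L; the marginals are Exp(1) and N(0, 1) in all three cases.
```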

What makes H_U and H_L extreme?

Now that we know that H_U and H_L are indeed couplings, it is natural to ask what makes them “extreme”. What we would like to say is that \widetilde{Y}_U is an increasing function of \widetilde{X}_U and \widetilde{Y}_L is a decreasing function of \widetilde{X}_L. Unfortunately this isn’t always the case, as can be seen by taking X to be constant and Y to be continuous: no function of the constant random variable \widetilde{X}_U can have a continuous distribution.

However the intuition that \widetilde{Y}_U is increasing in \widetilde{X}_U and \widetilde{Y}_L is decreasing in \widetilde{X}_L is close to correct. Given a coupling (\widetilde{X},\widetilde{Y}) \sim H, for any x with F(x) > 0 we can look at the quantity

C(x,y) = \mathbb{P}(\widetilde{Y} \le y \mid \widetilde{X} \le x) - \mathbb{P}(\widetilde{Y} \le y) = \frac{H(x,y)}{F(x)} - G(y).

This quantity tells us something about how \widetilde{Y} changes with \widetilde{X}. For instance if \widetilde{X} and \widetilde{Y} were positively associated, then we would expect C(x,y) to be positive, and if they were negatively associated, then we would expect C(x,y) to be negative.

For the independent coupling (\widetilde{X}_I,\widetilde{Y}_I) \sim H_I, the quantity C(x,y) is constantly 0. It turns out that C(x,y) is maximised by the coupling (\widetilde{X}_U, \widetilde{Y}_U) \sim H_U and minimised by (\widetilde{X}_L,\widetilde{Y}_L) \sim H_L, and it is in this sense that they are extremal. This final claim is the two dimensional version of the Fréchet-Hoeffding Theorem and checking it is a good exercise.

An art and maths collaboration

Over the course of the past year I have had the pleasure of working with the artist Sanne Carroll on her honours project at the Australian National University. I was one of two mathematics students, the other being Ciaran, who collaborated with Sanne. Over the project Sanne drew patterns and would ask Ciaran and me to recreate them using some mathematical or algorithmic ideas. You can see the final version of the project here: https://www.sannecarroll.com/ (best viewed on a computer).

I always loved the patterns Sanne drew and the final project is so well put together. Sanne does a great job of incorporating her drawings, the mathematical descriptions and the communication between her, Ciaran and me. Her website building skills also far surpass anything I’ve done on this blog!

It was also a lot of fun to work with Sanne. Hearing about her patterns and talking about maths with her was always fun. I also learnt a few things about GeoGebra which made the animations in my previous post a lot quicker to make. Sanne has told me that she’ll be starting a PhD soon and I’m looking forward to any future collaborations that might arise.

Finitely Additive Measures

I am again tutoring the course MATH3029. The content is largely the same but the name has changed from “Probability Modelling and Applications” to “Probability Theory and Applications” to better reflect the material taught. There was a good question on the first assignment that leads to some interesting mathematics.

The Question

The assignment question is as follows. Let \Omega be a set and let \mathcal{F} \subseteq \mathcal{P}(\Omega) be a \sigma-algebra on \Omega. Let \mathbb{P} : \mathcal{F} \to [0,\infty) be a function with the following properties

  1. \mathbb{P}(\Omega) = 1.
  2. For any finite sequence of pairwise disjoint sets (A_k)_{k=1}^n in \mathcal{F}, we have \mathbb{P}\left(\bigcup_{k=1}^n A_k \right) = \sum_{k=1}^n \mathbb{P}(A_k).
  3. If (B_n)_{n=1}^\infty is a sequence of sets in \mathcal{F} such that B_{n+1} \subseteq B_n for all n \in \mathbb{N} and \bigcap_{n=1}^\infty B_n = \emptyset, then, as n tends to infinity, \mathbb{P}(B_n) \to 0.

Students were then asked to show that the function \mathbb{P} is a probability measure on (\Omega, \mathcal{F}). This amounts to showing that \mathbb{P} is countably additive. That is, if (A_k)_{k=1}^\infty is a sequence of pairwise disjoint sets in \mathcal{F}, then \mathbb{P}\left(\bigcup_{k=1}^\infty A_k\right) = \sum_{k=1}^\infty \mathbb{P}(A_k). One way to do this is to define B_n = \bigcup_{k=n+1}^\infty A_k. Since the sets (A_k)_{k=1}^\infty are pairwise disjoint, the sets (B_n)_{n=1}^\infty satisfy the assumptions of the third property of \mathbb{P}. Thus we can conclude that \mathbb{P}(B_n) \to 0 as n \to \infty.

We also have that for every n \in \mathbb{N}, \bigcup_{k=1}^\infty A_k = \left(\bigcup_{k=1}^n A_k\right) \cup B_n. Thus by applying the second property of \mathbb{P} twice we get

\mathbb{P}\left(\bigcup_{k=1}^\infty A_k \right) = \mathbb{P}\left( \bigcup_{k=1}^n A_k\right) + \mathbb{P}(B_n) = \sum_{k=1}^n \mathbb{P}(A_k) + \mathbb{P}(B_n).

If we let n tend to infinity, then we get the desired result.

A Follow Up Question

A natural follow up question is whether all three of the assumptions in the question are necessary. It is particularly interesting to ask if there is an example of a function \mathbb{P} that satisfies the first two properties but is not a probability measure. It turns out the answer is yes but coming up with an example involves some serious mathematics.

Let \Omega be the set of natural numbers \mathbb{N} = \{1,2,3,\ldots\} and let \mathcal{F} be the power set of \mathbb{N}.

One way in which people talk about the size of a subset of natural numbers A \subseteq \mathbb{N} is to look at the proportion of elements in A and take a limit. That is we could define

\mathbb{P}(A) = \lim_{n \to \infty}\frac{|\{k \in A \mid k \le n \}|}{n}.

This function \mathbb{P} has some nice properties. For instance, if A is the set of all even numbers then \mathbb{P}(A) = 1/2. More generally if A is the set of all numbers divisible by k, then \mathbb{P}(A) = 1/k. The function \mathbb{P} gets used a lot. When people say that almost all natural numbers satisfy a property, they normally mean that if A is the subset of all numbers satisfying the property, then \mathbb{P}(A)=1.

However the function \mathbb{P} is not a probability measure: it is finitely additive but not countably additive. To see that \mathbb{P} is finitely additive, let (A_i)_{i=1}^m be a finite collection of disjoint subsets of \mathbb{N} and let A = \bigcup_{i=1}^m A_i. Then for every natural number n,

\{k \in A \mid k \le n \} = \bigcup_{i=1}^m \{k \in A_i \mid k \le n\}.

Since the sets (A_i)_{i=1}^m are disjoint, the union on the right is a disjoint union. Thus we have

\frac{|\{k \in A \mid k \le n \}|}{n} = \sum_{i=1}^m \frac{|\{k \in A_i \mid k \le n \}|}{n}.

Taking limits on both sides gives \mathbb{P}(A)=\sum_{i=1}^m \mathbb{P}(A_i), as required. To see that \mathbb{P} is not countably additive, let A_i = \{i\} for each i \in \mathbb{N}. Then \bigcup_{i=1}^\infty A_i = \mathbb{N} and \mathbb{P}(\mathbb{N})=1. On the other hand \mathbb{P}(A_i)=0 for every i \in \mathbb{N} and hence \sum_{i=1}^\infty \mathbb{P}(A_i)=0 \neq \mathbb{P}(\mathbb{N}).

Thus it would appear that we have an example of a finitely additive measure that is not countably additive. However there is a big problem with the above definition of \mathbb{P}: the limit of \frac{|\{k \in A \mid k \le n \}|}{n} does not always exist. Consider the set A = \{3,4,9,10,11,12,13,14,15,16,33,\ldots\}, i.e. a number k \in \mathbb{N} is in A if and only if 2^{m} < k \le 2^{m+1} for some odd number m \ge 1. The idea with the set A is the following.

There are chunks of numbers that alternate between being in A and not being in A, and as we move further along, these chunks double in size. Let a_n denote the sequence of numbers \frac{|\{k \in A \mid k \le n \}|}{n}. We can see that a_n increases while n is in a chunk that belongs to A and decreases when n is in a chunk not in A. More specifically, if 2^{2m-1} < n \le 2^{2m}, then a_n is increasing but if 2^{2m} < n \le 2^{2m+1}, then a_n is decreasing.

At the turning points n = 2^{2m} or n = 2^{2m+1} we can calculate exactly what a_n is equal to. Note that

|\{k \in A \mid k \le 2^{2m} \}| = 2+8+32+\ldots+2\cdot 4^{m-1} = 2\cdot \frac{4^m-1}{4-1} = \frac{2}{3}(4^m-1).

Furthermore since there are no elements of A between 2^{2m} and 2^{2m+1} we have

|\{k \in A \mid k \le 2^{2m+1}\}| = \frac{2}{3}(4^m-1).

Thus we have

a_{2^{2m}} = \frac{\frac{2}{3}(4^m-1)}{2^{2m}}=\frac{2}{3}\frac{4^m-1}{4^m} and a_{2^{2m+1}} = \frac{\frac{2}{3}(4^m-1)}{2^{2m+1}}=\frac{1}{3}\frac{4^m-1}{4^m}.

Hence the values a_n fluctuate, approaching \frac{2}{3} along the subsequence n = 2^{2m} and \frac{1}{3} along n = 2^{2m+1}. Thus the limit of a_n does not exist and hence \mathbb{P}(A) is not well-defined.
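
This oscillation is easy to observe numerically. Below is a small sketch of my own (the membership test simply restates the definition of A) that prints a_n at the turning points n = 2^j:

```python
# Compute a_n = |{k in A : k <= n}| / n at the turning points n = 2^j.
def in_A(k):
    if k <= 2:
        return False
    m = (k - 1).bit_length() - 1   # the unique m with 2^m < k <= 2^(m+1)
    return m % 2 == 1              # k is in A iff this m is odd

count = 0
for n in range(1, 2 ** 16 + 1):
    count += in_A(n)
    if n & (n - 1) == 0 and n >= 4:        # n is a power of two
        print(f"a_{n} = {count / n:.4f}")
# The printed values alternate, approaching 2/3 at n = 2^(2m) and 1/3 at
# n = 2^(2m+1), so a_n has no limit.
```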

There is a work around using something called a Banach limit. Banach limits are a way of extending the notion of a limit from the space of convergent sequences to the space of bounded sequences. Banach limits aren’t uniquely defined and don’t have a formula describing them. Indeed, to prove the existence of Banach limits one has to rely on non-constructive mathematics such as the Hahn-Banach extension theorem. So if we take for granted the existence of Banach limits we can define

\mathbb{P}(A) = L\left( \left(\frac{|\{k \in A \mid k \le n \}|}{n}\right)_{n=1}^\infty \right),

where L is now a Banach limit. This new definition of \mathbb{P} is defined on all subsets of \mathbb{N} and is an example of a measure that is finitely additive but not countably additive. However the definition of \mathbb{P} is very non-constructive. Indeed there are models of ZF set theory where the Hahn-Banach theorem does not hold and we cannot prove the existence of Banach limits.

This raises the question of whether or not there exist constructible examples of measures that are finitely additive but not countably additive. A bit of Googling reveals that non-principal ultrafilters provide another way of defining finitely additive measures that are not countably additive. However the existence of a non-principal ultrafilter on \mathbb{N} again cannot be proved in ZF alone and follows from a weak form of the axiom of choice. Thus it seems that the existence of such a measure may be inherently non-constructive. This discussion on Math Overflow goes into more detail.

Why is the fundamental theorem of arithmetic a theorem?

The fundamental theorem of arithmetic states that every natural number greater than one can be factorized uniquely as a product of prime numbers. The word “uniquely” here means unique up to rearranging. The theorem means that if you and I take the same number n and I write n = p_1p_2\ldots p_k and you write n = q_1q_2\ldots q_l where each p_i and q_i is a prime number, then in fact k=l and we wrote the same prime numbers (but maybe in a different order).

Most people happily accept this theorem as self evident and believe it without proof. Indeed some people take it to be so self evident they feel it doesn’t really deserve the name “theorem” – hence the title of this blog post. In this post I want to highlight two situations where an analogous theorem fails.

Situation One: The Even Numbers

Imagine a world where everything comes in twos. In this world nobody knows of the number one or indeed any odd number. Their counting numbers are the even numbers \mathbb{E} = \{2,4,6,8,\ldots\}. People in this world can add numbers and multiply numbers just like we can. They can even talk about divisibility, for example 2 divides 8 since 8 = 4\cdot 2. Note that things are already getting a bit strange in this world. Since there is no number one, numbers in this world do not divide themselves.

Once people can talk about divisibility, they can talk about prime numbers. A number is prime in this world if it is not divisible by any other number. For example 2 is prime but as we saw 8 is not prime. Surprisingly the number 6 is also prime in this world. This is because there are no two even numbers that multiply together to make 6.

If a number is not prime in this world, we can reduce it to a product of primes. This is because if n is not prime, then there are two numbers a and b such that n = ab. Since a and b are both smaller than n, we can apply the same argument and recursively write n as a product of primes.

Now we can ask whether or not the fundamental theorem of arithmetic holds in this world. Namely, we want to know if there is a unique way to factorize each number in this world. To get an idea we can start with some small even numbers.

  • 2 is prime.
  • 4 = 2 \cdot 2 can be factorized uniquely.
  • 6 is prime.
  • 8  = 2\cdot 2 \cdot 2 can be factorized uniquely.
  • 10 is prime.
  • 12 = 2 \cdot 6 can be factorized uniquely.
  • 14 is prime.
  • 16 = 2\cdot 2 \cdot 2 \cdot 2 can be factorized uniquely.
  • 18 is prime.
  • 20 = 2 \cdot 10 can be factorized uniquely.

Thus it seems as though there might be some hope for this theorem. It at least holds for the first handful of numbers. Unfortunately we eventually get to 36 and we have:

36 = 2 \cdot 18 and 36 = 6 \cdot 6.

Thus there are two distinct ways of writing 36 as a product of primes in this world and thus the fundamental theorem of arithmetic does not hold.
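
Since everything here is elementary, the failure is easy to find by computer. The following is a sketch of my own (not from the original post): it enumerates factorizations into primes of \mathbb{E} and prints every number up to 100 with more than one factorization.

```python
# Find even numbers with two distinct factorizations into "even world" primes.
def is_prime_in_E(n):
    # n = a * b with a, b even forces 4 | n, so n is prime in E iff 4 does not divide n
    return n % 4 != 0

def factorizations(n, min_factor=2):
    """All factorizations of even n into primes of E, as sorted tuples."""
    results = set()
    if is_prime_in_E(n) and n >= min_factor:
        results.add((n,))
    for p in range(min_factor, n // 2 + 1, 2):
        if is_prime_in_E(p) and n % p == 0 and (n // p) % 2 == 0:
            for rest in factorizations(n // p, p):
                results.add(tuple(sorted((p,) + rest)))
    return results

for n in range(2, 101, 2):
    facts = factorizations(n)
    if len(facts) > 1:
        print(n, sorted(facts))   # 36 is the first example: [(2, 18), (6, 6)]
```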

Situation Two: A Number Ring

While the first example is fun and interesting, it is somewhat artificial. We are unlikely to encounter a situation where we only have the even numbers. It is however common and natural for mathematicians to be led into certain worlds called number rings. We will see one example here and see what effect assuming the fundamental theorem of arithmetic can have.

Suppose we want to solve the equation x^2+19=y^3 where x and y are both integers. One way to try to solve this is by rewriting the equation as (x+\sqrt{-19})(x-\sqrt{-19}) = y^3. With this rewriting we have left the familiar world of the whole numbers and entered the number ring \mathbb{Z}[\sqrt{-19}].

In \mathbb{Z}[\sqrt{-19}] all numbers have the form a + b \sqrt{-19}, where a and b are integers. Addition of two such numbers is defined like so

(a+b\sqrt{-19}) + (c + d \sqrt{-19}) = (a+c) + (b+d)\sqrt{-19}.

Multiplication is defined by using the distributive law and the fact that (\sqrt{-19})^2 = -19. Thus

(a+b\sqrt{-19})(c+d\sqrt{-19}) = (ac-19bd) + (ad+bc)\sqrt{-19}.

Since we have multiplication we can talk about when a number in \mathbb{Z}[\sqrt{-19}] divides another and hence define primes in \mathbb{Z}[\sqrt{-19}]. One can show that if x^2 + 19 = y^3, then x+\sqrt{-19} and x-\sqrt{-19} are coprime in \mathbb{Z}[\sqrt{-19}] (see the references at the end of this post).

This means that there are no primes in \mathbb{Z}[\sqrt{-19}] that divide both x+\sqrt{-19} and x-\sqrt{-19}. If we assume that the fundamental theorem of arithmetic holds in \mathbb{Z}[\sqrt{-19}], then this implies that x+\sqrt{-19} must itself be a cube. This is because (x+\sqrt{-19})(x-\sqrt{-19})=y^3 is a cube and if two coprime numbers multiply to a cube, then both of those coprime numbers must be cubes.

Thus we can conclude that there are integers a and b such that x+\sqrt{-19} = (a+b\sqrt{-19})^3. If we expand out this cube we can conclude that

x+\sqrt{-19} = (a^3-57ab^2)+(3a^2b-19b^3)\sqrt{-19}.

Thus in particular we have 1=3a^2b-19b^3=(3a^2-19b^2)b. This implies that b = \pm 1 and 3a^2-19b^2=\pm 1. Hence b^2=1 and 3a^2-19 = \pm 1. Now if 3a^2 -19 =-1, then a^2=6 – a contradiction. Similarly if 3a^2-19=1, then 3a^2=20 – another contradiction. Thus we can conclude there are no integer solutions to the equation x^2+19=y^3!

Unfortunately however, a bit of searching reveals that 18^2+19=343=7^3. Thus simply assuming that the ring \mathbb{Z}[\sqrt{-19}] has unique factorization led us to incorrectly conclude that an equation had no solutions. The question of unique factorization in number rings such as \mathbb{Z}[\sqrt{-19}] is a subtle and important one. Some of the flawed proofs of Fermat’s Last Theorem incorrectly assume that certain number rings have unique factorization – like we did above.
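
As a quick sanity check, here is a small brute-force search, a sketch of my own, that recovers this solution:

```python
# Brute-force search for integer solutions to x^2 + 19 = y^3.
from math import isqrt

for y in range(3, 1000):
    x_squared = y ** 3 - 19
    x = isqrt(x_squared)
    if x * x == x_squared:
        print(x, y)   # prints 18 7 (and any other solutions with y < 1000)
```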

References

The lecturer David Smyth showed us that the even integers do not have unique factorization during a lecture of the great course MATH2222.

The example of \mathbb{Z}[\sqrt{-19}] failing to have unique factorization and the consequences of this were shown in a lecture for a course on algebraic number theory by James Borger. In this class we followed the (freely available) textbook “Number Rings” by P. Stevenhagen. Problem 1.4 on page 8 is the example I used in this post. By viewing the textbook you can see a complete solution to the problem.

Polynomial Pairing Functions

One of the great results of the 19th century German mathematician Georg Cantor is that the sets \mathbb{N} and \mathbb{N} \times \mathbb{N} have the same cardinality. That is, the set of all non-negative integers \mathbb{N} = \{0,1,2,3,\ldots\} has the same size as the set of all pairs of non-negative integers \mathbb{N} \times \mathbb{N} = \{(m,n) \mid m,n \in \mathbb{N} \}, or put less precisely, “infinity times infinity equals infinity”.

Proving this result amounts to finding a bijection p : \mathbb{N} \times \mathbb{N} \to \mathbb{N}. We will call such a function a pairing function since it takes in two numbers and pairs them together to create a single number. An example of one such function is

p(m,n) = 2^m(2n+1)-1.

This function is a bijection because every positive whole number can be written uniquely as the product of a power of two and an odd number. Such functions are of practical as well as theoretical interest. Computers use pairing functions to store matrices and higher dimensional arrays. The entries of the matrix are actually stored in a list. When the user gives two numbers corresponding to a matrix entry, the computer uses a pairing function to get just one number which gives the index of the corresponding entry in the list. Thus having efficiently computable pairing functions is of practical importance.
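
To make this concrete, here is a short sketch of my own implementing p and its inverse; the inverse just strips the power of two from k+1:

```python
# The pairing function p(m, n) = 2^m (2n + 1) - 1 and its inverse.
def p(m, n):
    return 2 ** m * (2 * n + 1) - 1

def p_inverse(k):
    k += 1
    m = 0
    while k % 2 == 0:       # strip the unique power of two dividing k + 1
        k //= 2
        m += 1
    return m, (k - 1) // 2  # the remaining odd part is 2n + 1

assert all(p_inverse(p(m, n)) == (m, n) for m in range(50) for n in range(50))
assert {p(m, n) for m in range(20) for n in range(20)} >= set(range(40))
```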

Representing Pairing Functions

Our previous example of a pairing function was given by a formula. Another way to define pairing functions is by first representing the set \mathbb{N} \times \mathbb{N} as a grid with an infinite number of rows and columns like so:

We can then represent a pairing function as a path through this grid that passes through each square exactly once. Here are two examples:

The way to go from one of these paths to a function from \mathbb{N} \times \mathbb{N} to \mathbb{N} is as follows. Given as input a pair of integers (m,n), first find the dot that (m,n) represents in the grid. Next count the number of backwards steps that need to be taken to get to the start of the path and then output this number.

We can also do the reverse of the above procedure. That is, given a pairing function p : \mathbb{N} \times \mathbb{N} \to \mathbb{N}, we can represent p as a path in the grid. This is done by starting at p^{-1}(0) and joining p^{-1}(n) to p^{-1}(n+1). It’s a fun exercise to work out what the path corresponding to p(m,n) = 2^m(2n+1)-1 looks like.

Cantor’s Pairing Function

The pairing function that Cantor used is not any of the ones we have seen so far. Cantor used a pairing function which we will call q. When represented as a path, this is what q looks like:

Surprisingly there’s a simple formula that represents this pairing function q which we will now derive. First note that if we are at a point (k,0), then the value of q(k,0) is 1+2+3+\ldots+(k-1)+k= \frac{1}{2}k(k+1). This is because to get to (k,0) from (0,0) = q^{-1}(0), we have to go along k diagonals which each increase in length.

Now let (m,n) be an arbitrary pair of integers and let k = m+n. The above path first goes through (k,0) and then takes n steps to get to (m,n). Thus

q(m,n) = q(k,0)+n = \frac{1}{2}k(k+1)+n = \frac{1}{2}(n+m)(n+m+1)+n.

And so Cantor’s pairing function is actually a quadratic polynomial in two variables!
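
Here is a small computational check, my own sketch, that the quadratic formula really does enumerate \mathbb{N} \times \mathbb{N} one diagonal at a time:

```python
# Cantor's pairing function: walk the diagonals m + n = 0, 1, 2, ...
def q(m, n):
    return (m + n) * (m + n + 1) // 2 + n

# Restricted to the first 50 diagonals, q should hit 0, 1, ..., 1274 exactly once.
values = sorted(q(m, n) for m in range(50) for n in range(50) if m + n < 50)
assert values == list(range(50 * 51 // 2))
```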

Other Polynomial Pairing Functions?

Whenever we have a pairing function p, we can switch the order of the inputs and get a new pairing function \widetilde{p}. That is the function \widetilde{p} is given by \widetilde{p}(m,n)=p(n,m). When thinking of pairing functions as paths in a grid, this transformation amounts to reflecting the picture along the diagonal m = n.

Thus there are at least two quadratic pairing functions, Cantor’s function q and its switched cousin \widetilde{q}. The Fueter–Pólya theorem states that these two are actually the only quadratic pairing functions! In fact it is conjectured that these two quadratics are the only polynomial pairing functions but this is still an open question.

Thank you to Wikipedia!

I first learnt that the sets \mathbb{N} and \mathbb{N} \times \mathbb{N} have the same cardinality in class a number of years ago. I only recently learnt about Cantor’s polynomial pairing function and the Fueter–Pólya theorem by stumbling across the Wikipedia page for pairing functions. Wikipedia is a great source for discovering new mathematics and for checking results. I use Wikipedia all the time. Many of these blog posts were initially inspired by Wikipedia entries.

Currently, Wikipedia is doing their annual fundraiser. If you are a frequent user of Wikipedia like me, I’d encourage you to join me in donating a couple of dollars to them: https://donate.wikimedia.org.

A minimal counterexample in probability theory

Last semester I tutored the course Probability Modelling with Applications. In this course the main objects of study are probability spaces. A probability space is a triple (\Omega, \mathcal{F}, \mathbb{P}) where:

  1. \Omega is a set.
  2. \mathcal{F} is a \sigma-algebra on \Omega. That is, \mathcal{F} is a collection of subsets of \Omega such that \Omega \in \mathcal{F} and \mathcal{F} is closed under set complements and countable unions. The elements of \mathcal{F} are called events and they are precisely the subsets of \Omega that we can assign probabilities to. We will denote the power set of \Omega by 2^\Omega and hence \mathcal{F} \subseteq 2^\Omega.
  3. \mathbb{P} is a probability measure. That is, it is a function \mathbb{P}: \mathcal{F} \rightarrow [0,1] such that \mathbb{P}(\Omega)=1 and for all countable collections \{A_i\}_{i=1}^\infty \subseteq \mathcal{F} of mutually disjoint subsets we have that \mathbb{P} \left(\bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty \mathbb{P}(A_i).

It’s common for students to find probability spaces, and in particular \sigma-algebras, confusing. Unfortunately Vitali showed that \sigma-algebras can’t be avoided if we want to study probability spaces such as a uniform random point in [0,1] or an infinite number of coin tosses. One of the main reasons why \sigma-algebras can be so confusing is that it can be very hard to give concrete descriptions of all the elements of a \sigma-algebra.

We often have a collection \mathcal{G} of subsets of \Omega that we are interested in but this collection fails to be a \sigma-algebra. For example, we might have \Omega = \mathbb{R}^n and \mathcal{G} is the collection of open subsets. In this situation we take our \sigma-algebra \mathcal{F} to be \sigma(\mathcal{G}) which is the smallest \sigma-algebra containing \mathcal{G}. That is

\sigma(\mathcal{G}) = \bigcap \mathcal{F}'

where the above intersection is taken over all \sigma-algebras \mathcal{F}' that contain \mathcal{G}. In this setting we will say that \mathcal{G} generates \sigma(\mathcal{G}). When we have such a collection of generators, we might have an idea of what probability we would like to assign to sets in \mathcal{G}. That is, we have a function \mathbb{P}_0 : \mathcal{G} \rightarrow [0,1] and we want to extend this function to create a probability measure \mathbb{P} : \sigma(\mathcal{G}) \rightarrow [0,1]. A famous theorem due to Carathéodory shows that we can do this in many cases.

An interesting question is whether the extension \mathbb{P} is unique. That is, does there exist a probability measure \mathbb{P}' on \sigma(\mathcal{G}) such that \mathbb{P} \neq \mathbb{P}' but \mathbb{P}_{\mid \mathcal{G}} = \mathbb{P}'_{\mid \mathcal{G}}? The following theorem gives a criterion that guarantees no such \mathbb{P}' exists.

Theorem: Let \Omega be a set and let \mathcal{G} be a collection of subsets of \Omega that is closed under finite intersections. Then if \mathbb{P},\mathbb{P}' : \sigma(\mathcal{G}) \rightarrow [0,1] are two probability measures such that \mathbb{P}_{\mid \mathcal{G}} = \mathbb{P}'_{\mid \mathcal{G}}, then \mathbb{P} = \mathbb{P}'.

The above theorem is very useful for two reasons. Firstly it can be combined with Caratheodory’s extension theorem to uniquely define probability measures on a \sigma-algebra by specifying the values on a collection of simple subsets \mathcal{G}. Secondly if we ever want to show that two probability measures are equal, the above theorem tells us we can reduce the problem to checking equality on the simpler subsets in \mathcal{G}.

The condition that \mathcal{G} must be closed under finite intersections is somewhat intuitive. Suppose we had A,B \in \mathcal{G} but A \cap B \notin \mathcal{G}. We will however have A \cap B \in \sigma(\mathcal{G}) and thus we might be able to find two probability measures \mathbb{P},\mathbb{P}' : \sigma(\mathcal{G}) \rightarrow [0,1] such that \mathbb{P}(A) = \mathbb{P}'(A) and \mathbb{P}(B)=\mathbb{P}'(B) but \mathbb{P}(A \cap B) \neq \mathbb{P}'(A \cap B). The following counterexample shows that this intuition is indeed well-founded.

When looking for examples and counterexamples, it’s good to try to keep things as simple as possible. With that in mind we will try to find a counterexample where \Omega is a finite set with as few elements as possible and \sigma(\mathcal{G}) is equal to the power set of \Omega. In this setting, a probability measure \mathbb{P}: \sigma(\mathcal{G}) \rightarrow [0,1] can be defined by specifying the values \mathbb{P}(\{\omega\}) for each \omega \in \Omega.

We will now try to find a counterexample when \Omega is as small as possible. Unfortunately we won’t be able to find a counterexample when \Omega only contains one or two elements. This is because we want to find A,B \subseteq \Omega such that A \cap B is not equal to A, B or \emptyset.

Thus we will start our search with a three element set \Omega = \{a,b,c\}. Up to relabelling the elements of \Omega, the only interesting choice we have for \mathcal{G} is \{ \{a,b\} , \{b,c\} \}. This has a chance of working since \mathcal{G} is not closed under intersection. However any probability measure \mathbb{P} on \sigma(\mathcal{G}) = 2^{\{a,b,c\}} must satisfy the equations

  1. \mathbb{P}(\{a\})+\mathbb{P}(\{b\})+\mathbb{P}(\{c\}) = 1,
  2. \mathbb{P}(\{a\})+\mathbb{P} (\{b\}) = \mathbb{P}(\{a,b\}),
  3. \mathbb{P}(\{b\})+\mathbb{P}(\{c\}) = \mathbb{P}(\{b,c\}).

Thus \mathbb{P}(\{a\}) = 1- \mathbb{P}(\{b,c\}), \mathbb{P}(\{c\}) = 1-\mathbb{P}(\{a,b\}) and \mathbb{P}(\{b\})=\mathbb{P}(\{a,b\})+\mathbb{P}(\{b,c\})-1. Thus \mathbb{P} is determined by its values on \{a,b\} and \{b,c\}, and no counterexample exists on a three element set.

However, a four element set \{a,b,c,d\} is sufficient for our counterexample! We can let \mathcal{G} = \{\{a,b\},\{b,c\}\}. Then \sigma(\mathcal{G})=2^{\{a,b,c,d\}} (intersecting the generators and their complements produces every singleton) and we can define \mathbb{P} , \mathbb{P}' : \sigma (\mathcal{G}) \rightarrow [0,1] by

  • \mathbb{P}(\{a\}) = 0, \mathbb{P}(\{b\})=0.5, \mathbb{P}(\{c\})=0 and \mathbb{P}(\{d\})=0.5.
  • \mathbb{P}'(\{a\})=0.5, \mathbb{P}'(\{b\})=0, \mathbb{P}'(\{c\})=0.5 and \mathbb{P}'(\{d\})=0.

Clearly \mathbb{P} \neq \mathbb{P}' however \mathbb{P}(\{a,b\})=\mathbb{P}'(\{a,b\})=0.5 and \mathbb{P}(\{b,c\})=\mathbb{P}'(\{b,c\})=0.5. Thus we have our counterexample! In general for any \lambda \in [0,1) we can define the probability measure \mathbb{P}_\lambda = \lambda\mathbb{P}+(1-\lambda)\mathbb{P}'. The measure \mathbb{P}_\lambda is not equal to \mathbb{P} but agrees with \mathbb{P} on \mathcal{G}. In general, if we have two probability measures that agree on \mathcal{G} but not on \sigma(\mathcal{G}) then we can produce uncountably many such measures by taking convex combinations as done above.
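
The whole counterexample is small enough to check exhaustively. Here is a sketch of my own that enumerates every event in 2^{\{a,b,c,d\}} and confirms that \mathbb{P} and \mathbb{P}' agree on \mathcal{G} but not on all of \sigma(\mathcal{G}):

```python
# Verify the counterexample by enumerating all events in 2^{a,b,c,d}.
from itertools import chain, combinations

omega = "abcd"
P1 = {"a": 0.0, "b": 0.5, "c": 0.0, "d": 0.5}
P2 = {"a": 0.5, "b": 0.0, "c": 0.5, "d": 0.0}

def measure(mu, event):
    return sum(mu[w] for w in event)

events = list(chain.from_iterable(combinations(omega, r) for r in range(5)))
G = [("a", "b"), ("b", "c")]
assert all(measure(P1, B) == measure(P2, B) for B in G)        # agree on G
print([B for B in events if measure(P1, B) != measure(P2, B)]) # but differ elsewhere
```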

[Link Post] Complexity Penalties in Statistical Machine Learning

Earlier in the year I wrote a post on Less Wrong about some of the material I learnt at the 2019 AMSI Summer School. You can read it here. On a related note, applications are open for the 2020 AMSI Summer School at La Trobe University. I highly recommend attending!

Combing groups

Two weeks ago I gave a talk titled “Two Combings of \mathbb{Z}^2“. The talk was about some material I have been discussing lately with my honours supervisor. The talk went well and I thought it would be worth sharing a written version of what I said.

Geometric Group Theory

Combings are a tool that gets used in a branch of mathematics called geometric group theory. Geometric group theory is a relatively new area of mathematics and is only around 30 years old. The main idea behind geometric group theory is to use tools and ideas from geometry and low dimensional topology to study and understand groups. It turns out that some of the simplest questions one can ask about groups have interesting geometric answers. For instance, the Dehn function of a group gives a natural geometric answer to the solvability of the word problem.

Generators

Before we can define what a combing is we’ll need to set up some notation. If A is a set then we will write A^* for the set of words written using elements of A and inverses of elements of A. For instance if A = \{a,b\}, then A^* = \{\varepsilon, a, b, a^{-1}, b^{-1}, a^2, abb^{-1}a^{-3},\ldots\} (here \varepsilon denotes the empty word and a^2 abbreviates the word aa). If w is a word in A^*, we will write l(w) for the length of w. Thus l(\varepsilon)=0, l(a)=1, l(abb^{-1}a^{-3})=6 and so on.

If G is a group and A is a subset of G, then we have a natural map \pi : A^* \rightarrow G given by:

\pi(a_1^{\pm 1}a_2^{\pm 1}\ldots a_n^{\pm 1}) = a_1^{\pm 1}\cdot a_2^{\pm 1} \cdot \ldots \cdot a_n^{\pm 1}.

We will say that A generates G if the above map is surjective. In this case we will write \overline{w} for \pi(w) when w is a word in A^*.

The Word Metric

The geometry in “geometric group theory” often arises when studying how a group acts on different geometric spaces. A group always acts on itself by left multiplication. The following definition adds a geometric aspect to this action. If G is a group with generators A, then the word metric on G with respect to A is the function d_G : G \times G \rightarrow \mathbb{R} given by

d_G(g,h) = \min \{ l(w) \mid w \in A^*, \overline{w} = g^{-1}h \}.

That is, the distance between two group elements g,h \in G is the length of the shortest word in A^* we can use to represent g^{-1}h. Equivalently, the distance between g and h is the length of the shortest word we have to append to g to produce h. This metric is invariant under left-multiplication by G (i.e. d_G(g\cdot h,g\cdot h') =d_G(h,h') for all g,h,h' \in G). Thus G acts on (G,d_G) by isometries.

Words are Paths

Now that we are viewing the group G as a geometric space, we can also change how we think of words w \in A^*. Such a word can be thought of as a discrete path in G. That is, we can think of w as a function from \mathbb{N} to G. This way of thinking of w as a discrete path is best illuminated with an example. Suppose we have the word w = ab^2a^{-1}b, then

w(0) = e,
w(1) = \overline{a}
w(2) = \overline{ab}
w(3) = \overline{ab^2}
w(4) = \overline{ab^2a^{-1}}
w(5) = \overline{ab^2a^{-1}b}
w(t) = \overline{ab^2a^{-1}b}, t \ge 5.

Thus the path w : \mathbb{N} \rightarrow G is given by taking the first t letters of w and mapping this word to the group element it represents. With this interpretation of words in A^* in mind we can now define combings.

Combings

Let G be a group with a finite set of generators A. Then a combing of G with respect to A is a function \sigma : G \rightarrow A^* such that

  1. For all g \in G, \overline{\sigma_g} = g (we will write \sigma_g for \sigma(g)).
  2. There exists k >0 such that for all g, h \in G with g \cdot \overline{a} = h for some a \in A, we have that d_G(\sigma_g(t),\sigma_h(t)) \le k for all t \in \mathbb{N}.

The first condition says that we can think of \sigma as a way of picking a normal form \sigma_g \in A^* for each g \in G. The second condition is a bit more involved. It states that if the group elements g, h \in G are distance 1 from each other in the word metric, then the paths \sigma_g,\sigma_h  :  \mathbb{N} \rightarrow G are within distance k of each other at any point in time.

An Example

Not all groups can be given a combing. Indeed if we have a combing of G, then the word problem in G is solvable and the Dehn function of G is at most exponential. One group that does admit a combing is \mathbb{Z}^2 = \{(m,n) \mid m,n \in \mathbb{Z}\}. This group is generated by A = \{(1,0),(0,1)\} = \{\beta,\gamma\} and one combing of \mathbb{Z}^2 with respect to this generating set is

\sigma_{(m,n)} = \beta^m\gamma^n.

The first condition of being a combing is clearly satisfied and the following picture shows that the second condition can be satisfied with k = 2.
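
For readers without the picture, here is a small computational check of my own (restricted, for simplicity, to the positive quadrant and to the neighbours h = g\beta and h = g\gamma): it verifies the fellow traveller condition with k = 2 on a patch of \mathbb{Z}^2.

```python
# Check d(sigma_g(t), sigma_h(t)) <= 2 for neighbouring g, h in Z^2.
# With generators beta = (1, 0) and gamma = (0, 1), the word metric on Z^2
# is the taxicab (L1) distance.
def path(m, n, t):
    """Position after t letters of beta^m gamma^n (m, n >= 0)."""
    return (min(t, m), max(0, min(t - m, n)))

worst = 0
for m in range(10):
    for n in range(10):
        for dm, dn in [(1, 0), (0, 1)]:          # h = g * (a generator)
            for t in range(m + n + 2):
                x = path(m, n, t)
                y = path(m + dm, n + dn, t)
                worst = max(worst, abs(x[0] - y[0]) + abs(x[1] - y[1]))
print(worst)   # 2, matching the constant k = 2 from the picture
```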

A Non-Example

The discrete Heisenberg group, H_3, can be given by the following presentation

H_3 = \langle \alpha,\beta,\gamma \mid \alpha\gamma = \gamma\alpha, \beta\gamma = \gamma\beta, \beta\alpha = \alpha\beta\gamma \rangle.

That is, the group H_3 has three generators \alpha,\beta and \gamma. The generator \gamma commutes with both \alpha and \beta. The generators \alpha and \beta almost commute but don’t quite as seen in the relation \beta\alpha = \alpha\beta\gamma.

Any h \in H_3 can be represented uniquely as \sigma_h = \alpha^p\beta^m\gamma^n for p,m,n \in \mathbb{Z}. To see why such a representation exists it’s best to consider an example. Suppose that h = \overline{\gamma\beta\alpha\gamma\alpha\beta\gamma}. Then we can use the fact that \gamma commutes with \alpha and \beta to push all \gamma‘s to the right and we get that h = \overline{\beta\alpha\alpha\beta\gamma^3}. We can then apply the third relation to switch the order of \alpha and \beta on the right. This gives us that that h = \overline{\alpha\beta\gamma\alpha\beta\gamma^3}=\overline{\alpha\beta\alpha\beta\gamma^4}. If we apply this relation once more we get that h = \overline{\alpha^2\beta^2\gamma^5} and thus \sigma_h = \alpha^2\beta^2\gamma^5. The procedure used to write h in the form \alpha^p\beta^m\gamma^n can be generalized to any word written using \alpha^{\pm 1}, \beta^{\pm 1}, \gamma^{\pm 1}.
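
This reduction procedure is easy to automate. Below is a sketch of my own (the single-letter encoding of the generators is just for convenience): it represents a group element by the exponent triple (p,m,n) of its normal form \alpha^p\beta^m\gamma^n and multiplies letters in one at a time, using the group law (\alpha^p\beta^m\gamma^n)(\alpha^q\beta^r\gamma^s) = \alpha^{p+q}\beta^{m+r}\gamma^{n+s+mq}, which follows from the relation \beta^m\alpha^q = \alpha^q\beta^m\gamma^{mq} used above.

```python
# Reduce words in the discrete Heisenberg group H_3 to the normal form
# alpha^p beta^m gamma^n, represented by the exponent triple (p, m, n).
def multiply(x, y):
    p, m, n = x
    q, r, s = y
    return (p + q, m + r, n + s + m * q)   # beta^m alpha^q = alpha^q beta^m gamma^(m q)

LETTERS = {"a": (1, 0, 0), "A": (-1, 0, 0),    # alpha, alpha^{-1}
           "b": (0, 1, 0), "B": (0, -1, 0),    # beta,  beta^{-1}
           "c": (0, 0, 1), "C": (0, 0, -1)}    # gamma, gamma^{-1}

def normal_form(word):
    h = (0, 0, 0)
    for letter in word:
        h = multiply(h, LETTERS[letter])
    return h

# The example from the text: gamma beta alpha gamma alpha beta gamma
print(normal_form("cbacabc"))   # (2, 2, 5), i.e. alpha^2 beta^2 gamma^5
```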

The fact that such a representation is unique (that is, if \overline{\alpha^p\beta^m\gamma^n} = \overline{\alpha^{p'}\beta^{m'}\gamma^{n'}}, then (p,m,n) = (p',m',n')) is harder to justify but can be proved by defining an action of H_3 on \mathbb{Z}^3. Thus we can define a function \sigma : H_3 \rightarrow \{\alpha,\beta,\gamma\}^* by setting \sigma_h to be the unique word of the form \alpha^p \beta^m\gamma^n that represents h. This map satisfies the first condition of being a combing and has many nice properties. These include that it is easy to check whether or not a word in \{\alpha,\beta,\gamma\}^* is equal to \sigma_h for some h \in H_3, and that there are fast algorithms for putting a word in \{\alpha,\beta,\gamma\}^* into its normal form. Unfortunately this map fails to be a combing.

The reason why \sigma : H_3 \rightarrow \{\alpha,\beta,\gamma\}^* fails to be a combing can be seen back when we turned \overline{\gamma\beta\alpha\gamma\alpha\beta\gamma} into \overline{\alpha^2\beta^2\gamma^5}. To move \alpha‘s on the right to the left we had to move past \beta‘s and produce \gamma‘s in the process. More concretely, fix m \in \mathbb{N} and let h = \overline{\beta^m} and g = \overline{\beta^m \alpha} = \overline{\alpha \beta^m\gamma^m}. We have \sigma_h = \beta^m and \sigma_g = \alpha \beta^m \gamma^m. The group elements g and h differ by a generator. Thus, if \sigma were a combing we should be able to uniformly bound d_{H_3}(\sigma_g(t),\sigma_h(t)) for all t \in \mathbb{N} and all m \in \mathbb{N}.

If we then let t = m+1, we can recall that

d_{H_3}(\sigma_g(t),\sigma_h(t)) = \min\{l(w) \mid w \in \{\alpha,\beta,\gamma\}^*, \overline{w} = \sigma_g(t)^{-1}\sigma_h(t)\}.

We have that \sigma_h(t) = \overline{\beta^m} and \sigma_g(t) = \overline{\alpha \beta^m} and thus

\sigma_g(t)^{-1}\sigma_h(t)  = (\overline{\alpha\beta^m})^{-1}\overline{\beta^m} = \overline{\beta^{-m}\alpha^{-1}\beta^m} = \overline{\alpha^{-1}\beta^{-m}\gamma^{m}\beta^m}=\overline{\alpha^{-1}\gamma^{m}}.

The group element \overline{\alpha^{-1}\gamma^{m}} cannot be represented by a shorter word in \{\alpha,\beta,\gamma\}^* and thus d_{H_3}(\sigma_g(t),\sigma_h(t)) = m+1. Since this grows with m, the map \sigma is not a combing.

Can we comb the Heisenberg group?

This leaves us with a question: can we comb the group H_3? It turns out that we can but the answer actually lies in finding a better combing of \mathbb{Z}^2. This is because H_3 contains the subgroup \mathbb{Z}^2 \cong \langle \beta, \gamma \rangle \subseteq H_3. Rather than using the normal form \sigma_h = \alpha^p \beta^m \gamma^n, we will use \sigma'_h = \alpha^p \tau_{(m,n)} where \tau : \mathbb{Z}^2 \rightarrow \{\beta,\gamma \}^* is a combing of \mathbb{Z}^2 that is more symmetric. The word \tau_{(m,n)} is defined to be the sequence of m \beta‘s and n \gamma‘s that stays closest to the straight line in \mathbb{R}^2 that joins (0,0) to (m,n) (when we view \beta and \gamma as representing (1,0) and (0,1) respectively). Below is an illustration:

This new function isn’t quite a combing of H_3 but it is the next best thing! It is an asynchronous combing. An asynchronous combing is one where we again require that the paths \sigma_h,\sigma_g stay close to each other whenever h and g are close to each other. However we allow the paths \sigma_h and \sigma_g to travel at different speeds. Many of the results that can be proved for combable groups extend to asynchronously combable groups.

References

Hairdressing in Groups by Sarah Rees is a survey paper that includes lots of examples of groups that do or do not admit combings. It also talks about the language complexity of a combing, something I didn’t have time to touch on in my talk.

Combings of Semidirect Products and 3-Manifold Groups by Martin Bridson contains a proof that H_3 is asynchronously combable. He actually proves the more general result that any group of the form \mathbb{Z}^n \rtimes \mathbb{Z}^m is asynchronously combable.

Thank you to my supervisor, Tony Licata, for suggesting I give my talk on combing H_3 and for all the support he has given me so far.