Looking back on the blog

Next week I’ll be starting the second year of my statistics PhD. I’ve learnt a lot from taking the first year classes and studying for the qualifying exams. Some of what I’ve learnt has given me a new perspective on some of my old blog posts. Here are three things that I’ve written about before and I now understand better.

1. The Pi-Lambda Theorem

An early post on this blog was titled “A minimal counterexample in probability theory“. The post was about a theorem from the probability course offered at the Australian National University. The theorem states that if two probability measures agree on a collection of subsets and the collection is closed under finite intersections, then the two probability measures agree on the \sigma-algebra generated by the collection. In my post I give an example which shows that you need the collection to be closed under finite intersections. I also show that you need to have at least four points in the space to find such an example.

What I didn’t know then is that the above theorem is really a corollary of Dykin’s \pi - \lambda theorem. This theorem was proved in my first graduate probability course which was taught by Persi Diaconis. Professor Diaconis kept a running tally of how many times we used the \pi - \lambda theorem in his course and we got up to at least 10. (For more appearances by Professor Diaconis on this blog see here, here and here).

If I were to write the above post again, I would talk about the \pi - \lambda theorem and rename the post “The smallest \lambda-system”. The example given in my post is really about needing at least four points to find a \lambda-system that is not a \sigma-algebra.

2. Mallow’s Cp statistic

The very first blog post I wrote was called “Complexity Penalties in Statistical Learning“. I wasn’t sure if I would write a second and so I didn’t set up a WordPress account. I instead put the post on the website LessWrong. I no longer associate myself with the rationality community but posting to LessWrong was straight forward and helped me reach more people.

The post was inspired in two ways by the 2019 AMSI summer school. First, the content is from the statistical learning course I took at the summer school. Second, at the career fair many employers advised us to work on our writing skills. I don’t know if would have started blogging if not for the AMSI Summer School.

I didn’t know it at the time but the blog post is really about Mallow’s Cp statistic. Mallow’s Cp statistic is an estimate of the test error of a regression model fit using ordinary least squares. The Mallow’s Cp is equal to the training error plus a “complexity penalty” which takes into account the number of parameters. In the blog post I talk about model complexity and over-fitting. I also write down and explain Mallow’s Cp in the special case of polynomial regression.

In the summer school course I took, I don’t remember the name Mallow’s Cp being used but I thought it was a great idea and enjoyed writing about it. The next time encountered Mallow’s Cp was in the linear models course I took last fall. I was delighted to see it again and learn how it fit into a bigger picture. More recently, I read Brad Efron’s paper “How Biased is the Apparent Error Rate of a Prediction Rule?“. The paper introduces the idea of “effective degrees of freedom” and expands on the ideas behind the Cp statistic.

Incidentally, enrolment is now open for the 2023 AMSI Summer School! This summer it will be hosted at the University of Melbourne. I encourage any Australia mathematics or statistics students reading this to take a look and apply. I really enjoyed going in both 2019 and 2020. (Also if you click on the above link you can try to spot me in the group photo of everyone wearing read shirts!)

3. Finitely additive probability measures

In “Finitely additive measures” I talk about how hard it is to define a finitely additive measure on the set of natural numbers that is not countably additive. In particular, I talked about needing to use the Hahn — Banach extension theorem to extend the natural density from the collection of sets with density to the collection of all subsets of the natural numbers.

There were a number of homework problems in my first graduate probability course that relate to this post. We proved that the sets with density are not closed under finite unions and we showed that the square free numbers have density 6/\pi^2.

We also proved that any finite measure defined on an algebra of subsets can be extend to the collection of all subsets. This proof used Zorn’s lemma and the resulting measure is far from unique. The use of Zorn’s lemma relates to the main idea in my blog, that defining an additive probability measure is in some sense non-constructive.

Other posts

Going forward, I hope to continue publishing at least one new post every month. I look forward to one day writing another post like this when I can look back and reflect on how much I have learnt.

Non-measurable sets, cardinality and the axiom of choice

The following post is based on a talk I gave at the 2022 Stanford statistics retreat. The talk was titled “Another non-measurable monster”.

The opening slide from my presentation. The test reads “another non-measurable monster”. There are spooky pumpkins in the background. Image from https://www.shutterstock.com/image-photo/pumpkins-graveyard-spooky-night-halloween-backdrop-1803950377

The material was based on the discussion and references given in this stackexchange post. The title is a reference to a Halloween lecture on measurability given by Professor Persi Diaconis.

What’s scarier than a non-measurable set?

Making every set measurable. Or rather one particular consequence of making every set measurable.

In my talk, I argued that if you make every set measurable, then there exists a set \Omega and an equivalence relation \sim on \Omega such that |\Omega| < |\Omega / \sim|. That is, the set \Omega has strictly smaller cardinality than the set of equivalence classes \Omega/\sim. The contradictory nature of this statement is illustrated in the picture below

We can think of the set \Omega as the collection of crosses drawn above. The equivalence relation \sim divides \Omega into the regions drawn above. The statement |\Omega|<|\Omega /\sim| means that in some sense there are more regions than crosses.

To make sense of this we’ll first have to be a bit more precise about what we mean by cardinality.

What do we mean by bigger and smaller?

Let A and B be two sets. We say that A and B have the same cardinality and write |A| = |B| if there exists a bijection function f:A \to B. We can think of the function f as a way of pairing each element of A with a unique element of B such that every element of B is paired with an element of A.

We next want to define |A|\le |B| which means A has cardinality at most the cardinality of B. There are two reasonable ways in which we could try to define this relationship

  1. We could say |A|\le |B| means that there exists an injective function f : A \to B.
  2. Alternatively, we could |A|\le |B| means that there exists a surjective function g:B \to A.

Definitions 1 and 2 say similar things and, in the presence of the axiom of choice, they are equivalent. Since we are going to be making every set measurable in this talk, we won’t be assuming the axiom of choice. Definitions 1 and 2 are thus no longer equivalent and we have a decision to make. We will use definition . in this talk. For justification, note that definition 1 implies that there exists a subset B' \subseteq B such that |A|=|B|. We simply take B' to be the range of f. This is a desirable property of the relation |A|\le |B| and it’s not clear how this could be done using definition 2.

Infinite binary sequences

It’s time to introduce the set \Omega and the equivalence relation we will be working with. The set \Omega is the set \{0,1\}^\mathbb{Z} the set of all function \omega : \mathbb{Z} \to \{0,1\}. We can think of each elements \omega \in \Omega as an infinite sequence of zeros and ones stretching off in both directions. For example

\omega = \ldots 1110110100111\ldots.

But this analogy hides something important. Each \omega \in \Omega has a “middle” which is the point \omega_0. For instance, the two sequences below look the same but when we make \omega_0 bold we see that they are different.

\omega = \ldots 111011\mathbf{0}100111\ldots,

\omega' = \ldots 1110110\mathbf{1}00111\ldots.

The equivalence relation \sim on \Omega can be thought of as forgetting the location \omega_0. More formally we have \omega \sim \omega' if and only if there exists n \in \mathbb{Z} such that \omega_{n+k} = \omega_{k}' for all k \in \mathbb{Z}. That is, if we shift the sequence \omega by n we get the sequence \omega'. We will use [\omega] to denote the equivalence class of \omega and \Omega/\sim for the set of all equivalences classes.

Some probability

Associated with the space \Omega are functions X_k : \Omega \to \{0,1\}, one for each integer k \in \mathbb{Z}. These functions simply evaluate \omega at k. That is X_k(\omega)=\omega_k. A probabilist or statistician would think of X_k as reporting the result of one of infinitely many independent coin tosses. Normally to make this formal we would have to first define a \sigma-algebra on \Omega and then define a probability on this \sigma-algebra. Today we’re working in a world where every set is measurable and so don’t have to worry about \sigma-algebras. Indeed we have the following result:

(Solovay, 1970)1 There exists a model of the Zermelo Fraenkel axioms of set theory such that there exists a probability \mathbb{P} defined on all subsets of \Omega such that X_k are i.i.d. \mathrm{Bernoulli}(0.5).

This result is saying that there is world in which, other than the axiom of choice, all the regular axioms of set theory holds. And in this world, we can assign a probability to every subset A \subseteq \Omega in a way so that the events \{X_k=1\} are all independent and have probability 0.5. It’s important to note that this is a true countably additive probability and we can apply all our familiar probability results to \mathbb{P}. We are now ready to state and prove the spooky result claimed at the start of this talk.

Proposition: Given the existence of such a probability \mathbb{P}, |\Omega | < |\Omega /\sim|.

Proof: Let f:\Omega/\sim \to \Omega be any function. To show that |\Omega|<|\Omega /\sim| we need to show that f is not injective. To do this, we’ll first define another function g:\Omega \to \Omega given by g(\omega)=f([\omega]). That is, g first maps \omega to \omega‘s equivalence class and then applies f to this equivalence class. This is illustrated below.

A commutative diagram showing the definition of g as g(\omega)=f([\omega]).

We will show that g : \Omega \to \Omega is almost surely constant with respect to \mathbb{P}. That is, there exists \omega^\star \in \Omega such that \mathbb{P}(g(\omega)=\omega^\star)=1. Each equivalence class [\omega] is finite or countable and thus has probability zero under \mathbb{P}. This means that if g is almost surely constant, then f cannot be injective and must map multiple (in fact infinitely many) equivalence classes to \omega^\star.

It thus remains to show that g:\Omega  \to \Omega is almost surely constant. To do this we will introduce a third function \varphi : \Omega \to \Omega. The map \varphi is simply the shift map and is given by \varphi(\omega)_k = \omega_{k+1}. Note that \omega and \varphi(\omega) are in the same equivalence class for every \omega\in \Omega. Thus, the map g satisfies g\circ \varphi = g. That is g is \varphi-invariant.

The map \varphi is ergodic. This means that if A \subseteq \Omega satisfies \varphi(A)=A, then \mathbb{P}(A) equals 0 or 1. For example if A is the event that 10110 appears at some point in \omega, then \varphi(A)=A and \mathbb{P}(A)=`1. Likewise if A is the event that the relative frequency of heads converges to a number strictly greater than 0.5, then \varphi(A)=A and \mathbb{P}(A)=0. The general claim that all \varphi-invariant events have probability 0 or 1 can be proved using the independence of X_k.

For each k, define an event A_k by A_k = \{\omega : g(\omega)_k = 1\}. Since g is \varphi-invariant we have that \varphi(A_k)=A_k. Thus, \mathbb{P}(A_k)=0 or 1. This gives us a function \omega^\star :\mathbb{Z} \to \{0,1\} given by \omega^\star_k = \mathbb{P}(A_k). Note that for every k, \mathbb{P}(\{\omega : g(\omega)_k = \omega_k^\star\}) = 1. This is because if w_{k}^\star=1, then \mathbb{P}(\{\omega: g(\omega)_k = 1\})=1, by definition of w_k^\star. Likewise if \omega_k^\star =0, then \mathbb{P}(\{\omega:g(\omega)_k=1\})=0 and hence \mathbb{P}(\{\omega:g(\omega)_k=0\})=1. Thus, in both cases, \mathbb{P}(\{\omega : g(\omega)_k = \omega_k^*\})= 1.

Since \mathbb{P} is a probability measure, we can conclude that

\mathbb{P}(\{\omega : g(\omega)=\omega^\star\}) = \mathbb{P}\left(\bigcap_{k \in \mathbb{Z}} \{\omega : g(\omega)_k = \omega_k^\star\}\right)=1.

Thus, g map \Omega to \omega^\star with probability one. Showing that g is almost surely constant and hence that f is not injective. \square

There’s a catch!

So we have proved that there cannot be an injective map f : \Omega/\sim \to \Omega. Does this mean we have proved |\Omega| < |\Omega/\sim|? Technically no. We have proved the negation of |\Omega/\sim|\le |\Omega| which does not imply |\Omega| \le |\Omega/\sim|. To argue that |\Omega| < |\Omega/\sim| we need to produce a map g: \Omega \to \Omega/\sim that is injective. Surprising this is possible and not too difficult. The idea is to find a map g : \Omega \to \Omega such that g(\omega)\sim g(\omega') implies that \omega = \omega'. This can be done by somehow encoding in g(\omega) where the centre of \omega is.

A simpler proof and other examples

Our proof was nice because we explicitly calculated the value \omega^\star where g sent almost all of \Omega. We could have been less explicit and simply noted that the function g:\Omega \to \Omega was measurable with respect to the invariant \sigma-algebra of \varphi and hence almost surely constant by the ergodicity of \varphi.

This quicker proof allows us to generalise our “spooky result” to other sets. Below are two examples where \Omega = [0,1)

  • Fix \theta \in [0,1)\setminus \mathbb{Q} and define \omega \sim \omega' if and only if \omega + n \theta= \omega' for some n \in \mathbb{Z}.
  • \omega \sim \omega' if and only if \omega - \omega' \in \mathbb{Q}.

A similar argument can be used to show that in Solovay’s world |\Omega| < |\Omega/\sim|. The exact same argument follows from the ergodicity of the corresponding actions on \Omega under the uniform measure.

Three takeaways

I hope you agree that this example is good fun and surprising. I’d like to end with some remarks.

  • The first remark is some mathematical context. This argument given today is linked to some interesting mathematics called descriptive set theory. This field studies the properties of well behaved subsets (such as Borel subsets) of topological spaces. Descriptive set theory incorporates logic, topology and ergodic theory. I don’t know much about the field but in Persi’s Halloween talk he said that one “monster” was that few people are interested in the subject.
  • The next remark is a better way to think about our “spooky result”. The result is really saying something about cardinality. When we no longer use the axiom of choice, cardinality becomes a subtle concept. The statement |A|\le |B| no longer corresponds to A being “smaller” than B but rather that A is “less complex” than B. This is perhaps analogous to some statistical models which may be “large” but do not overfit due to subtle constraints on the model complexity.
  • In light of the previous remark, I would invite you to think about whether the example I gave is truly spookier than non-measurable sets. It might seem to you that it is simply a reasonable consequence of removing the axiom of choice and restricting ourselves to functions we could actually write down or understand. I’ll let you decide

Footnotes

  1. Technically Solovay proved that there exists a model of set theory such that every subset of \mathbb{R} is Borel measurable. To get the result for binary sequences we have to restrict to [0,1) and use the binary expansion of x \in [0,1) to define a function [0,1) \to \Omega. Solvay’s paper is available here https://www.jstor.org/stable/1970696?seq=1

Double cosets and contingency tables

Like my previous post, this blog is also motivated by a comment by Professor Persi Diaconis in his recent Stanford probability seminar. The seminar was about a way of “collapsing” a random walk on a group to a random walk on the set of double cosets. In this post, I’ll first define double cosets and then go over the example Professor Diaconis used to make us probabilists and statisticians more comfortable with all the group theory he was discussing.

Double cosets

Let G be a group and let H and K be two subgroups of G. For each g \in G, the (H,K)-double coset containing g is defined to be the set

HgK = \{hgk : h \in H, k \in K\}

To simplify notation, we will simply write double coset instead of (H,K)-double coset. The double coset of g can also be defined as the equivalence class of g under the relation

g \sim g' \Longleftrightarrow g' = hgk for some h \in H and g \in G

Like regular cosets, the above relation is indeed an equivalence relation. Thus, the group G can be written as a disjoint union of double cosets. The set of all double cosets of G is denoted by H\backslash G / K. That is,

H\backslash G /K = \{HgK : g \in G\}

Note that if we take H = \{e\}, the trivial subgroup, then the double cosets are simply the left cosets of K, G / K. Likewise if K = \{e\}, then the double cosets are the right cosets of H, H \backslash G. Thus, double cosets generalise both left and right cosets.

Double cosets in S_n

Fix a natural number n > 0. A partition of n is a finite sequence a = (a_1,a_2,\ldots, a_I) such that a_1,a_2,\ldots, a_I \in \mathbb{N}, a_1 \ge a_2 \ge \ldots a_I > 0 and \sum_{i=1}^I a_i = n. For each partition of n, a, we can form a subgroup S_a of the symmetric group S_n. The subgroup S_a contains all permutations \sigma \in S_n such that \sigma fixes the sets A_1 = \{1,\ldots, a_1\}, A_2 = \{a_1+1,\ldots, a_1 + a_2\},\ldots, A_I = \{n - a_I +1,\ldots, n\}. Meaning that \sigma(A_i) = A_i for all 1 \le i \le I. Thus, a permutation \sigma \in S_a must individually permute the elements of A_1, the elements of A_2 and so on. This means that, in a natural way,

S_a \cong S_{a_1} \times S_{a_2} \times \ldots \times S_{a_I}

If we have two partitions a = (a_1,a_2,\ldots, a_I) and b = (b_1,b_2,\ldots,b_J), then we can form two subgroups H = S_a and K = S_b and consider the double cosets H \backslash G / K = S_a \backslash S_n / S_b. The claim made in the seminar was that the double cosets are in one-to-one correspondence with I \times J contingency tables with row sums equal to a and column sums equal to b. Before we explain this correspondence and properly define contingency tables, let’s first consider the cases when either H or K is the trivial subgroup.

Left cosets in S_n

Note that if a = (1,1,\ldots,1), then S_a is the trivial subgroup and, as noted above, S_a \backslash S_n / S_b is simply equal to S_n / S_b. We will see that the cosets in S_n/S_b can be described by forgetting something about the permutations in S_n.

We can think of the permutations in S_n as all the ways of drawing without replacement n balls labelled 1,2,\ldots, n. We can think of the partition b = (b_1,b_2,\ldots,b_J) as a colouring of the n balls by J colours. We colour balls 1,2,\ldots, b_1 by the first colour c_1, then we colour b_1+1,b_1+2,\ldots, b_1+b_2 the second colour c_2 and so on until we colour n-b_J+1, n-b_J+2,\ldots, n the final colour c_J. Below is an example when n is equal to 6 and b = (3,2,1).

The first three balls are coloured green, the next two are coloured red and the last ball is coloured blue.

Note that a permutation \sigma \in S_n is in S_b if and only if we draw the balls by colour groups, i.e. we first draw all the balls with colour c_1, then we draw all the balls with colour c_2 and so on. Thus, continuing with the previous example, the permutation \sigma_1 below is in S_b but \sigma_2 is not in S_b.

The permutation \sigma_1 is in S_b because the colours are in their original order but \sigma_1 is not in S_b because the colours are rearranged.

It turns out that we can think of the cosets in S_n \setminus S_b as what happens when we “forget” the labels 1,2,\ldots,n and only remember the colours of the balls. By “forgetting” the labels we mean only paying attention to the list of colours. That is for all \sigma_1,\sigma_2 \in S_n, \sigma_1 S_b = \sigma_2 S_b if and only if the list of colours from the draw \sigma_1 is the same as the list of colours from the draw \sigma_2. Thus, the below two permutations define the same coset of S_b

When we forget the labels and only remember the colours, the permutations \sigma_1 and \sigma_2 look the same and thus are in the same left coset of S_b.

To see why this is true, note that \sigma_1 S_b = \sigma_2 S_b if and only if \sigma_2 = \sigma_1 \circ \tau for some \tau \in S_b. Furthermore, \tau \in S_b if and only if \tau maps each colour group to itself. Recall that function composition is read right to left. Thus, the equation \sigma_2 = \sigma_1 \circ \tau means that if we first relabel the balls according to \tau and then draw the balls according to \sigma_1, then we get the result as just drawing by \sigma_2. That is, \sigma_2 = \sigma_1 \circ \tau for some \tau \in S_b if and only if drawing by \sigma_2 is the same as first relabelling the balls within each colour group and then drawing the balls according to \sigma_1. Thus, \sigma_1 S_b = \sigma_2 S_b, if and only if when we forget the labels of the balls and only look at the colours, \sigma_1 and \sigma_2 give the same list of colours. This is illustrated with our running example below.

If we permute the balls according to \tau \in S_b and the draw according to \sigma_1, then the resulting draw is the same as if we had not permuted and drawn according to \sigma_2. That is, \sigma_2 = \sigma_1 \circ \tau.

Right cosets of S_n

Typically, the subgroup S_a is not a normal subgroup of S_n. This means that the right coset S_a \sigma will not equal the left coset \sigma S_a. Thus, colouring the balls and forgetting the labelling won’t describe the right cosets S_a \backslash S_n. We’ll see that a different type of forgetting can be used to describe S_a \backslash S_n = \{S_a\sigma : \sigma \in S_n\}.

Fix a partition a = (a_1,a_2,\ldots,a_I) and now, instead of considering I colours, think of I different people p_1,p_2,\ldots,p_I. As before, a permutation \sigma \in S_n can be thought of drawing n balls labelled 1,\ldots,n without replacement. We can imagine giving the first a_1 balls drawn to person p_1, then giving the next a_2 balls to the person p_2 and so on until we give the last a_I balls to person p_I. An example with n = 6 and a = (2,2,2) is drawn below.

Person p_1 receives the ball labelled by 6 followed by the ball labelled 3, person p_2 receives ball 2 and then ball 1 and finally person p_3 receives ball 4 followed by ball 5.

Note that \sigma \in S_a if and only if person p_i receives the balls with labels \sum_{k=0}^{i-1}a_k+1,\sum_{k=0}^{i-1}a_k+2,\ldots, \sum_{k=0}^i a_k in any order. Thus, in the below example \sigma_1 \in S_a but \sigma_2 \notin S_a.

When the balls are drawn according to \sigma_1, person p_i receives the balls with labels 2i-1 and 2i, and thus \sigma_1 \in S_a. On the other hand, if the balls are drawn according to \sigma_2, the people receive different balls and thus \sigma_2 \notin S_a.

It turns out the cosets S_a \backslash S_n are exactly determined by “forgetting” the order in which each person p_i received their balls and only remembering which balls they received. Thus, the two permutation below belong to the same coset in S_a \backslash S_n.

When we forget the order in which each person receive their balls, the permutations \sigma_1 and \sigma_2 become the same and thus S_a \sigma_1 = S_a \sigma_2. Note that if we coloured the balls according to the permutation a = (a_1,\ldots,a_I), then we could see that \sigma_1 S_a \neq \sigma_2 S_a.

To see why this is true in general, consider two permutations \sigma_1,\sigma_2. The permutations \sigma_1,\sigma_2 result in each person p_i receiving the same balls if and only if after \sigma_1 we can apply a permutation \tau that fixes each subset A_i  =  \left\{\sum_{k=0}^{i-1}a_k+1,\ldots, \sum_{k=0}^i a_k\right\} and get \sigma_2. That is, \sigma_1 and \sigma_2 result in each person p_i receiving the same balls if and only if \sigma_2 = \tau \circ \sigma_1 for some \tau \in S_a. Thus, \sigma_1,\sigma_2 are the same after forgetting the order in which each person received their balls if and only if S_a \sigma_1 = S_a \sigma_2. This is illustrated below,

If we first draw the balls according to \sigma_1 and then permute the balls according to \tau, then the resulting draw is the same as if we had drawn according to \sigma_2 and not permuted afterwards. That is, \sigma_2 = \tau \circ \sigma_1 .

We can thus see why S_a \backslash S_n \neq S_n / S_a. A left coset \sigma S_a correspond to pre-composing \sigma with elements of S_a and a right cosets S_a\sigma correspond to post-composing \sigma with elements of S_a.

Contingency tables

With the last two sections under our belts, describing the double cosets S_a \backslash S_n / S_b is straight forward. We simply have to combine our two types of forgetting. That is, we first colour the n balls with J colours according to b = (b_1,b_2,\ldots,b_J). We then draw the balls without replace and give the balls to I different people according a = (a_1,a_2,\ldots,a_I). We then forget both the original labels and the order in which each person received their balls. That is, we only remember the number of balls of each colour each person receives. Describing the double cosets by “double forgetting” is illustrated below with a = (2,2,2) and b = (3,2,1).

The permutations \sigma_1 and \sigma_2 both result in person p_1 receiving one green ball and one blue ball. The two permutations also results in p_2 and p_3 both receiving one green ball and one red ball. Thus, \sigma_1 and \sigma_2 are both in the same (S_a,S_b)-double coset. Note however that \sigma_1 S_b \neq \sigma_2 S_b and S_a\sigma_1 \neq S_a \sigma_2.

The proof that double forgetting does indeed describe the double cosets is simply a combination of the two arguments given above. After double forgetting, the number of balls given to each person can be recorded in an I \times J table. The N_{ij} entry of the table is simply the number of balls person p_i receives of colour c_j. Two permutations are the same after double forgetting if and only if they produce the same table. For example, \sigma_1 and \sigma_2 above both produce the following table

Green (c_1) Red (c_2)Blue (c_3)Total
Person p_11012
Person p_21102
Person p_31102
Total3216

By the definition of how the balls are coloured and distributed to each person we must have for all 1 \le i \le I and 1 \le j \le J

\sum_{j=1}^J N_{ij} = a_i and \sum_{i=1}^I N_{ij} = b_j

An I \times J table with entries N_{ij} \in \{0,1,2,\ldots\} satisfying the above conditions is called a contingency table. Given such a contingency table with entries N_{ij} where the rows sum to a and the columns sum to b, there always exists at least one permutation \sigma such that N_{ij} is the number of balls received by person p_i of colour c_i. We have already seen that two permutations produce the same table if and only if they are in the same double coset. Thus, the double cosets S_a \backslash S_n /S_b are in one-to-one correspondence with such contingency tables.

The hypergeometric distribution

I would like to end this blog post with a little bit of probability and relate the contingency tables above to the hyper geometric distribution. If a = (m, n-m) for some m \in \{0,1,\ldots,n\}, then the contingency tables described above have two rows and are determined by the values N_{11}, N_{12},\ldots, N_{1J} in the first row. The numbers N_{1j} are the number of balls of colour c_j the first person receives. Since the balls are drawn without replacement, this means that if we put the uniform distribution on S_n, then the vector Y=(N_{11}, N_{12},\ldots, N_{1J}) follows the multivariate hypergeometric distribution. Thus, if we have a random walk on S_n that quickly converges to the uniform distribution on S_n, then we could use the double cosets to get a random walk that converges to the multivariate hypergeometric distribution (although there are smarter ways to do such sampling).

A not so simple conditional expectation

It is winter 2022 and my PhD cohort has moved on the second quarter of our first year statistics courses. This means we’ll be learning about generalised linear models in our applied course, asymptotic statistics in our theory course and conditional expectations and martingales in our probability course.

In the first week of our probability course we’ve been busy defining and proving the existence of the conditional expectation. Our approach has been similar to how we constructed the Lebesgue integral in the previous course. Last quarter, we first defined the Lebesgue integral for simple functions, then we used a limiting argument to define the Lebesgue integral for non-negative functions and then finally we defined the Lebesgue integral for arbitrary functions by considering their positive and negative parts.

Our approach to the conditional expectation has been similar but the journey has been different. We again started with simple random variables, then progressed to non-negative random variables and then proved the existence of the conditional expectation of any arbitrary integrable random variable. Unlike the Lebesgue integral, the hardest step was proving the existence of the conditional expectation of a simple random variable. Progressing from simple random variables to arbitrary random variables was a straight forward application of the monotone convergence theorem and linearity of expectation. But to prove the existence of the conditional expectation of a simple random variable we needed to work with projections in the Hilbert space L^2(\Omega, \mathbb{P},\mathcal{F}).

Unlike the Lebesgue integral, defining the conditional expectation of a simple random variable is not straight forward. One reason for this is that the conditional expectation of a random variable need not be a simple random variable. This comment was made off hand by our Professor and sparked my curiosity. The following example is what I came up with. Below I first go over some definitions and then we dive into the example.

A simple random variable with a conditional expectation that is not simple

Let (\Omega, \mathbb{P}, \mathcal{F}) be a probability space and let \mathcal{G} \subseteq \mathcal{F} be a sub-\sigma-algebra. The conditional expectation of an integrable random variable X is a random variable \mathbb{E}(X|\mathcal{G}) that satisfies the following two conditions:

  1. The random variable \mathbb{E}(X|\mathcal{G}) is \mathcal{G}-measurable.
  2. For all B \in \mathcal{G}, \mathbb{E}[X1_B] = \mathbb{E}[\mathbb{E}(X|\mathcal{G})1_B], where 1_B is the indicator function of B.

The conditional expectation of an integrable random variable is unique and always exists. One can think of \mathbb{E}(X|\mathcal{G}) as the expected value of X given the information in \mathcal{G}.

A simple random variable is a random variable X that take only finitely many values. Simple random variables are always integrable and so \mathbb{E}(X|\mathcal{G}) always exists but we will see that \mathbb{E}(X|\mathcal{G}) need not be simple.

Consider a random vector (U,V) uniformly distributed on the square [-1,1]^2 \subseteq \mathbb{R}^2. Let D be the unit disc D = \{(u,v) \in \mathbb{R}^2:u^2+v^2 \le 1\}. The random variable X = 1_D(U,V) is a simple random variable since X equals 1 if (U,V) \in D and X equals 0 otherwise. Let \mathcal{G} = \sigma(U) the \sigma-algebra generated by U. It turns out that

\mathbb{E}(X|\mathcal{G}) = \sqrt{1-U^2}.

Thus \mathbb{E}(X|\mathcal{G}) is not a simple random variable. Let Y = \sqrt{1-U^2}. Since Y is a continuous function of U, the random variable is \mathcal{G}-measurable. Thus Y satisfies condition 1. Furthermore if B \in \mathcal{G}, then B = \{U \in A\} for some measurable set A\subseteq [-1,1]. Thus X1_B equals 1 if and only if U \in A and V \in [-\sqrt{1-U^2}, \sqrt{1-U^2}]. Since (U,V) is uniformly distributed we thus have

\mathbb{E}[X1_B] = \int_A \int_{-\sqrt{1-u^2}}^{\sqrt{1-u^2}} \frac{1}{4}dvdu = \int_A \frac{1}{2}\sqrt{1-u^2}du.

The random variable U is uniformly distributed on [-1,1] and thus has density \frac{1}{2}1_{[-1,1]}. Therefore,

\mathbb{E}[Y1_B] = \mathbb{E}[\sqrt{1-U^2}1_{\{U \in A\}}] = \int_A \frac{1}{2}\sqrt{1-u^2}du.

Thus \mathbb{E}[X1_B] = \mathbb{E}[Y1_B] and therefore Y = \sqrt{1-U^2} equals \mathbb{E}(X|\mathcal{G}). Intuitively we can see this because given U=u, we know that X is 1 when V \in [-\sqrt{1-u^2},\sqrt{1+u^2}] and that X is 0 otherwise. Since V is uniformly distributed on [-1,1] the probability that V is in [-\sqrt{1-u^2},\sqrt{1+u^2}] is \sqrt{1-u^2}. Thus given U=u, the expected value of X is \sqrt{1-u^2}.

An extension

The previous example suggests an extension that shows just how “complicated” the conditional expectation of a simple random variable can be. I’ll state the extension as an exercise:

Let f:[-1,1]\to \mathbb{R} be any continuous function with f(x) \in [0,1]. With (U,V) and \mathcal{G} as above show that there exists a measurable set A \subseteq [-1,1]^2 such that \mathbb{E}(1_A|\mathcal{G}) = f(U).

Extremal couplings

This post is inspired by an assignment question I had to answer for STATS 310A – a probability course at Stanford for first year students in the statistics PhD program. In the question we had to derive a few results about couplings. I found myself thinking and talking about the question long after submitting the assignment and decided to put my thoughts on paper. I would like to thank our lecturer Prof. Diaconis for answering my questions and pointing me in the right direction.

What are couplings?

Given two distribution functions F and G on \mathbb{R}, a coupling of F and G is a distribution function H on \mathbb{R}^2 such that the marginals of H are F and G. Couplings can be used to give probabilistic proofs of analytic statements about F and G (see here). Couplings are also are studied in their own right in the theory optimal transport.

We can think of F and G as being the cumulative distribution functions of some random variables X and Y. A coupling H of F and G thus corresponds to a random vector (\widetilde{X},\widetilde{Y}) where \widetilde{X} has the same distribution as X, \widetilde{Y} has the same distribution as Y and (\widetilde{X},\widetilde{Y})  \sim H.

The independent coupling

For two given distributions function F and G there exist many possible couplings. For example we could take H = H_I where H_I(x,y) = F(x)G(y). This coupling corresponds to a random vector (\widetilde{X}_I,\widetilde{Y}_I) where \widetilde{X}_I and \widetilde{Y}_I are independent and (as is required for all couplings) \widetilde{X}_I \stackrel{\text{dist}}{=} X, \widetilde{Y}_I \stackrel{\text{dist}}{=} Y.

In some sense the coupling H_I is in the “middle” of all couplings. This is because \widetilde{X} and \widetilde{Y} are independent and so \widetilde{X} doesn’t carry any information about \widetilde{Y}. As the title of the post suggests, there are couplings were this isn’t the case and \widetilde{X} carries “as much information as possible” about \widetilde{Y}.

The two extremal couplings

Define two function H_L, H_U :\mathbb{R}^2 \to [0,1] by

H_U(x,y) = \min\{F(x), G(y)\} and H_L(x,y) = \max\{F(x)+G(y) - 1, 0\}.

With some work, one can show that H_L and H_U are distributions functions on \mathbb{R}^2 and that they have the correct marginals. In this post I would like to talk about how to construct random vectors (\widetilde{X}_U, \widetilde{Y}_U) \sim H_U and (\widetilde{X}_L, \widetilde{Y}_L) \sim H_L.

Let F^{-1} and G^{-1} be the quantile functions of F and G. That is,

F^{-1}(c) = \inf\{ x \in \mathbb{R} : F(x) \ge c\} and G^{-1}(c) = \inf\{ x \in \mathbb{R} : G(x) \ge c\}.

Now let V be a random variable that is uniformly distributed on [0,1] and define

\widetilde{X}_U = F^{-1}(V) and \widetilde{Y}_U = G^{-1}(V).

Since F^{-1}(V) \le x if and only if V \le F(x), we have \widetilde{X}_U \stackrel{\text{dist}}{=} X and likewise \widetilde{Y}_U \stackrel{\text{dist}}{=} Y. Furthermore \widetilde{X}_U \le x, \widetilde{Y}_U \le y occurs if and only if V \le F(x), V \le G(y) which is equivalent to V \le \min\{F(x),G(y)\}. Thus

\mathbb{P}(\widetilde{X}_U \le x, \widetilde{Y}_U \le y) = \mathbb{P}(V \le \min\{F(x),G(y)\})= \min\{F(x),G(y)\}.

Thus (\widetilde{X}_U,\widetilde{Y}_U) is distributed according to H_U. We see that under the coupling H_U, \widetilde{X}_U and \widetilde{Y}_U are closely related as they are both increasing functions of a common random variable V.

We can follow a similar construction for H_L. Define

\widetilde{X}_L = F^{-1}(V) and \widetilde{Y}_L = G^{-1}(1-V).

Thus \widetilde{X}_L and \widetilde{Y}_L are again functions of a common random variable V but \widetilde{X}_L is an increasing function of V and \widetilde{Y}_L is a decreasing function of V. Note that 1-V is also uniformly distributed on [0,1]. Thus \widetilde{X}_L \stackrel{\text{dist}}{=} X and \widetilde{Y}_L \stackrel{\text{dist}}{=} Y.

Now \widetilde{X}_L \le x, \widetilde{Y}_L \le y occurs if and only if V \le F(x) and 1-V \le G(y) which occurs if and only if 1-G(y) \le V \le F(x). If 1-G(y) \le F(x), then F(x)+G(y)-1 \ge 0 and \mathbb{P}(1-G(y) \le V \le F(x)) =F(x)+G(y)-1. On the other hand, if 1 - G(y) > F(x), then F(x)+G(y)-1< 0 and \mathbb{P}(1-G(y) \le V \le F(x))=0. Thus

\mathbb{P}(\widetilde{X}_L \le x, \widetilde{Y}_L \le y) = \mathbb{P}(1-G(y) \le V \le F(x)) = \max\{F(x)+G(y)-1,0\},

and so (\widetilde{X}_L,\widetilde{Y}_L) is distributed according to H_L.

What makes H_U and H_L extreme?

Now that we know that H_U and H_L are indeed couplings, it is natural to ask what makes them “extreme”. What we would like to say is that \widetilde{Y}_U is an increasing function of \widetilde{X}_U and \widetilde{Y}_L is a decreasing function of \widetilde{X}_L. Unfortunately this isn’t always the case as can be seen by taking X to be constant and Y to be continuous.

However the intuition that \widetilde{Y}_U is increasing in \widetilde{X}_U and \widetilde{Y}_L is decreasing in \widetilde{X}_L is close to correct. Given a coupling (\widetilde{X},\widetilde{Y}) \sim H, we can look at the quantity

C(x,y) = \mathbb{P}(\widetilde{Y} \le y | \widetilde{X} \le x) -\mathbb{P}(\widetilde{Y} \le y) = \frac{H(x,y)}{F(x)}-G(y)

This quantity tells us something about how \widetilde{Y} changes with \widetilde{X}. For instance if \widetilde{X} and \widetilde{Y} were positively correlated, then C(x,y) would be positive and if \widetilde{X} and \widetilde{Y} were negatively correlated, then C(x,y) would be negative.

For the independent coupling (\widetilde{X}_I,\widetilde{Y}_I) \sim H_I, the quantity C(x,y) is constantly 0. It turns out that the above probability is maximised by the coupling (\widetilde{X}_U, \widetilde{Y}_U) \sim H_U and minimised by (\widetilde{X}_L,\widetilde{Y}_L) \sim H_L and it is in this sense that they are extremal. This final claim is the two dimensional version of the Fr├ęchet-Hoeffding Theorem and checking it is a good exercise.

Finitely Additive Measures

I am again tutoring the course MATH3029. The content is largely the same but the name has changed from “Probability Modelling and Applications” to “Probability Theory and Applications” to better reflect the material taught. There was a good question on the first assignment that leads to some interesting mathematics.

The Question

The assignment question is as follows. Let \Omega be a set and let \mathcal{F} \subseteq \mathcal{P}(\Omega) be a \sigma-algebra on \Omega. Let \mathbb{P} : \mathcal{F} \to [0,\infty) be a function with the following properties

  1. \mathbb{P}(\Omega) = 1.
  2. For any finite sequence of pairwise disjoint sets (A_k)_{k=1}^n in \mathcal{F}, we have \mathbb{P}\left(\bigcup_{k=1}^n A_k \right) = \sum_{k=1}^n \mathbb{P}(A_k).
  3. If (B_n)_{n=1}^\infty is a sequence of sets in \mathcal{F} such that B_{n+1} \subseteq B_n for all n \in \mathbb{N} and \bigcap_{n=1}^\infty A_n = \emptyset, then, as n tends to infinity, \mathbb{P}(A_n) \to 0.

Students were then asked to show that the function \mathbb{P} is a probability measure on (\Omega, \mathcal{F}). This amounts to showing that \mathbb{P} is countably additive. That is if (A_k)_{k=1}^\infty is a sequence of pairwise disjoint sets, then \mathbb{P}\left(\cup_{k=1}^\infty A_k\right) = \sum_{k=1}^\infty \mathbb{P}(A_k). One way to do this is define B_n = \bigcup_{k=n+1}^\infty A_k. Since the sets (A_k)_{k=1}^\infty are pairwise disjoint, the sets (B_n)_{n=1}^\infty satisfy the assumptions of the third property of \mathbb{P}. Thus we can conclude that \mathbb{P}(B_n) \to 0 as n \to \infty.

We also have that for every n \in \mathbb{N} we have \bigcup_{k=1}^\infty A_k = \left(\bigcup_{k=1}^n A_k\right) \cup B_n. Thus by applying the second property of \mathbb{P} twice we get

\mathbb{P}\left(\bigcup_{k=1}^\infty A_k \right) = \mathbb{P}\left( \bigcup_{k=1}^n A_k\right) + \mathbb{P}(B_n) = \sum_{k=1}^n \mathbb{P}(A_k) + \mathbb{P}(B_n).

If we let n tend to infinity, then we get the desired result.

A Follow Up Question

A natural follow up question is whether all three of the assumptions in the question are necessary. It is particularly interesting to ask if there is an example of a function \mathbb{P} that satisfies the first two properties but is not a probability measure. It turns out the answer is yes but coming up with an example involves some serious mathematics.

Let \Omega be the set of natural numbers \mathbb{N} = \{1,2,3,\ldots\} and let \mathcal{F} be the power set of \mathbb{N}.

One way in which people talk about the size of a subset of natural numbers A \subseteq \mathbb{N} is to look at the proportion of elements in A and take a limit. That is we could define

\mathbb{P}(A) = \lim_{n \to \infty}\frac{|\{k \in A \mid k \le n \}|}{n}.

This function \mathbb{P} has some nice properties for instance if A is the set of all even numbers than \mathbb{P}(A) = 1/2. More generally if A is the set of all numbers divisible by k, then \mathbb{P}(A) = 1/k. The function \mathbb{P} gets used a lot. When people say that almost all natural numbers satisfy a property, they normally mean that if A is the subset of all numbers satisfying the property, then \mathbb{P}(A)=1.

However the function \mathbb{P} is not a probability measure. The function \mathbb{P} is finitely additive. To see this, let (A_i)_{i=1}^m be a finite collection of disjoint subsets of \mathbb{N} and let A = \bigcup_{i=1}^m A_i. Then for every natural number n,

\{k \in A \mid k \le n \} = \bigcup_{i=1}^m \{k \in A_i \mid k \le n\}.

Since the sets (A_i)_{i=1}^m are disjoint, the union on the right is a disjoint union. Thus we have

\frac{|\{k \in A \mid k \le n \}|}{n} = \sum_{i=1}^m \frac{|\{k \in A_i \mid k \le n \}|}{n}.

Taking limits on both sides gives \mathbb{P}(A)=\sum_{i=1}^m \mathbb{P}(A_i), as required. Furthermore, the function \mathbb{P} is not countably additive. For instance if we let A_i = \{i\} for each i \in \mathbb{N}. Then \bigcup_{i=1}^\infty A_i = \mathbb{N} and \mathbb{P}(\mathbb{N})=1. On the other hand \mathbb{P}(A_i)=0 for every i \in \mathbb{N} and hence \sum_{i=1}^\infty \mathbb{P}(A_i)=0\neq \mathbb{P}(\mathbb{N}).

Thus it would appear that we have an example of a finitely additive measure that is not countably additive. However there is a big problem with the above definition of \mathbb{P}. Namely the limit of \frac{|\{k \in A \mid k \le n \}|}{n} does not always exist. Consider the set A = \{3,4,9,10,11,12,13,14,15,16,33,\ldots\}, ie a number k \in \mathbb{N} is in A if and only if 2^{m} < k \le 2^{m+1} for some odd number m\ge 1. The idea with the set A is that it looks a little bit like this:

There are chunks of numbers that alternate between being in A and not being in A and as we move further along, these chunks double in size. Let a_n represent the sequence of numbers \frac{|\{k \in A \mid k \le n \}|}{n}. We can see that a_n increases while n is in a chunk that belongs to A and decreases when n is in a chunk not in A. More specifically if 2^{2m-1} < n \le 2^{2m}, then a_n is increasing but if 2^{2m} < n \le 2^{2m+1}, then a_n is decreasing.

At the turning points n = 2^{2m} or n = 2^{2m+1} we can calculate exactly what a_n is equal to. Note that

\{k \in A \mid k \le 2^{2m} \} = 2+8+32+\ldots+2\cdot 4^{m-1} = 2\cdot \frac{4^m-1}{4-1} = \frac{2}{3}(4^m-1).

Furthermore since there are no elements of A between 2^{2m} and 2^{2m+1} we have

\{k \in A \mid k \le 2^{2m+1}\} = \frac{2}{3}(4^m-1).

Thus we have

a_{2^m} = \frac{\frac{2}{3}(4^m-1)}{2^{2m}}=\frac{2}{3}\frac{4^m-1}{4^m} and a_{2m+1} = \frac{\frac{2}{3}(4^m-1)}{2^{2m+1}}=\frac{1}{3}\frac{4^m-1}{4^m}.

Hence the values a_n fluctuate between approaching \frac{1}{3} and \frac{2}{3}. Thus the limit of a_n does not exist and hence \mathbb{P} is not well-defined.

There is a work around using something called a Banach limit. Banach limits are a way of extending the notion of a limit from the space of convergent sequences to the space of bounded sequences. Banach limits aren’t uniquely defined and don’t have a formula describing them. Indeed to prove the existence of Banach limits one has to rely on non-constructive mathematics such as the Hanh-Banach extension theorem. So if we take for granted the existence of Banach limits we can define

\mathbb{P}(A) = L\left( \left(\frac{|\{k \in A \mid k \le n \}|}{n}\right)_{n=1}^\infty \right),

where L is now a Banach limit. This new definition of \mathbb{P} is defined on all subsets of \mathbb{N} and is an example of measure that is finitely additive but not countably additive. However the definition of \mathbb{P} is very non-constructive. Indeed there are models of ZF set theory where the Hanh-Banach theorem does not hold and we cannot prove the existence of Banach limits.

This begs the question of whether or not there exist constructible examples of measures that are finitely additive but not countably additive. A bit of Googling reveals that non-principal ultrafilters provide another way of defining non-countably additive measures. However the existence of a non-principal ultrafilter on \mathbb{N} is again equivalent to a weak form of the axiom of choice. Thus it seems that the existence of a non-countably additive measure may be inherently non-constructive. This discussion on Math Overflow goes into more detail.

A minimal counterexample in probability theory

Last semester I tutored the course Probability Modelling with Applications. In this course the main objects of study are probability spaces. A probability space is a triple (\Omega, \mathcal{F}, \mathbb{P}) where:

  1. \Omega is a set.
  2. \mathcal{F} is a \sigma-algebra on \Omega. That is \mathcal{F} is a collection of subsets of \Omega such that \Omega \in \mathcal{F} and \mathcal{F} is closed under set complements and countable unions. The element of \mathcal{F} are called events and they are precisely the subsets of \Omega that we can assign probabilities to. We will denote the power set of \Omega by 2^\Omega and hence \mathcal{F} \subseteq 2^\Omega.
  3. \mathbb{P} is a probability measure. That is it is a function \mathbb{P}: \mathcal{F} \rightarrow [0,1] such that \mathbb{P}(\Omega)=1 and for all countable collections \{A_i\}_{i=1}^\infty \subseteq \mathcal{F} of mutually disjoint subsets we have that \mathbb{P} \left(\bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty \mathbb{P}(A_i) .

It’s common for students to find probability spaces, and in particular \sigma-algebras, confusing. Unfortunately Vitalli showed that \sigma-algebras can’t be avoided if we want to study probability spaces such as \mathbb{R} or an infinite number of coin tosses. One of the main reasons why \sigma-algebras can be so confusing is that it can be very hard to give concrete descriptions of all the elements of a \sigma-algebra.

We often have a collection \mathcal{G} of subsets of \Omega that we are interested in but this collection fails to be a \sigma-algebra. For example, we might have \Omega = \mathbb{R}^n and \mathcal{G} is the collection of open subsets. In this situation we take our \sigma-algebra \mathcal{F} to be \sigma(\mathcal{G}) which is the smallest \sigma-algebra containing \mathcal{G}. That is

\sigma(\mathcal{G}) = \bigcap \mathcal{F}'

where the above intersection is taken over all \sigma-algebras \mathcal{F}' that contain \mathcal{G}. In this setting we will say that \mathcal{G} generates \sigma(\mathcal{G}). When we have such a collection of generators, we might have an idea for what probability we would like to assign to sets in \mathcal{G}. That is we have a function \mathbb{P}_0 : \mathcal{G} \rightarrow [0,1] and we want to extend this function to create a probability measure \mathbb{P} : \sigma(\mathcal{G}) \rightarrow [0,1]. A famous theorem due to Caratheodory shows that we can do this in many cases.

An interesting question is whether the extension \mathbb{P} is unique. That is does there exists a probability measure \mathbb{P}' on \sigma(\mathcal{G}) such that \mathbb{P} \neq \mathbb{P}' but \mathbb{P}_{\mid \mathcal{G}} = \mathbb{P}_{\mid \mathcal{G}}'? The following theorem gives a criterion that guarantees no such \mathbb{P}' exists.

Theorem: Let \Omega be a set and let \mathcal{G} be a collection of subsets of \Omega that is closed under finite intersections. Then if \mathbb{P},\mathbb{P}' : \sigma(\mathcal{G}) \rightarrow [0,1] are two probability measures such that \mathbb{P}_{\mid \mathcal{G}} = \mathbb{P}'_{\mid \mathcal{G}}, then \mathbb{P} = \mathbb{P}'.

The above theorem is very useful for two reasons. Firstly it can be combined with Caratheodory’s extension theorem to uniquely define probability measures on a \sigma-algebra by specifying the values on a collection of simple subsets \mathcal{G}. Secondly if we ever want to show that two probability measures are equal, the above theorem tells us we can reduce the problem to checking equality on the simpler subsets in \mathcal{G}.

The condition that \mathcal{G} must be closed under finite intersections is somewhat intuitive. Suppose we had A,B \in \mathcal{G} but A \cap B \notin \mathcal{G}. We will however have A \cap B \in \sigma(\mathcal{G}) and thus we might be able to find two probability measure \mathbb{P},\mathbb{P}' : \sigma(\mathcal{G}) \rightarrow [0,1] such that \mathbb{P}(A) = \mathbb{P}'(A) and \mathbb{P}'(B)=\mathbb{P}'(B) but \mathbb{P}(A \cap B) \neq \mathbb{P}'(A \cap B). The following counterexample shows that this intuition is indeed well-founded.

When looking for examples and counterexamples, it’s good to try to keep things as simple as possible. With that in mind we will try to find a counterexample when \Omega is finite set with as few elements as possible and \sigma(\mathcal{G}) is equal to the powerset of \Omega. In this setting, a probability measure \mathbb{P}: \sigma(\mathcal{G}) \rightarrow [0,1] can be defined by specifying the values \mathbb{P}(\{\omega\}) for each \omega \in \Omega.

We will now try to find a counterexample when \Omega is as small as possible. Unfortunately we won’t be able find a counterexample when \Omega only contains one or two elements. This is because we want to find A,B \subseteq \Omega such that A \cap B is not equal to A,B or \emptyset.

Thus we will start out search with a three element set \Omega = \{a,b,c\}. Up to relabelling the elements of \Omega, the only interesting choice we have for \mathcal{G} is \{  \{a,b\} , \{b,c\} \}. This has a chance of working since \mathcal{G} is not closed under intersection. However any probability measure \mathbb{P} on \sigma(\mathcal{G}) = 2^{\{a,b,c\}} must satisfy the equations

  1. \mathbb{P}(\{a\})+\mathbb{P}(\{b\})+\mathbb{P}(\{c\}) = 1,
  2. \mathbb{P}(\{a\})+\mathbb{P} (\{b\}) = \mathbb{P}(\{a,b\}),
  3. \mathbb{P}(\{b\})+\mathbb{P}(\{c\}) = \mathbb{P}(\{b,c\}).

Thus \mathbb{P}(\{a\}) = 1- \mathbb{P}(\{b,c\}), \mathbb{P}(\{c\}) = 1-\mathbb{P}(\{a,b\}) and \mathbb{P}(\{b\})=\mathbb{P}(\{a,b\})+\mathbb{P}(\{b,c\})-1. Thus \mathbb{P} is determined by its values on \{a,b\} and \{b,c\}.

However, a four element set \{a,b,c,d\} is sufficient for our counter example! We can let \mathcal{G} = \{\{a,b\},\{b,c\}\}. Then \sigma(\mathcal{G})=2^{\{a,b,c,d\}} and we can define \mathbb{P} , \mathbb{P}' : \sigma (\mathcal{G}) \rightarrow [0,1] by

  • \mathbb{P}(\{a\}) = 0, \mathbb{P}(\{b\})=0.5, \mathbb{P}(\{c\})=0 and \mathbb{P}(\{d\})=0.5.
  • \mathbb{P}'(\{a\})=0.5, \mathbb{P}(\{b\})=0, \mathbb{P}(\{c\})=0.5 and \mathbb{P}(\{d\})=0.

Clearly \mathbb{P} \neq \mathbb{P}' however \mathbb{P}(\{a,b\})=\mathbb{P}'(\{a,b\})=0.5 and \mathbb{P}(\{b,c\})=\mathbb{P}'(\{b,c\})=0.5. Thus we have our counterexample! In general for any \lambda \in [0,1) we can define the probability measure \mathbb{P}_\lambda = \lambda\mathbb{P}+(1-\lambda)\mathbb{P}'. The measure \mathbb{P}_\lambda is not equal to \mathbb{P} but agrees with \mathbb{P} on \mathcal{G}. In general, if we have two probability measures that agree on \mathcal{G} but not on \sigma(\mathcal{G}) then we can produce uncountably many such measures by taking convex combinations as done above.