Suppose we have two samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ and we want to test whether they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.
The equal variance t-test
In the equal variance t-test, we assume that $X_1, \ldots, X_n \sim \mathcal{N}(\mu_X, \sigma^2)$ and $Y_1, \ldots, Y_m \sim \mathcal{N}(\mu_Y, \sigma^2)$, where $\sigma^2$ is unknown. Our hypothesis that $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ are from the same distribution becomes the hypothesis $\mu_X = \mu_Y$. The test statistic is

$$T = \frac{\bar{X} - \bar{Y}}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{1}{m}}},$$

where $\bar{X}$ and $\bar{Y}$ are the two sample means. The variable $\hat{\sigma}$ is the pooled estimate of the standard deviation and is given by

$$\hat{\sigma}^2 = \frac{1}{n+m-2}\left(\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 + \sum_{j=1}^{m}\left(Y_j - \bar{Y}\right)^2\right).$$

Under the null hypothesis, $T$ follows the T-distribution with $n+m-2$ degrees of freedom. We thus reject the null at level $\alpha$ when $|T|$ exceeds the $1 - \alpha/2$ quantile of this T-distribution.
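To make these formulas concrete, here is a minimal Python sketch (the samples, sizes and seed are made up for illustration) that computes the pooled estimate $\hat{\sigma}$ and the statistic $T$ directly and checks them against `scipy.stats.ttest_ind` with `equal_var=True`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 12, 15
X = rng.normal(size=n)   # sample X_1, ..., X_n
Y = rng.normal(size=m)   # sample Y_1, ..., Y_m

# Pooled estimate of the standard deviation.
pooled_var = (np.sum((X - X.mean()) ** 2) + np.sum((Y - Y.mean()) ** 2)) / (n + m - 2)
sigma_hat = np.sqrt(pooled_var)

# Equal variance two-sample t-statistic.
T = (X.mean() - Y.mean()) / (sigma_hat * np.sqrt(1 / n + 1 / m))

# Compare with scipy's implementation of the equal variance t-test.
T_scipy, p_value = stats.ttest_ind(X, Y, equal_var=True)
print(T, T_scipy)                                      # the two statistics agree
print(2 * stats.t.sf(abs(T), df=n + m - 2), p_value)   # and so do the two-sided p-values
```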
Pooling the data
We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let $Z_1, \ldots, Z_{n+m} = X_1, \ldots, X_n, Y_1, \ldots, Y_m$ be the pooled data and, for $i = 1, \ldots, n+m$, define $D_i$ by

$$D_i = \begin{cases} 1 & \text{if } i \le n, \\ 0 & \text{if } i > n. \end{cases}$$

The assumptions that $X_i \sim \mathcal{N}(\mu_X, \sigma^2)$ and $Y_j \sim \mathcal{N}(\mu_Y, \sigma^2)$ can be rewritten as

$$Z_i = \beta_0 + \beta_1 D_i + \varepsilon_i,$$

where $\beta_0 = \mu_Y$, $\beta_1 = \mu_X - \mu_Y$ and $\varepsilon_1, \ldots, \varepsilon_{n+m}$ are independent $\mathcal{N}(0, \sigma^2)$ errors. That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis $\mu_X = \mu_Y$ is equivalent to $\beta_1 = 0$. To test $\beta_1 = 0$ we can use the standard t-test for a coefficient in a linear model. The test statistic in this case is

$$T' = \frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{22}}},$$

where $\hat{\beta}_1$ is the ordinary least squares estimate of $\beta_1$, $\mathbf{X} \in \mathbb{R}^{(n+m)\times 2}$ is the design matrix and $\hat{\sigma}_{OLS}$ is an estimate of $\sigma$ given by

$$\hat{\sigma}_{OLS}^2 = \frac{1}{n+m-2}\sum_{i=1}^{n+m}\left(Z_i - \hat{Z}_i\right)^2,$$

where $\hat{Z}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i$ is the fitted value of $Z_i$.
It turns out that $T'$ is exactly equal to $T$. We can see this by writing out the design matrix and calculating everything above. The design matrix $\mathbf{X}$ has rows $(1, D_i)$ and is thus equal to

$$\mathbf{X} = \begin{pmatrix} 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{pmatrix},$$

where the first $n$ rows are $(1, 1)$ and the remaining $m$ rows are $(1, 0)$. This implies that

$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} n+m & n \\ n & n \end{pmatrix}.$$

And therefore,

$$(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{nm}\begin{pmatrix} n & -n \\ -n & n+m \end{pmatrix}.$$

Thus, $\left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{22} = \frac{n+m}{nm} = \frac{1}{n} + \frac{1}{m}$. So,

$$T' = \frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n} + \frac{1}{m}}},$$

which is starting to look like $T$ from the two-sample test. Now

$$\mathbf{X}^T Z = \begin{pmatrix} \sum_{i=1}^{n+m} Z_i \\ \sum_{i=1}^{n} Z_i \end{pmatrix} = \begin{pmatrix} n\bar{X} + m\bar{Y} \\ n\bar{X} \end{pmatrix}.$$

And so

$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Z = \frac{1}{nm}\begin{pmatrix} n & -n \\ -n & n+m \end{pmatrix}\begin{pmatrix} n\bar{X} + m\bar{Y} \\ n\bar{X} \end{pmatrix} = \begin{pmatrix} \bar{Y} \\ \bar{X} - \bar{Y} \end{pmatrix}.$$

Thus, $\hat{\beta}_0 = \bar{Y}$ and $\hat{\beta}_1 = \bar{X} - \bar{Y}$, and

$$T' = \frac{\bar{X} - \bar{Y}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n} + \frac{1}{m}}}.$$

This means that to show $T' = T$, we only need to show that $\hat{\sigma}_{OLS} = \hat{\sigma}$. To do this, note that the fitted values are equal to

$$\hat{Z}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i = \begin{cases} \bar{X} & \text{if } i \le n, \\ \bar{Y} & \text{if } i > n. \end{cases}$$

Thus,

$$\hat{\sigma}_{OLS}^2 = \frac{1}{n+m-2}\sum_{i=1}^{n+m}\left(Z_i - \hat{Z}_i\right)^2 = \frac{1}{n+m-2}\left(\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 + \sum_{j=1}^{m}\left(Y_j - \bar{Y}\right)^2\right),$$

which is exactly $\hat{\sigma}^2$. Therefore, $T' = T$ and the two sample t-test is equivalent to a correlation test.
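As a numerical sanity check of this equivalence, the sketch below (again with made-up data) builds the pooled vector $Z$ and the dummy variable $D$, fits the linear model by ordinary least squares with numpy, and confirms that the resulting coefficient t-statistic matches the classical two-sample statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, m = 10, 14
X_sample = rng.normal(size=n)
Y_sample = rng.normal(size=m)

# Pool the data and encode sample membership with a dummy variable
# (D_i = 1 for the first sample and 0 for the second, as in the text).
Z = np.concatenate([X_sample, Y_sample])
D = np.concatenate([np.ones(n), np.zeros(m)])
design = np.column_stack([np.ones(n + m), D])   # rows are (1, D_i)

# Ordinary least squares estimates of (beta_0, beta_1).
beta_hat, *_ = np.linalg.lstsq(design, Z, rcond=None)

# Estimate of sigma^2 from the residuals, with n + m - 2 degrees of freedom.
fitted = design @ beta_hat
sigma2_hat = np.sum((Z - fitted) ** 2) / (n + m - 2)

# t-statistic for the hypothesis beta_1 = 0.
XtX_inv = np.linalg.inv(design.T @ design)
t_regression = beta_hat[1] / np.sqrt(sigma2_hat * XtX_inv[1, 1])

# The classical equal variance two-sample t-statistic.
t_two_sample, _ = stats.ttest_ind(X_sample, Y_sample, equal_var=True)
print(t_regression, t_two_sample)   # the two statistics agree
```

Because the dummy variable equals 1 for the first sample, $\hat{\beta}_1 = \bar{X} - \bar{Y}$ and the two statistics match without any sign change.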
The Friedman-Rafsky test
In the above example, we saw that the two sample t-test is a special case of the t-test for a regression coefficient. This is neat, but both tests make very strong assumptions about the data. The same thing, however, happens in a more interesting non-parametric setting.
In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two sample test that makes no assumptions about the distribution of the data. The two samples do not even have to be real-valued and can instead be from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.
The material was based on the discussion and references given in this stackexchange post. The title is a reference to a Halloween lecture on measurability given by Professor Persi Diaconis.
What’s scarier than a non-measurable set?
Making every set measurable. Or rather one particular consequence of making every set measurable.
In my talk, I argued that if you make every set measurable, then there exists a set $X$ and an equivalence relation $\sim$ on $X$ such that $|X| < |X/\sim|$. That is, the set $X$ has strictly smaller cardinality than the set of equivalence classes $X/\sim$. The contradictory nature of this statement is illustrated in the picture below.
We can think of the set $X$ as the collection of crosses drawn above. The equivalence relation $\sim$ divides $X$ into the regions drawn above. The statement $|X| < |X/\sim|$ means that in some sense there are more regions than crosses.
To make sense of this we’ll first have to be a bit more precise about what we mean by cardinality.
What do we mean by bigger and smaller?
Let $A$ and $B$ be two sets. We say that $A$ and $B$ have the same cardinality, and write $|A| = |B|$, if there exists a bijective function $f : A \to B$. We can think of the function $f$ as a way of pairing each element of $A$ with a unique element of $B$ such that every element of $B$ is paired with an element of $A$.
We next want to define $|A| \le |B|$, which means that $A$ has cardinality at most the cardinality of $B$. There are two reasonable ways in which we could try to define this relationship:
1. We could say $|A| \le |B|$ means that there exists an injective function $f : A \to B$.
2. Alternatively, we could say $|A| \le |B|$ means that there exists a surjective function $g : B \to A$.
Definitions 1 and 2 say similar things and, in the presence of the axiom of choice, they are equivalent. Since we are going to be making every set measurable in this talk, we won't be assuming the axiom of choice. Definitions 1 and 2 are thus no longer equivalent and we have a decision to make. We will use definition 1 in this talk. For justification, note that definition 1 implies that there exists a subset $B' \subseteq B$ such that $|A| = |B'|$: we simply take $B'$ to be the range of $f$. This is a desirable property of the relation $|A| \le |B|$ and it's not clear how this could be achieved using definition 2.
Infinite binary sequences
It’s time to introduce the set $X$ and the equivalence relation $\sim$ we will be working with. The set $X$ is the set of all functions $x : \mathbb{Z} \to \{0, 1\}$. We can think of each element $x \in X$ as an infinite sequence of zeros and ones stretching off in both directions. For example

$$x = \ldots\ 0\ 1\ 1\ 0\ 1\ 0\ 0\ \ldots.$$

But this analogy hides something important. Each $x \in X$ has a "middle", namely the value $x(0)$. For instance, the two sequences below look the same, but when we make $x(0)$ bold we see that they are different:

$$x = \ldots\ 0\ 1\ \mathbf{1}\ 0\ 1\ 1\ 0\ \ldots,$$

$$y = \ldots\ 0\ 1\ 1\ \mathbf{0}\ 1\ 1\ 0\ \ldots.$$

The equivalence relation $\sim$ on $X$ can be thought of as forgetting the location of $0$. More formally, we have $x \sim y$ if and only if there exists $k \in \mathbb{Z}$ such that $x(i + k) = y(i)$ for all $i \in \mathbb{Z}$. That is, $x \sim y$ if and only if shifting the sequence $x$ by $k$ gives the sequence $y$. We will use $[x]$ to denote the equivalence class of $x$ and $X/\sim$ for the set of all equivalence classes.
Some probability
Associated with the space $X$ are functions $\pi_k : X \to \{0, 1\}$, one for each integer $k \in \mathbb{Z}$. These functions simply evaluate $x$ at $k$; that is, $\pi_k(x) = x(k)$. A probabilist or statistician would think of $\pi_k$ as reporting the result of one of infinitely many independent coin tosses. Normally, to make this formal we would have to first define a $\sigma$-algebra on $X$ and then define a probability on this $\sigma$-algebra. Today we're working in a world where every set is measurable and so we don't have to worry about $\sigma$-algebras. Indeed we have the following result:
(Solovay, 1970)¹ There exists a model of the Zermelo–Fraenkel axioms of set theory in which there is a probability $P$ defined on all subsets of $X$ such that the coordinate functions $\pi_k$ are i.i.d. Bernoulli($1/2$) random variables.
This result is saying that there is a world in which, other than the axiom of choice, all the regular axioms of set theory hold. And in this world, we can assign a probability to every subset of $X$ in such a way that the events $\{\pi_k = 1\}$ are all independent and each have probability $1/2$. It's important to note that $P$ is a true countably additive probability and we can apply all our familiar probability results to it. We are now ready to state and prove the spooky result claimed at the start of this talk.
Proposition: Given the existence of such a probability $P$, there is no injective function from $X/\sim$ to $X$; that is, $|X/\sim| \not\le |X|$.
Proof: Let $f : X/\sim \to X$ be any function. To show that $|X/\sim| \not\le |X|$, we need to show that $f$ is not injective. To do this, we'll first define another function $\tilde{f} : X \to X$ given by $\tilde{f}(x) = f([x])$. That is, $\tilde{f}$ first maps $x$ to $x$'s equivalence class and then applies $f$ to this equivalence class. This is illustrated below.
A commutative diagram showing the definition of $\tilde{f}$ as $f$ composed with the quotient map $x \mapsto [x]$.
We will show that $\tilde{f}$ is almost surely constant with respect to $P$. That is, there exists $x_0 \in X$ such that $P(\tilde{f}(x) = x_0) = 1$. Each equivalence class $[x]$ is finite or countable and thus has probability zero under $P$ (each singleton has probability zero, since specifying the first $N$ coordinates of a sequence is an event of probability $2^{-N}$ for every $N$). This means that if $\tilde{f}$ is almost surely constant, then $f$ cannot be injective and must map multiple (in fact infinitely many) equivalence classes to $x_0$.
It thus remains to show that $\tilde{f}$ is almost surely constant. To do this we will introduce a third function $S : X \to X$. The map $S$ is simply the shift map and is given by $S(x)(i) = x(i+1)$. Note that $x$ and $S(x)$ are in the same equivalence class for every $x \in X$. Thus, the map $\tilde{f}$ satisfies $\tilde{f} \circ S = \tilde{f}$. That is, $\tilde{f}$ is $S$-invariant.
The map $S$ is ergodic with respect to $P$. This means that if an event $A \subseteq X$ satisfies $S^{-1}(A) = A$, then $P(A)$ equals $0$ or $1$. For example, if $A$ is the event that some fixed finite pattern, say $101$, appears at some point in $x$, then $S^{-1}(A) = A$ and $P(A) = 1$. Likewise, if $A$ is the event that the relative frequency of heads converges to a number strictly greater than $1/2$, then $S^{-1}(A) = A$ and $P(A) = 0$. The general claim that all $S$-invariant events have probability $0$ or $1$ can be proved using the independence of the coordinates $\pi_k$.
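These two examples are statements about infinite sequences, but a quick finite simulation (a Python sketch with an arbitrary seed and window length of my choosing) gives the right intuition: in a long window of fair coin flips the pattern $101$ does appear, and the relative frequency of ones hovers around $1/2$.

```python
import numpy as np

rng = np.random.default_rng(2024)
x = rng.integers(0, 2, size=1_000_000)   # a long window of i.i.d. fair coin flips

# A fixed finite pattern such as 1 0 1 appears in essentially every long window.
window = "".join(map(str, x))
print("101" in window)                   # True, with overwhelming probability

# The relative frequency of ones is close to 1/2, so the event that it converges
# to something strictly greater than 1/2 has probability zero.
print(x.mean())                          # approximately 0.5
```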
For each $k \in \mathbb{Z}$, define an event $A_k$ by $A_k = \{x \in X : \tilde{f}(x)(k) = 1\}$. Since $\tilde{f}$ is $S$-invariant we have that $S^{-1}(A_k) = A_k$. Thus, $P(A_k) = 0$ or $P(A_k) = 1$. This gives us a function $x_0 : \mathbb{Z} \to \{0, 1\}$ given by $x_0(k) = P(A_k)$. Note that for every $k$, $P(\tilde{f}(x)(k) = x_0(k)) = 1$. This is because if $x_0(k) = 1$, then $P(A_k) = 1$ by the definition of $x_0$, and $A_k$ is exactly the event $\{\tilde{f}(x)(k) = 1\}$. Likewise, if $x_0(k) = 0$, then $P(A_k) = 0$ and hence $P(\tilde{f}(x)(k) = 0) = 1$. Thus, in both cases, $P(\tilde{f}(x)(k) = x_0(k)) = 1$.
Since $P$ is a countably additive probability measure, we can conclude that

$$P(\tilde{f}(x) = x_0) = P\left(\bigcap_{k \in \mathbb{Z}} \left\{\tilde{f}(x)(k) = x_0(k)\right\}\right) = 1.$$
Thus, $\tilde{f}$ maps $x$ to $x_0$ with probability one, showing that $\tilde{f}$ is almost surely constant and hence that $f$ is not injective.
There’s a catch!
So we have proved that there cannot be an injective map $f : X/\sim \to X$. Does this mean we have proved $|X| < |X/\sim|$? Technically, no. We have proved the negation of $|X/\sim| \le |X|$, which does not imply $|X| < |X/\sim|$. To argue that $|X| < |X/\sim|$ we need to produce a map $g : X \to X/\sim$ that is injective. Surprisingly, this is possible and not too difficult. The idea is to find a map $\varphi : X \to X$ such that $\varphi(x) \sim \varphi(y)$ implies that $x = y$, and then take $g(x) = [\varphi(x)]$. This can be done by somehow encoding in $\varphi(x)$ where the centre of $x$ is.
A simpler proof and other examples
Our proof was nice because we explicitly calculated the value $x_0$ to which $\tilde{f}$ sent almost all of $X$. We could have been less explicit and simply noted that the function $\tilde{f}$ is measurable with respect to the invariant $\sigma$-algebra of $S$ and hence almost surely constant by the ergodicity of $S$.
This quicker proof allows us to generalise our "spooky result" to other sets. Below are two examples, each with $X = [0, 1)$, where the same thing happens:
Fix an irrational number $\alpha$ and define $x \sim y$ if and only if $x - y \equiv k\alpha \pmod{1}$ for some $k \in \mathbb{Z}$.
Define $x \sim y$ if and only if $x - y \in \mathbb{Q}$.
A similar argument can be used to show that in Solovay's world $|X| < |X/\sim|$ for both examples. The exact same argument goes through, using the ergodicity of the corresponding actions on $[0, 1)$ under the uniform measure.
Three takeaways
I hope you agree that this example is good fun and surprising. I’d like to end with some remarks.
The first remark is some mathematical context. The argument given today is linked to an interesting area of mathematics called descriptive set theory. This field studies the properties of well behaved subsets (such as Borel subsets) of topological spaces. Descriptive set theory incorporates logic, topology and ergodic theory. I don't know much about the field, but in Persi's Halloween talk he said that one "monster" was that few people are interested in the subject.
The next remark is a better way to think about our "spooky result". The result is really saying something about cardinality. When we no longer use the axiom of choice, cardinality becomes a subtle concept. The statement $|A| \le |B|$ no longer corresponds to $A$ being "smaller" than $B$ but rather to $A$ being "less complex" than $B$. This is perhaps analogous to some statistical models which may be "large" but do not overfit due to subtle constraints on the model complexity.
In light of the previous remark, I would invite you to think about whether the example I gave is truly spookier than non-measurable sets. It might seem to you that it is simply a reasonable consequence of removing the axiom of choice and restricting ourselves to functions we could actually write down or understand. I'll let you decide.
Footnotes
Technically, Solovay proved that there exists a model of set theory in which every subset of $\mathbb{R}$ is Lebesgue measurable. To get the result for binary sequences we have to restrict to $[0, 1]$ and use the binary expansion of each number in $[0, 1]$ to define a map into the space of binary sequences. Solovay's paper is available here: https://www.jstor.org/stable/1970696?seq=1
The fundamental theorem of arithmetic states that every natural number can be factorized uniquely as a product of prime numbers. The word "uniquely" here means unique up to rearranging. The theorem means that if you and I take the same number $n$ and I write $n = p_1 p_2 \cdots p_k$ and you write $n = q_1 q_2 \cdots q_l$, where each $p_i$ and $q_j$ is a prime number, then in fact $k = l$ and we wrote the same prime numbers (but maybe in a different order).
Most people happily accept this theorem as self evident and believe it without proof. Indeed some people take it to be so self evident they feel it doesn’t really deserve the name “theorem” – hence the title of this blog post. In this post I want to highlight two situations where an analogous theorem fails.
Situation One: The Even Numbers
Imagine a world where everything comes in twos. In this world nobody knows of the number one or indeed any odd number. Their counting numbers are the even numbers $2, 4, 6, 8, 10, \ldots$. People in this world can add numbers and multiply numbers just like we can. They can even talk about divisibility; for example, $2$ divides $8$ since $8 = 2 \times 4$. Note that things are already getting a bit strange in this world. Since there is no number one, numbers in this world do not divide themselves.
Once people can talk about divisibility, they can talk about prime numbers. A number is prime in this world if it is not divisible by any other number. For example, $2$ is prime, but as we saw $8$ is not prime. Surprisingly, the number $6$ is also prime in this world. This is because there are no two even numbers that multiply together to make $6$.
If a number is not prime in this world, we can reduce it to a product of primes. This is because if $n$ is not prime, then there are two numbers $a$ and $b$ such that $n = a \times b$. Since $a$ and $b$ are both smaller than $n$, we can apply the same argument to each of them and recursively write $n$ as a product of primes.
Now we can ask whether or not the fundamental theorem of arithmetic holds in this world. Namely, we want to know if there is a unique way to factorize each number in this world. To get an idea we can start with some small even numbers.
$2$ is prime.
$4 = 2 \times 2$ can be factorized uniquely.
$6$ is prime.
$8 = 2 \times 2 \times 2$ can be factorized uniquely.
$10$ is prime.
$12 = 2 \times 6$ can be factorized uniquely.
$14$ is prime.
$16 = 2 \times 2 \times 2 \times 2$ can be factorized uniquely.
$18$ is prime.
$20 = 2 \times 10$ can be factorized uniquely.
Thus it seems as though there might be some hope for this theorem. It at least holds for the first handful of numbers. Unfortunately, we eventually get to $36$ and we have:
$36 = 2 \times 18$ and $36 = 6 \times 6$.
Thus there are two distinct ways of writing $36$ as a product of primes in this world, and so the fundamental theorem of arithmetic does not hold.
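If you want to experiment with this world yourself, here is a small Python sketch (the helper names are my own) that tests whether an even number is prime in the even-number world and enumerates all of its factorizations into such primes; running it confirms that $12$ factorizes uniquely while $36$ has the two factorizations above.

```python
def is_even_prime(n):
    """In the even-number world, an even number is 'prime' when it is not the
    product of two smaller even numbers, i.e. when it is not divisible by 4."""
    return n % 2 == 0 and n % 4 != 0

def factorizations(n, min_factor=2):
    """All ways of writing the even number n as a non-decreasing product of
    even-world primes."""
    results = []
    if is_even_prime(n) and n >= min_factor:
        results.append((n,))
    for p in range(min_factor, n // 2 + 1, 2):
        if is_even_prime(p) and n % p == 0 and (n // p) % 2 == 0:
            for rest in factorizations(n // p, p):
                results.append((p,) + rest)
    return results

print(factorizations(12))   # [(2, 6)]           -- unique factorization
print(factorizations(36))   # [(2, 18), (6, 6)]  -- two distinct factorizations
```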
Situation Two: A Number Ring
While the first example is fun and interesting, it is somewhat artificial. We are unlikely to encounter a situation where we only have the even numbers. It is, however, common and natural for mathematicians to be led into certain worlds called number rings. We will see one example here and see what effect the fundamental theorem of arithmetic can have.
Consider wanting to solve the equation $x^2 + 19 = y^3$ where $x$ and $y$ are both integers. One way to try to solve this is by rewriting the equation as $(x + \sqrt{-19})(x - \sqrt{-19}) = y^3$. With this rewriting we have left the familiar world of the whole numbers and entered the number ring $\mathbb{Z}[\sqrt{-19}]$.
In $\mathbb{Z}[\sqrt{-19}]$ all numbers have the form $a + b\sqrt{-19}$, where $a$ and $b$ are integers. Addition of two such numbers is defined like so:

$$(a_1 + b_1\sqrt{-19}) + (a_2 + b_2\sqrt{-19}) = (a_1 + a_2) + (b_1 + b_2)\sqrt{-19}.$$
Multiplication is defined by using the distributive law and the fact that $\sqrt{-19} \times \sqrt{-19} = -19$. Thus

$$(a_1 + b_1\sqrt{-19}) \times (a_2 + b_2\sqrt{-19}) = (a_1 a_2 - 19 b_1 b_2) + (a_1 b_2 + a_2 b_1)\sqrt{-19}.$$
Since we have multiplication, we can talk about when one number in $\mathbb{Z}[\sqrt{-19}]$ divides another and hence define primes in $\mathbb{Z}[\sqrt{-19}]$. One can show that if $x^2 + 19 = y^3$, then $x + \sqrt{-19}$ and $x - \sqrt{-19}$ are coprime in $\mathbb{Z}[\sqrt{-19}]$ (see the references at the end of this post).
This means that there are no primes in $\mathbb{Z}[\sqrt{-19}]$ that divide both $x + \sqrt{-19}$ and $x - \sqrt{-19}$. If we assume that the fundamental theorem of arithmetic holds in $\mathbb{Z}[\sqrt{-19}]$, then this implies that $x + \sqrt{-19}$ must itself be a cube. This is because $(x + \sqrt{-19})(x - \sqrt{-19}) = y^3$ is a cube, and if two coprime numbers multiply to give a cube, then both of those coprime numbers must be cubes.
Thus we can conclude that there are integers $a$ and $b$ such that $x + \sqrt{-19} = (a + b\sqrt{-19})^3$. If we expand out this cube we can conclude that

$$x + \sqrt{-19} = (a^3 - 57ab^2) + (3a^2 b - 19b^3)\sqrt{-19}.$$
Thus in particular we have $1 = 3a^2 b - 19b^3 = b(3a^2 - 19b^2)$. This implies that $b = \pm 1$ and $3a^2 - 19b^2 = \pm 1$. Hence $b = \pm 1$ and $3a^2 - 19 = \pm 1$. Now if $3a^2 - 19 = 1$, then $3a^2 = 20$ – a contradiction. Similarly, if $3a^2 - 19 = -1$, then $3a^2 = 18$ and so $a^2 = 6$ – another contradiction. Thus we can conclude there are no integer solutions to the equation $x^2 + 19 = y^3$!
Unfortunately, however, a bit of searching reveals that $18^2 + 19 = 343 = 7^3$. Thus simply assuming that the ring $\mathbb{Z}[\sqrt{-19}]$ has unique factorization led us to incorrectly conclude that an equation had no solutions. The question of unique factorization in number rings such as $\mathbb{Z}[\sqrt{-19}]$ is a subtle and important one. Some of the flawed proofs of Fermat's Last Theorem incorrectly assume that certain number rings have unique factorization – like we did above.
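For the sceptical reader, here is a short Python sketch (my own illustration, not part of the original argument) that multiplies elements of $\mathbb{Z}[\sqrt{-19}]$ using the rule above, checks the cube expansion used in the argument, and brute-forces small solutions of $x^2 + 19 = y^3$.

```python
def mult(u, v):
    """Multiply u = (a1, b1) and v = (a2, b2), representing a1 + b1*sqrt(-19)
    and a2 + b2*sqrt(-19), using the fact that sqrt(-19) squared is -19."""
    a1, b1 = u
    a2, b2 = v
    return (a1 * a2 - 19 * b1 * b2, a1 * b2 + a2 * b1)

# Check the expansion (a + b*sqrt(-19))^3 = (a^3 - 57ab^2) + (3a^2 b - 19b^3)*sqrt(-19).
for a in range(-5, 6):
    for b in range(-5, 6):
        cube = mult((a, b), mult((a, b), (a, b)))
        assert cube == (a**3 - 57 * a * b**2, 3 * a**2 * b - 19 * b**3)

# Brute-force search for small integer solutions of x^2 + 19 = y^3.
solutions = [(x, y) for x in range(1, 100) for y in range(1, 20) if x**2 + 19 == y**3]
print(solutions)   # [(18, 7)] since 324 + 19 = 343 = 7**3
```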
References
The lecturer David Smyth showed us that the even integers do not have unique factorization during a lecture of the great course MATH2222.
The example of $\mathbb{Z}[\sqrt{-19}]$ failing to have unique factorization, and the consequences of this, was shown in a lecture for a course on algebraic number theory by James Borger. In this class we followed the (freely available) textbook "Number Rings" by P. Stevenhagen. Problem 1.4 on page 8 is the example I used in this post. By viewing the textbook you can see a complete solution to the problem.