## A clock is a one-dimensional subgroup of the torus

Recently, my partner and I installed a clock in our home. The clock previously belonged to my grandparents and we have owned it for a while. We hadn’t put it up earlier because the original clock movement ticked and the sound would disrupt our small studio apartment. After much procrastinating, I bought a new clock movement, replaced the old one and proudly hung up our clock.

When I first put on the clock hands I made the mistake of not putting them both on at exactly 12 o’clock. This meant that the minute and hour hands were not synchronised. The hands were in an impossible position. At times, the minute hand was at 12 while the hour hand was between 3 and 4. It took me some time to register my mistake, as at some times of the day it can be hard to tell that the hands are out of sync (how often do you look at a clock at exactly 12:00?). Fortunately, I did notice the mistake and we now have a correct clock. Now I can’t help noticing when others make the same mistake, such as in this piece of clip art.

After fixing the clock, I was still thinking about how only some clock hand positions correspond to actual times. This led me to think “a clock is a one-dimensional subgroup of the torus”. Let me explain why.

### The torus

The minute and hour hands on a clock can be thought of as two points on two different circles. For instance, if the time is 9:30, then the minute hand corresponds to a point at the very bottom of the circle and the hour hand corresponds to a point 15 degrees clockwise of the leftmost point of the circle. As a clock goes through a 12 hour cycle the minute-hand-point goes around the circle 12 times and the hour-hand-point goes around the circle once. This is shown below.

If you take the collection of all such pairs of points, you get what mathematicians call a torus. The torus is a geometric shape that looks like the surface of a donut and is defined as the Cartesian product of two circles. That is, a single point on the torus corresponds to a pair of points on two different circles. A torus is plotted below.

To understand the torus, it’s helpful to consider a more familiar example, the 2-dimensional plane. If we have points $x$ and $y$ on two different lines, then we can produce the point $(x,y)$ in the two dimensional plane. Likewise, if we have a point $p$ and a point $q$ on two different circles, then we can produce a point $(p,q)$ on the torus. Both of these concepts are illustrated below. I have added two circles to the torus which are analogous to the x and y axes of the plane. The blue and red points on the blue and red circle produce the black point on the torus.

### Mapping the clock to the torus

The points on the torus are in one-to-one correspondence with possible arrangements of the two clock hands. However, as I learnt while putting up our clock, not all arrangements of clock hands correspond to an actual time. This means that only some points on the torus correspond to an actual time. But how can we identify these points?

Keeping with our previous convention, let’s use the blue circle to represent the position of the minute hand and the red circle to represent the position of the hour hand. This means that the point where the two circles meet corresponds to 12 o’clock.

There are eleven other points on the red circle that correspond to the other times when the minute hand is at 12. That is, there’s a point for 1 o’clock, 2 o’clock, 3 o’clock and so on. Once we add in those points, our torus looks like this:

Finally, we have to join these points together. We know that when the hour hand moves from 12 to 1, the minute hand does one full rotation. This means that we have to join the black points by making one full rotation in the direction of the blue circle. The result is the black curve below that snakes around the torus.

The picture above should explain most of this blog’s title – “a clock is a one-dimensional subgroup of the torus”. We now know what the torus is and why certain points on the torus correspond to positions of the hands on a clock. We can see that these “clock points” correspond to a line that snakes around the torus. While the torus is a surface and hence two dimensional, the line is one-dimensional. The last missing part is the word “subgroup”. I won’t go into the details here but the torus has some extra structure that makes it something called a group. Our map from the clock to the torus interacts nicely with this structure and this makes the black line a “subgroup”.
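To make the word “subgroup” a little more concrete, here is one way to write the map down explicitly. The parametrisation below is my own choice of convention (time measured in hours, each circle modelled as $\mathbb{R}/\mathbb{Z}$), not something fixed by the post:

```latex
% t = time in hours; minute hand sits at t mod 1, hour hand at t/12 mod 1.
\gamma : \mathbb{R}/12\mathbb{Z} \longrightarrow (\mathbb{R}/\mathbb{Z}) \times (\mathbb{R}/\mathbb{Z}),
\qquad
\gamma(t) = \left( t \bmod 1,\; \tfrac{t}{12} \bmod 1 \right).
% gamma(s + t) = gamma(s) + gamma(t), so gamma is a group homomorphism,
% and its image (the black curve) is a one-dimensional subgroup of the torus.
```

Since $\gamma$ is a homomorphism of groups, its image inherits the group structure, which is exactly what makes the black curve a subgroup rather than just a curve.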

### Another perspective

While the above pictures of the torus are pretty, they can be a bit hard to understand and hard to draw. Mathematicians have another perspective on the torus that is often easier to work with. Imagine that you have a square sheet of rubber. If you rolled up the rubber and joined a pair of opposite sides, you would get a rubber tube. If you then bent the tube to join the other pair of opposite sides, you would get a torus! The gif below illustrates this idea.

This means that we can simply view the torus as a square. We just have to remember that the opposite sides of the squares have been glued together. So like a game of snake on a phone, if you leave the top of the square, you come out at the same place on the bottom of the square. If we use this idea to redraw our torus it now looks like this:

As before we can draw in the other points when the minute hand is at 12. These points correspond to 1 o’clock, 2 o’clock, 3 o’clock…

Finally we can draw in all the other times on the clock. This is the result:

One nice thing about this picture is that it can help us answer a classic riddle. In a 12-hour cycle, how many times are the minute and hour hands on top of each other? We can answer this riddle by adding a second line to the above square. The bottom-left to top-right diagonal is the collection of all hand positions where the two hands are on top of each other. Let’s add that line in green and add the points where this new line intersects the black line.

The points where the green and black lines intersect are hand positions where the clock hands are directly on top of each other and which correspond to actual times. Thus we can count that there are exactly 11 times when the hands are on top of each other in a 12-hour cycle. It might look like there are 12 such times but we have to remember that the corners of the square are all the same point on the torus.
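We can double-check this count with a few lines of code. The sketch below (my own, not from the original post) uses exact rational arithmetic: writing $t$ for the time in hours, the hands coincide exactly when the minute position $t \bmod 1$ equals the hour position $t/12 \bmod 1$, i.e. when $\tfrac{11}{12}t$ is an integer, giving $t = \tfrac{12k}{11}$.

```python
from fractions import Fraction

# Positions as fractions of a full turn at time t (in hours):
# minute hand: t mod 1, hour hand: t/12 mod 1.
# They coincide when t - t/12 = (11/12) t is an integer,
# i.e. t = 12k/11 for k = 0, 1, ..., 10.
coincidences = [Fraction(12, 11) * k for k in range(11)]

for t in coincidences:
    minute = t % 1
    hour = (t / 12) % 1
    assert minute == hour  # the two hands really do line up

print(len(coincidences))  # 11 times in a 12-hour cycle
```

The corner of the square only counts once, matching the picture: the twelfth apparent intersection is the same point of the torus as the first.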

So far I have ignored the second hand on the clock. If we included the second hand, we would have three points on three different circles. The corresponding geometric object is a 3-dimensional torus. The 3-dimensional torus is what you get when you take a cube and glue together the three pairs of opposite faces (don’t worry if you have trouble visualising such a shape!).

The points on the 3-dimensional torus which correspond to actual times will again be a line that wraps around the 3-dimensional torus. You could use this line to find out how many times the three hands are all on top of each other! Let me know if you work it out.

I hope that if you’re ever asked to define a clock, you’d at least consider saying “a clock is a one-dimensional subgroup of the torus” and you could even tell them which subgroup!

## Sampling from the non-central chi-squared distribution

The non-central chi-squared distribution is a generalisation of the regular chi-squared distribution. The chi-squared distribution turns up in many statistical tests as the (approximate) distribution of a test statistic under the null hypothesis. Under alternative hypotheses, those same statistics often have approximate non-central chi-squared distributions.

This means that the non-central chi-squared distribution is often used to study the power of said statistical tests. In this post I give the definition of the non-central chi-squared distribution, discuss an important invariance property and show how to efficiently sample from this distribution.

### Definition

Let $Z$ be a normally distributed random vector with mean $0$ and covariance $I_n$. Given a vector $\mu \in \mathbb{R}^n$, the non-central chi-squared distribution with $n$ degrees of freedom and non-centrality parameter $\Vert \mu\Vert_2^2$ is the distribution of the quantity

$\Vert Z+\mu \Vert_2^2 = \sum\limits_{i=1}^n (Z_i+\mu_i)^2.$

This distribution is denoted by $\chi^2_n(\Vert \mu \Vert_2^2)$. As this notation suggests, the distribution of $\Vert Z+\mu \Vert_2^2$ depends only on $\Vert \mu \Vert_2^2$, the norm of $\mu$. The first few times I heard this fact, I had no idea why it would be true (and even found it a little spooky). But, as we will see below, the result is actually a simple consequence of the fact that standard normal vectors are invariant under rotations.

### Rotational invariance

Suppose that we have two vectors $\mu, \nu \in \mathbb{R}^n$ such that $\Vert \mu\Vert_2^2 = \Vert \nu \Vert_2^2$. We wish to show that if $Z \sim \mathcal{N}(0,I_n)$, then

$\Vert Z+\mu \Vert_2^2$ has the same distribution as $\Vert Z + \nu \Vert_2^2$.

Since $\mu$ and $\nu$ have the same norm there exists an orthogonal matrix $U \in \mathbb{R}^{n \times n}$ such that $U\mu = \nu$. Since $U$ is orthogonal and $Z \sim \mathcal{N}(0,I_n)$, we have $Z'=UZ \sim \mathcal{N}(U0,UU^T) = \mathcal{N}(0,I_n)$. Furthermore, since $U$ is orthogonal, $U$ preserves the norm $\Vert \cdot \Vert_2^2$. This is because, for all $x \in \mathbb{R}^n$,

$\Vert Ux\Vert_2^2 = (Ux)^TUx = x^TU^TUx=x^Tx=\Vert x\Vert_2^2.$

Putting all these pieces together we have

$\Vert Z+\mu \Vert_2^2 = \Vert U(Z+\mu)\Vert_2^2 = \Vert UZ + U\mu \Vert_2^2 = \Vert Z'+\nu \Vert_2^2$.

Since $Z$ and $Z'$ have the same distribution, we can conclude that $\Vert Z'+\nu \Vert_2^2$ has the same distribution as $\Vert Z + \nu \Vert_2^2$. Since $\Vert Z + \mu \Vert_2^2 = \Vert Z'+\nu \Vert_2^2$, we are done.
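The invariance is also easy to check empirically. Below is a minimal simulation sketch (my own; the particular vectors $\mu$ and $\nu$ are arbitrary choices with the same squared norm $\Vert\mu\Vert_2^2 = \Vert\nu\Vert_2^2 = 9$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000

# Two different vectors with the same squared norm (9).
mu = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
nu = np.array([0.0, 3.0 / np.sqrt(2), 3.0 / np.sqrt(2), 0.0, 0.0])

Z = rng.standard_normal((reps, n))
a = ((Z + mu) ** 2).sum(axis=1)  # samples of ||Z + mu||_2^2
b = ((Z + nu) ** 2).sum(axis=1)  # samples of ||Z + nu||_2^2

# Both empirical means should be close to the exact mean n + ||mu||^2 = 14.
print(a.mean(), b.mean())
```

A more thorough check would compare the full empirical distributions (for example with a quantile-quantile plot), but matching means and histograms already make the invariance plausible.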

### Sampling

Above we showed that the non-central chi-squared distribution $\chi^2_n(\Vert \mu\Vert_2^2)$ depends only on the norm of the vector $\mu$. We will now use this fact to give an algorithm that efficiently generates samples from $\chi^2_n(\Vert \mu \Vert_2^2)$.

A naive way to sample from $\chi^2_n(\Vert \mu \Vert_2^2)$ would be to sample $n$ independent standard normal random variables $Z_i$ and return $\sum_{i=1}^n (Z_i+\mu_i)^2$. But for large values of $n$ this is slow: we have to simulate $n$ auxiliary random variables $Z_i$ for every sample from $\chi^2_n(\Vert \mu \Vert_2^2)$, which does not scale well if we need many samples.

An alternative approach uses the rotation invariance described above. The distribution $\chi^2_n(\Vert \mu \Vert_2^2)$ depends only on $\Vert \mu \Vert_2^2$ and not directly on $\mu$. Thus, given $\mu$, we could instead work with $\nu = \Vert \mu \Vert_2 e_1$ where $e_1$ is the vector with a $1$ in the first coordinate and $0$s in all other coordinates. If we use $\nu$ instead of $\mu$, we have

$\sum\limits_{i=1}^n (Z_i+\nu_i)^2 = (Z_1+\Vert \mu \Vert_2)^2 + \sum\limits_{i=2}^{n}Z_i^2.$

The sum $\sum_{i=2}^n Z_i^2$ follows the regular chi-squared distribution with $n-1$ degrees of freedom and is independent of $Z_1$. The regular chi-squared distribution is a special case of the gamma distribution and can be effectively sampled with rejection sampling for large shape parameter (see here).

The shape parameter for $\sum_{i=2}^n Z_i^2$ is $\frac{n-1}{2}$, so for large values of $n$ we can efficiently sample a value $Y$ that follows the same distribution as $\sum_{i=2}^n Z_i^2 \sim \chi^2_{n-1}$. Finally, to get a sample from $\chi^2_n(\Vert \mu \Vert_2^2)$, we independently sample $Z_1$ and return the sum $(Z_1+\Vert \mu\Vert_2)^2 +Y$.
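Putting the pieces together, here is a hedged sketch of the sampler (my own implementation of the recipe above, using NumPy's built-in chi-square generator for the $\chi^2_{n-1}$ term):

```python
import numpy as np

def sample_noncentral_chi2(n, mu_norm_sq, size, rng):
    """Sample from chi^2_n(||mu||^2) using only two random draws per sample.

    By the rotation invariance, only ||mu||_2 matters, so we may take
    mu = ||mu||_2 * e_1 and split the sum into (Z_1 + ||mu||_2)^2
    plus an independent chi^2_{n-1} term.
    """
    z1 = rng.standard_normal(size)
    y = rng.chisquare(n - 1, size)  # plays the role of sum_{i=2}^n Z_i^2
    return (z1 + np.sqrt(mu_norm_sq)) ** 2 + y

rng = np.random.default_rng(1)
samples = sample_noncentral_chi2(n=1000, mu_norm_sq=4.0, size=100_000, rng=rng)

# Sanity check: the exact mean is n + ||mu||^2 = 1004.
print(samples.mean())
```

For comparison, SciPy exposes the same distribution directly as `scipy.stats.ncx2(df, nc)`, but the point here is that the two-draw reduction is all the algorithm needs.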

### Conclusion

In this post, we saw that the rotational invariance of the standard normal distribution gives a similar invariance for the non-central chi-squared distribution.

This invariance allowed us to efficiently sample from the non-central chi-squared distribution. The sampling procedure worked by reducing the problem to sampling from the regular chi-squared distribution.

The same invariance property is also used to calculate the cumulative distribution function and density of the non-central chi-squared distribution, although the resulting formulas are not for the faint of heart.

## Non-measurable sets, cardinality and the axiom of choice

The following post is based on a talk I gave at the 2022 Stanford statistics retreat. The talk was titled “Another non-measurable monster”.

The material was based on the discussion and references given in this stackexchange post. The title is a reference to a Halloween lecture on measurability given by Professor Persi Diaconis.

### What’s scarier than a non-measurable set?

Making every set measurable. Or rather one particular consequence of making every set measurable.

In my talk, I argued that if you make every set measurable, then there exists a set $\Omega$ and an equivalence relation $\sim$ on $\Omega$ such that $|\Omega| < |\Omega / \sim|$. That is, the set $\Omega$ has strictly smaller cardinality than the set of equivalence classes $\Omega/\sim$. The contradictory nature of this statement is illustrated in the picture below.

To make sense of this we’ll first have to be a bit more precise about what we mean by cardinality.

### What do we mean by bigger and smaller?

Let $A$ and $B$ be two sets. We say that $A$ and $B$ have the same cardinality and write $|A| = |B|$ if there exists a bijective function $f:A \to B$. We can think of the function $f$ as a way of pairing each element of $A$ with a unique element of $B$ such that every element of $B$ is paired with an element of $A$.

We next want to define $|A|\le |B|$ which means $A$ has cardinality at most the cardinality of $B$. There are two reasonable ways in which we could try to define this relationship

1. We could say $|A|\le |B|$ means that there exists an injective function $f : A \to B$.
2. Alternatively, we could say that $|A|\le |B|$ means that there exists a surjective function $g:B \to A$.

Definitions 1 and 2 say similar things and, in the presence of the axiom of choice, they are equivalent. Since we are going to be making every set measurable in this talk, we won’t be assuming the axiom of choice. Definitions 1 and 2 are thus no longer equivalent and we have a decision to make. We will use definition 1 in this talk. For justification, note that definition 1 implies that there exists a subset $B' \subseteq B$ such that $|A|=|B'|$. We simply take $B'$ to be the range of $f$. This is a desirable property of the relation $|A|\le |B|$ and it’s not clear how this could be achieved using definition 2.

### Infinite binary sequences

It’s time to introduce the set $\Omega$ and the equivalence relation we will be working with. The set $\Omega$ is $\{0,1\}^\mathbb{Z}$, the set of all functions $\omega : \mathbb{Z} \to \{0,1\}$. We can think of each element $\omega \in \Omega$ as an infinite sequence of zeros and ones stretching off in both directions. For example

$\omega = \ldots 1110110100111\ldots$.

But this analogy hides something important. Each $\omega \in \Omega$ has a “middle” which is the point $\omega_0$. For instance, the two sequences below look the same but when we make $\omega_0$ bold we see that they are different.

$\omega = \ldots 111011\mathbf{0}100111\ldots$,

$\omega' = \ldots 1110110\mathbf{1}00111\ldots$.

The equivalence relation $\sim$ on $\Omega$ can be thought of as forgetting the location $\omega_0$. More formally, we have $\omega \sim \omega'$ if and only if there exists $n \in \mathbb{Z}$ such that $\omega_{n+k} = \omega_{k}'$ for all $k \in \mathbb{Z}$. That is, if we shift the sequence $\omega$ by $n$ we get the sequence $\omega'$. We will use $[\omega]$ to denote the equivalence class of $\omega$ and $\Omega/\sim$ for the set of all equivalence classes.

### Some probability

Associated with the space $\Omega$ are functions $X_k : \Omega \to \{0,1\}$, one for each integer $k \in \mathbb{Z}$. These functions simply evaluate $\omega$ at $k$. That is, $X_k(\omega)=\omega_k$. A probabilist or statistician would think of $X_k$ as reporting the result of one of infinitely many independent coin tosses. Normally, to make this formal, we would have to first define a $\sigma$-algebra on $\Omega$ and then define a probability on this $\sigma$-algebra. Today we’re working in a world where every set is measurable, so we don’t have to worry about $\sigma$-algebras. Indeed, we have the following result:

Theorem (Solovay, 1970)1: There exists a model of the Zermelo-Fraenkel axioms of set theory in which there is a probability $\mathbb{P}$ defined on all subsets of $\Omega$ such that the $X_k$ are i.i.d. $\mathrm{Bernoulli}(0.5)$.

This result says that there is a world in which all the regular axioms of set theory hold, other than the axiom of choice. And in this world, we can assign a probability to every subset $A \subseteq \Omega$ in such a way that the events $\{X_k=1\}$ are all independent and have probability $0.5$. It’s important to note that this is a true countably additive probability measure and we can apply all our familiar probability results to $\mathbb{P}$. We are now ready to state and prove the spooky result claimed at the start of this talk.

Proposition: Given the existence of such a probability $\mathbb{P}$, $|\Omega | < |\Omega /\sim|$.

Proof: Let $f:\Omega/\sim \to \Omega$ be any function. To show that $|\Omega|<|\Omega /\sim|$ we need to show that $f$ is not injective. To do this, we’ll first define another function $g:\Omega \to \Omega$ given by $g(\omega)=f([\omega])$. That is, $g$ first maps $\omega$ to $\omega$‘s equivalence class and then applies $f$ to this equivalence class. This is illustrated below.

We will show that $g : \Omega \to \Omega$ is almost surely constant with respect to $\mathbb{P}$. That is, there exists $\omega^\star \in \Omega$ such that $\mathbb{P}(g(\omega)=\omega^\star)=1$. Each equivalence class $[\omega]$ is finite or countable and thus has probability zero under $\mathbb{P}$. This means that if $g$ is almost surely constant, then $f$ cannot be injective and must map multiple (in fact infinitely many) equivalence classes to $\omega^\star$.

It thus remains to show that $g:\Omega \to \Omega$ is almost surely constant. To do this we will introduce a third function $\varphi : \Omega \to \Omega$. The map $\varphi$ is simply the shift map and is given by $\varphi(\omega)_k = \omega_{k+1}$. Note that $\omega$ and $\varphi(\omega)$ are in the same equivalence class for every $\omega\in \Omega$. Thus, the map $g$ satisfies $g\circ \varphi = g$. That is $g$ is $\varphi$-invariant.

The map $\varphi$ is ergodic. This means that if $A \subseteq \Omega$ satisfies $\varphi(A)=A$, then $\mathbb{P}(A)$ equals $0$ or $1$. For example if $A$ is the event that $10110$ appears at some point in $\omega$, then $\varphi(A)=A$ and $\mathbb{P}(A)=1$. Likewise if $A$ is the event that the relative frequency of heads converges to a number strictly greater than $0.5$, then $\varphi(A)=A$ and $\mathbb{P}(A)=0$. The general claim that all $\varphi$-invariant events have probability $0$ or $1$ can be proved using the independence of $X_k$.

For each $k$, define an event $A_k$ by $A_k = \{\omega : g(\omega)_k = 1\}$. Since $g$ is $\varphi$-invariant we have that $\varphi(A_k)=A_k$. Thus, $\mathbb{P}(A_k)=0$ or $1$. This gives us a function $\omega^\star :\mathbb{Z} \to \{0,1\}$ given by $\omega^\star_k = \mathbb{P}(A_k)$. Note that for every $k$, $\mathbb{P}(\{\omega : g(\omega)_k = \omega_k^\star\}) = 1$. This is because if $\omega_{k}^\star=1$, then $\mathbb{P}(\{\omega: g(\omega)_k = 1\})=1$, by definition of $\omega_k^\star$. Likewise if $\omega_k^\star =0$, then $\mathbb{P}(\{\omega:g(\omega)_k=1\})=0$ and hence $\mathbb{P}(\{\omega:g(\omega)_k=0\})=1$. Thus, in both cases, $\mathbb{P}(\{\omega : g(\omega)_k = \omega_k^\star\})= 1$.

Since $\mathbb{P}$ is a probability measure, we can conclude that

$\mathbb{P}(\{\omega : g(\omega)=\omega^\star\}) = \mathbb{P}\left(\bigcap_{k \in \mathbb{Z}} \{\omega : g(\omega)_k = \omega_k^\star\}\right)=1$.

Thus, $g$ maps $\Omega$ to $\omega^\star$ with probability one, showing that $g$ is almost surely constant and hence that $f$ is not injective. $\square$

### There’s a catch!

So we have proved that there cannot be an injective map $f : \Omega/\sim \to \Omega$. Does this mean we have proved $|\Omega| < |\Omega/\sim|$? Technically no. We have proved the negation of $|\Omega/\sim|\le |\Omega|$, which does not imply $|\Omega| \le |\Omega/\sim|$. To argue that $|\Omega| < |\Omega/\sim|$ we need to produce an injective map $h: \Omega \to \Omega/\sim$. Surprisingly, this is possible and not too difficult. The idea is to find a map $h : \Omega \to \Omega$ such that $h(\omega)\sim h(\omega')$ implies $\omega = \omega'$; the map $\omega \mapsto [h(\omega)]$ is then injective. This can be done by somehow encoding in $h(\omega)$ where the centre of $\omega$ is.

### A simpler proof and other examples

Our proof was nice because we explicitly calculated the value $\omega^\star$ where $g$ sent almost all of $\Omega$. We could have been less explicit and simply noted that the function $g:\Omega \to \Omega$ was measurable with respect to the invariant $\sigma$-algebra of $\varphi$ and hence almost surely constant by the ergodicity of $\varphi$.

This quicker proof allows us to generalise our “spooky result” to other sets. Below are two examples where $\Omega = [0,1)$:

• Fix $\theta \in [0,1)\setminus \mathbb{Q}$ and define $\omega \sim \omega'$ if and only if $\omega + n \theta= \omega'$ for some $n \in \mathbb{Z}$.
• $\omega \sim \omega'$ if and only if $\omega - \omega' \in \mathbb{Q}$.

A similar argument shows that in Solovay’s world we again have $|\Omega| < |\Omega/\sim|$ for these examples; the result follows from the ergodicity of the corresponding transformations of $\Omega$ under the uniform measure.

### Three takeaways

I hope you agree that this example is good fun and surprising. I’d like to end with some remarks.

• The first remark is some mathematical context. The argument given today is linked to an interesting area of mathematics called descriptive set theory. This field studies the properties of well-behaved subsets (such as Borel subsets) of topological spaces. Descriptive set theory incorporates logic, topology and ergodic theory. I don’t know much about the field, but in Persi’s Halloween talk he said that one “monster” was that few people are interested in the subject.
• The next remark is a better way to think about our “spooky result”. The result is really saying something about cardinality. When we no longer use the axiom of choice, cardinality becomes a subtle concept. The statement $|A|\le |B|$ no longer corresponds to $A$ being “smaller” than $B$ but rather that $A$ is “less complex” than $B$. This is perhaps analogous to some statistical models which may be “large” but do not overfit due to subtle constraints on the model complexity.
• In light of the previous remark, I would invite you to think about whether the example I gave is truly spookier than non-measurable sets. It might seem to you that it is simply a reasonable consequence of removing the axiom of choice and restricting ourselves to functions we could actually write down or understand. I’ll let you decide.

### Footnotes

1. Technically, Solovay proved that there exists a model of set theory in which every subset of $\mathbb{R}$ is Lebesgue measurable. To get the result for binary sequences we have to restrict to $[0,1)$ and use the binary expansion of $x \in [0,1)$ to define a function $[0,1) \to \Omega$. Solovay’s paper is available here: https://www.jstor.org/stable/1970696?seq=1

## The Singular Value Decomposition (SVD)

The singular value decomposition (SVD) is a powerful matrix decomposition that is used all the time in statistics and numerical linear algebra. The SVD is at the heart of principal component analysis, it clarifies what’s going on in ridge regression, and it is one way to construct the Moore-Penrose inverse of a matrix. For more SVD love, see the tweets below.

In this post I’ll define the SVD and prove that it always exists. At the end we’ll look at some pictures to better understand what’s going on.

### Definition

Let $X$ be an $n \times p$ matrix. We will define the singular value decomposition first in the case $n \ge p$. The SVD consists of three matrices $U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p}$ and $V \in \mathbb{R}^{p \times p}$ such that $X = U\Sigma V^T$. The matrix $\Sigma$ is required to be diagonal with non-negative diagonal entries $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_p \ge 0$. These numbers are called the singular values of $X$. The matrices $U$ and $V$ are required to have orthonormal columns, so that $U^TU=V^TV = I_p$, the $p \times p$ identity matrix. Note that since $V$ is square we also have $VV^T=I_p$; however, we won’t have $UU^T = I_n$ unless $n = p$.

In the case when $n \le p$, we can define the SVD of $X$ in terms of the SVD of $X^T$. Let $\widetilde{U} \in \mathbb{R}^{p \times n}, \widetilde{\Sigma} \in \mathbb{R}^{n \times n}$ and $\widetilde{V} \in \mathbb{R}^{n \times n}$ be the SVD of $X^T$ so that $X^T=\widetilde{U}\widetilde{\Sigma}\widetilde{V}^T$. The SVD of $X$ is then given by transposing both sides of this equation giving $U = \widetilde{V}, \Sigma = \widetilde{\Sigma}^T=\widetilde{\Sigma}$ and $V = \widetilde{U}$.
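The defining properties are easy to check numerically. The sketch below (my own illustration) uses NumPy's built-in SVD; with `full_matrices=False` it returns exactly the $n \times p$ and $p \times p$ shapes used above when $n \ge p$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.standard_normal((n, p))

# U: n x p, s: the p singular values, Vt: p x p (this is V^T).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(U @ np.diag(s) @ Vt, X)         # X = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(p))             # U has orthonormal columns
assert np.allclose(Vt @ Vt.T, np.eye(p))           # V is orthogonal
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)  # decreasing, non-negative
print("all SVD properties verified")
```

Running the same check with $n < p$ and `full_matrices=False` gives the transposed picture described above.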

### Construction

The SVD of a matrix can be found by iteratively solving an optimisation problem. We will first describe an iterative procedure that produces matrices $U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p}$ and $V \in \mathbb{R}^{p \times p}$. We will then verify that $U,\Sigma$ and $V$ satisfy the defining properties of the SVD.

We will construct the matrices $U$ and $V$ one column at a time and we will construct the diagonal matrix $\Sigma$ one entry at a time. To construct the first columns and entries, recall that the matrix $X$ is really a linear function from $\mathbb{R}^p$ to $\mathbb{R}^n$ given by $v \mapsto Xv$. We can thus define the operator norm of $X$ via

$\Vert X \Vert = \sup\left\{ \|Xv\|_2 : \|v\|_2 =1\right\},$

where $\|v\|_2$ represents the Euclidean norm of $v \in \mathbb{R}^p$ and $\|Xv\|_2$ is the Euclidean norm of $Xv \in \mathbb{R}^n$. The set of vectors $\{v \in \mathbb{R}^p : \|v\|_2 = 1 \}$ is a compact set and the function $v \mapsto \|Xv\|_2$ is continuous. Thus, the supremum used to define $\Vert X \Vert$ is achieved at some vector $v_1 \in \mathbb{R}^p$. Define $\sigma_1 = \|X v_1\|_2$. If $\sigma_1 \neq 0$, then define $u_1 = Xv_1/\sigma_1 \in \mathbb{R}^n$. If $\sigma_1 = 0$, then define $u_1$ to be an arbitrary vector in $\mathbb{R}^n$ with $\|u_1\|_2 = 1$. To summarise, we have

• $v_1 \in \mathbb{R}^p$ with $\|v_1\|_2 = 1$.
• $\sigma_1 = \|X\| = \|Xv_1\|_2$.
• $u_1 \in \mathbb{R}^n$ with $\|u_1\|_2=1$ and $Xv_1 = \sigma_1u_1$.

We have now started to fill in our SVD. The number $\sigma_1 \ge 0$ is the first singular value of $X$ and the vectors $v_1$ and $u_1$ will be the first columns of the matrices $V$ and $U$ respectively.

Now suppose that we have found the first $k$ singular values $\sigma_1,\ldots,\sigma_k$ and the first $k$ columns of $V$ and $U$. If $k = p$, then we are done. Otherwise we repeat a similar process.

Let $v_1,\ldots,v_k$ and $u_1,\ldots,u_k$ be the first $k$ columns of $V$ and $U$. The vectors $v_1,\ldots,v_k$ split $\mathbb{R}^p$ into two subspaces. These subspaces are $S_1 = \text{span}\{v_1,\ldots,v_k\}$ and $S_2 = S_1^\perp$, the orthogonal complement of $S_1$. By restricting $X$ to $S_2$ we get a new linear map $X_{|S_2} : S_2 \to \mathbb{R}^n$. As before, the operator norm of $X_{|S_2}$ is defined to be

$\|X_{|S_2}\| = \sup\{\|X_{|S_2}v\|_2:v \in S_2, \|v\|_2=1\}$.

Since $S_2 = \text{span}\{v_1,\ldots,v_k\}^\perp$ we must have

$\|X_{|S_2}\| = \sup\{\|Xv\|_2:v \in \mathbb{R}^p, \|v\|_2=1, v_j^Tv = 0 \text{ for } j=1,\ldots,k\}.$

The set $\{v \in \mathbb{R}^p : \|v\|_2=1, v_j^Tv=0\text{ for } j=1,\ldots,k\}$ is a compact set and thus there exists a vector $v_{k+1}$ such that $\|Xv_{k+1}\|_2 = \|X_{|S_2}\|$. As before, define $\sigma_{k+1} = \|Xv_{k+1}\|_2$ and $u_{k+1} = Xv_{k+1}/\sigma_{k+1}$ if $\sigma_{k+1}\neq 0$. If $\sigma_{k+1} = 0$, then define $u_{k+1}$ to be any unit vector in $\mathbb{R}^{n}$ that is orthogonal to $u_1,u_2,\ldots,u_k$.

This process repeats until eventually $k = p$ and we have produced matrices $U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p}$ and $V \in \mathbb{R}^{p \times p}$. In the next section, we will argue that these three matrices satisfy the properties of the SVD.
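As a sanity check, the construction can be carried out numerically. In the sketch below (my own illustration, written for clarity rather than efficiency), each maximisation of $\|Xv\|_2$ over the orthogonal complement is solved via the top eigenvector of $PX^TXP$, where $P$ projects onto that complement; this is equivalent to the optimisation in the text:

```python
import numpy as np

def greedy_svd(X):
    """Construct U, Sigma, V one column at a time, as in the text.

    Each v_{k+1} maximises ||Xv||_2 over unit vectors orthogonal to
    v_1, ..., v_k; that maximiser is the top eigenvector of P X^T X P,
    where P projects onto the orthogonal complement of the v's so far.
    """
    n, p = X.shape
    U, V, sigma = np.zeros((n, p)), np.zeros((p, p)), np.zeros(p)
    P = np.eye(p)  # projector onto span{v_1,...,v_k}^perp
    for k in range(p):
        eigvals, eigvecs = np.linalg.eigh(P @ X.T @ X @ P)
        v = eigvecs[:, -1]           # unit eigenvector of largest eigenvalue
        s = np.linalg.norm(X @ v)    # sigma_{k+1} = ||X v||_2
        V[:, k], sigma[k] = v, s
        # If s = 0 the text picks an arbitrary orthogonal unit vector;
        # a zero column is used here for simplicity.
        U[:, k] = X @ v / s if s > 1e-12 else np.zeros(n)
        P = P - np.outer(v, v)       # remove v from the search space
    return U, sigma, V

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
U, sigma, V = greedy_svd(X)
assert np.allclose(U @ np.diag(sigma) @ V.T, X)
assert np.allclose(sigma, np.linalg.svd(X, compute_uv=False))
```

The two asserts at the end check exactly the two non-obvious claims of the next section: the greedy procedure reconstructs $X$, and its $\sigma$'s agree with the singular values from a library SVD.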

### Correctness

The defining properties of the SVD were given at the start of this post. We will see that most of the properties follow immediately from the construction but one of them requires a bit more analysis. Let $U = [u_1,\ldots,u_p]$, $\Sigma = \text{diag}(\sigma_1,\ldots,\sigma_p)$ and $V= [v_1,\ldots,v_p]$ be the output from the above construction.

First note that by construction $v_1,\ldots, v_p$ are orthonormal, since each $v_{k+1}$ is a unit vector in $\text{span}\{v_1,\ldots,v_k\}^\perp$. It follows that the matrix $V$ is orthogonal and so $V^TV=VV^T=I_p$.

The matrix $\Sigma$ is diagonal by construction. Furthermore, we have that $\sigma_{k+1} \le \sigma_k$ for every $k$. This is because both $\sigma_k$ and $\sigma_{k+1}$ were defined as the maximum value of $\|Xv\|_2$ over different subsets of $\mathbb{R}^p$, and the subset for $\sigma_k$ contains the subset for $\sigma_{k+1}$.

We’ll next verify that $X = U\Sigma V^T$. Since $V$ is orthogonal, the vectors $v_1,\ldots,v_p$ form an orthonormal basis for $\mathbb{R}^p$. It thus suffices to check that $Xv_k = U\Sigma V^Tv_k$ for $k = 1,\ldots,p$. Again by the orthogonality of $V$ we have that $V^Tv_k = e_k$, the $k^{th}$ standard basis vector. Thus,

$U\Sigma V^Tv_k = U\Sigma e_k = U\sigma_k e_k = \sigma_k u_k.$

Above, we used that $\Sigma$ was a diagonal matrix and that $u_k$ is the $k^{th}$ column of $U$. If $\sigma_k \neq 0$, then $\sigma_k u_k = Xv_k$ by definition. If $\sigma_k =0$, then $\|Xv_k\|_2=0$ and so $Xv_k = 0 = \sigma_ku_k$ also. Thus, in either case, $U\Sigma V^Tv_k = Xv_k$ and so $U\Sigma V^T = X$.

The last property we need to verify is that $U$ is orthogonal. Note that this isn’t obvious. At each stage of the process, we made sure that $v_{k+1} \in \text{span}\{v_1,\ldots,v_k\}^\perp$. However, in the case that $\sigma_{k+1} \neq 0$, we simply defined $u_{k+1} = Xv_{k+1}/\sigma_{k+1}$. It is not clear why this would imply that $u_{k+1}$ is orthogonal to $u_1,\ldots,u_k$.

It turns out that a geometric argument is needed to show this. The idea is that if $u_{k+1}$ was not orthogonal to $u_j$ for some $j \le k$, then $v_j$ couldn’t have been the value that maximises $\|Xv\|_2$.

Let $u_{k}$ and $u_j$ be two columns of $U$ with $j < k$ and $\sigma_j,\sigma_k > 0$. We wish to show that $u_j^Tu_k = 0$. To show this we will use the fact that $v_j$ and $v_k$ are orthonormal and perform “polar interpolation”. That is, for $\lambda \in [0,1]$, define

$v_\lambda = \sqrt{1-\lambda}\cdot v_{j}+\sqrt{\lambda} \cdot v_k.$

Since $v_{j}$ and $v_k$ are orthogonal, we have that

$\|v_\lambda\|_2^2 = (1-\lambda)\|v_{j}\|_2^2+\lambda\|v_k\|_2^2 = (1-\lambda)+\lambda = 1.$

Furthermore $v_\lambda$ is orthogonal to $v_1,\ldots,v_{j-1}$. Thus, by definition of $v_j$,

$\|Xv_\lambda\|_2^2 \le \sigma_j^2 = \|Xv_j\|_2^2.$

By the linearity of $X$ and the definitions of $u_j,u_k$,

$\|Xv_\lambda\|_2^2 = \|\sqrt{1-\lambda}\cdot Xv_j+\sqrt{\lambda}\cdot Xv_{k}\|_2^2 = \|\sigma_j\sqrt{1-\lambda}\cdot u_j+\sigma_{k}\sqrt{\lambda}\cdot u_{k}\|_2^2$.

Since $Xv_j = \sigma_ju_j$ and $Xv_{k}=\sigma_{k}u_{k}$, we have

$(1-\lambda)\sigma_j^2 + 2\sqrt{\lambda(1-\lambda)}\cdot \sigma_j\sigma_k u_j^Tu_{k}+\lambda\sigma_k^2 = \|Xv_\lambda\|_2^2 \le \sigma_j^2.$

Rearranging and dividing by $\sqrt{\lambda}$ gives,

$2\sqrt{1-\lambda}\cdot \sigma_j\sigma_k\, u_j^Tu_k \le \sqrt{\lambda}\cdot(\sigma_j^2-\sigma_k^2) \quad \text{for all } \lambda \in (0,1].$

Taking $\lambda \searrow 0$ and using $\sigma_j\sigma_k > 0$ gives $u_j^Tu_k \le 0$. Performing the same argument with $v_\lambda' = \sqrt{1-\lambda}\cdot v_j - \sqrt{\lambda}\cdot v_k$ shows that $-u_j^Tu_k \le 0$ and hence $u_j^Tu_k = 0$.

We have thus proved that $U$ is orthogonal. This proof is pretty “slick” but it isn’t very illuminating. To better demonstrate the concept, I made an interactive Desmos graph that you can access here.

This graph shows example vectors $u_j, u_k \in \mathbb{R}^2$. The vector $u_j$ is fixed at $(1,0)$ and a quarter circle of radius $1$ is drawn. Any vectors $u$ that are outside this circle have $\|u\|_2 > 1 = \|u_j\|_2$.

The vector $u_k$ can be moved around inside this quarter circle. This can be done either by clicking and dragging the point or by changing the values of $a$ and $b$ on the left. The red curve is the path of

$\lambda \mapsto \sqrt{1-\lambda}\cdot u_j+\sqrt{\lambda}\cdot u_k$.

As $\lambda$ goes from $0$ to $1$, the path travels from $u_j$ to $u_k$.

Note that there is a portion of the red curve near $u_j$ that is outside the black circle. This corresponds to a small value of $\lambda > 0$ that results in $\|X v_\lambda\|_2 > \|Xv_j\|_2$ contradicting the definition of $v_j$. By moving the point $u_k$ around in the plot you can see that this always happens unless $u_k$ lies exactly on the y-axis. That is, unless $u_k$ is orthogonal to $u_j$.

## Double cosets and contingency tables

Like my previous post, this blog is also motivated by a comment by Professor Persi Diaconis in his recent Stanford probability seminar. The seminar was about a way of “collapsing” a random walk on a group to a random walk on the set of double cosets. In this post, I’ll first define double cosets and then go over the example Professor Diaconis used to make us probabilists and statisticians more comfortable with all the group theory he was discussing.

### Double cosets

Let $G$ be a group and let $H$ and $K$ be two subgroups of $G$. For each $g \in G$, the $(H,K)$-double coset containing $g$ is defined to be the set

$HgK = \{hgk : h \in H, k \in K\}$

To simplify notation, we will simply write double coset instead of $(H,K)$-double coset. The double coset of $g$ can also be defined as the equivalence class of $g$ under the relation

$g \sim g' \Longleftrightarrow g' = hgk$ for some $h \in H$ and $k \in K$

As with regular cosets, one can check that the above relation is indeed an equivalence relation. Thus, the group $G$ can be written as a disjoint union of double cosets. The set of all double cosets of $G$ is denoted by $H\backslash G / K$. That is,

$H\backslash G /K = \{HgK : g \in G\}$

Note that if we take $H = \{e\}$, the trivial subgroup, then the double cosets are simply the left cosets of $K$, $G / K$. Likewise if $K = \{e\}$, then the double cosets are the right cosets of $H$, $H \backslash G$. Thus, double cosets generalise both left and right cosets.

### Double cosets in $S_n$

Fix a natural number $n > 0$. A partition of $n$ is a finite sequence $a = (a_1,a_2,\ldots, a_I)$ such that $a_1,a_2,\ldots, a_I \in \mathbb{N}$, $a_1 \ge a_2 \ge \ldots \ge a_I > 0$ and $\sum_{i=1}^I a_i = n$. For each partition $a$ of $n$, we can form a subgroup $S_a$ of the symmetric group $S_n$. The subgroup $S_a$ contains all permutations $\sigma \in S_n$ that fix the sets $A_1 = \{1,\ldots, a_1\}, A_2 = \{a_1+1,\ldots, a_1 + a_2\},\ldots, A_I = \{n - a_I +1,\ldots, n\}$, meaning that $\sigma(A_i) = A_i$ for all $1 \le i \le I$. Thus, a permutation $\sigma \in S_a$ must individually permute the elements of $A_1$, the elements of $A_2$ and so on. This means that, in a natural way,

$S_a \cong S_{a_1} \times S_{a_2} \times \ldots \times S_{a_I}$

If we have two partitions $a = (a_1,a_2,\ldots, a_I)$ and $b = (b_1,b_2,\ldots,b_J)$, then we can form two subgroups $H = S_a$ and $K = S_b$ and consider the double cosets $H \backslash G / K = S_a \backslash S_n / S_b$. The claim made in the seminar was that the double cosets are in one-to-one correspondence with $I \times J$ contingency tables with row sums equal to $a$ and column sums equal to $b$. Before we explain this correspondence and properly define contingency tables, let’s first consider the cases when either $H$ or $K$ is the trivial subgroup.

### Left cosets in $S_n$

Note that if $a = (1,1,\ldots,1)$, then $S_a$ is the trivial subgroup and, as noted above, $S_a \backslash S_n / S_b$ is simply equal to $S_n / S_b$. We will see that the cosets in $S_n/S_b$ can be described by forgetting something about the permutations in $S_n$.

We can think of the permutations in $S_n$ as all the ways of drawing without replacement $n$ balls labelled $1,2,\ldots, n$. We can think of the partition $b = (b_1,b_2,\ldots,b_J)$ as a colouring of the $n$ balls by $J$ colours. We colour balls $1,2,\ldots, b_1$ by the first colour $c_1$, then we colour $b_1+1,b_1+2,\ldots, b_1+b_2$ the second colour $c_2$ and so on until we colour $n-b_J+1, n-b_J+2,\ldots, n$ the final colour $c_J$. Below is an example when $n$ is equal to 6 and $b = (3,2,1)$.

Note that a permutation $\sigma \in S_n$ is in $S_b$ if and only if we draw the balls by colour groups, i.e. we first draw all the balls with colour $c_1$, then we draw all the balls with colour $c_2$ and so on. Thus, continuing with the previous example, the permutation $\sigma_1$ below is in $S_b$ but $\sigma_2$ is not in $S_b$.

It turns out that we can think of the cosets in $S_n / S_b$ as what happens when we “forget” the labels $1,2,\ldots,n$ and only remember the colours of the balls. By “forgetting” the labels we mean only paying attention to the list of colours. That is, for all $\sigma_1,\sigma_2 \in S_n$, $\sigma_1 S_b = \sigma_2 S_b$ if and only if the list of colours from the draw $\sigma_1$ is the same as the list of colours from the draw $\sigma_2$. Thus, the below two permutations define the same coset of $S_b$.

To see why this is true, note that $\sigma_1 S_b = \sigma_2 S_b$ if and only if $\sigma_2 = \sigma_1 \circ \tau$ for some $\tau \in S_b$. Furthermore, $\tau \in S_b$ if and only if $\tau$ maps each colour group to itself. Recall that function composition is read right to left. Thus, the equation $\sigma_2 = \sigma_1 \circ \tau$ means that if we first relabel the balls according to $\tau$ and then draw the balls according to $\sigma_1$, then we get the same result as just drawing by $\sigma_2$. That is, $\sigma_2 = \sigma_1 \circ \tau$ for some $\tau \in S_b$ if and only if drawing by $\sigma_2$ is the same as first relabelling the balls within each colour group and then drawing the balls according to $\sigma_1$. Thus, $\sigma_1 S_b = \sigma_2 S_b$ if and only if, when we forget the labels of the balls and only look at the colours, $\sigma_1$ and $\sigma_2$ give the same list of colours. This is illustrated with our running example below.

### Right cosets in $S_n$

Typically, the subgroup $S_a$ is not a normal subgroup of $S_n$. This means that the right coset $S_a \sigma$ will not equal the left coset $\sigma S_a$. Thus, colouring the balls and forgetting the labelling won’t describe the right cosets $S_a \backslash S_n$. We’ll see that a different type of forgetting can be used to describe $S_a \backslash S_n = \{S_a\sigma : \sigma \in S_n\}$.

Fix a partition $a = (a_1,a_2,\ldots,a_I)$ and now, instead of considering $I$ colours, think of $I$ different people $p_1,p_2,\ldots,p_I$. As before, a permutation $\sigma \in S_n$ can be thought of as drawing $n$ balls labelled $1,\ldots,n$ without replacement. We can imagine giving the first $a_1$ balls drawn to person $p_1$, then giving the next $a_2$ balls to the person $p_2$ and so on until we give the last $a_I$ balls to person $p_I$. An example with $n = 6$ and $a = (2,2,2)$ is drawn below.

Note that $\sigma \in S_a$ if and only if each person $p_i$ receives the balls with labels $\sum_{k=0}^{i-1}a_k+1,\sum_{k=0}^{i-1}a_k+2,\ldots, \sum_{k=0}^i a_k$ (where $a_0 = 0$) in some order. Thus, in the below example $\sigma_1 \in S_a$ but $\sigma_2 \notin S_a$.

It turns out the cosets $S_a \backslash S_n$ are exactly determined by “forgetting” the order in which each person $p_i$ received their balls and only remembering which balls they received. Thus, the two permutations below belong to the same coset in $S_a \backslash S_n$.

To see why this is true in general, consider two permutations $\sigma_1,\sigma_2$. The permutations $\sigma_1,\sigma_2$ result in each person $p_i$ receiving the same balls if and only if after $\sigma_1$ we can apply a permutation $\tau$ that fixes each subset $A_i = \left\{\sum_{k=0}^{i-1}a_k+1,\ldots, \sum_{k=0}^i a_k\right\}$ and get $\sigma_2$. That is, $\sigma_1$ and $\sigma_2$ result in each person $p_i$ receiving the same balls if and only if $\sigma_2 = \tau \circ \sigma_1$ for some $\tau \in S_a$. Thus, $\sigma_1,\sigma_2$ are the same after forgetting the order in which each person received their balls if and only if $S_a \sigma_1 = S_a \sigma_2$. This is illustrated below,

We can thus see why $S_a \backslash S_n \neq S_n / S_a$. A left coset $\sigma S_a$ corresponds to pre-composing $\sigma$ with elements of $S_a$, whereas a right coset $S_a\sigma$ corresponds to post-composing $\sigma$ with elements of $S_a$.

### Contingency tables

With the last two sections under our belts, describing the double cosets $S_a \backslash S_n / S_b$ is straightforward. We simply have to combine our two types of forgetting. That is, we first colour the $n$ balls with $J$ colours according to $b = (b_1,b_2,\ldots,b_J)$. We then draw the balls without replacement and give the balls to $I$ different people according to $a = (a_1,a_2,\ldots,a_I)$. We then forget both the original labels and the order in which each person received their balls. That is, we only remember the number of balls of each colour each person receives. Describing the double cosets by “double forgetting” is illustrated below with $a = (2,2,2)$ and $b = (3,2,1)$.

The proof that double forgetting does indeed describe the double cosets is simply a combination of the two arguments given above. After double forgetting, the number of balls given to each person can be recorded in an $I \times J$ table. The $N_{ij}$ entry of the table is simply the number of balls person $p_i$ receives of colour $c_j$. Two permutations are the same after double forgetting if and only if they produce the same table. For example, $\sigma_1$ and $\sigma_2$ above both produce the following table

By the definition of how the balls are coloured and distributed to each person we must have for all $1 \le i \le I$ and $1 \le j \le J$

$\sum_{j=1}^J N_{ij} = a_i$ and $\sum_{i=1}^I N_{ij} = b_j$

An $I \times J$ table with entries $N_{ij} \in \{0,1,2,\ldots\}$ satisfying the above conditions is called a contingency table. Given such a contingency table with entries $N_{ij}$ where the rows sum to $a$ and the columns sum to $b$, there always exists at least one permutation $\sigma$ such that $N_{ij}$ is the number of balls of colour $c_j$ received by person $p_i$. We have already seen that two permutations produce the same table if and only if they are in the same double coset. Thus, the double cosets $S_a \backslash S_n /S_b$ are in one-to-one correspondence with such contingency tables.
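This correspondence can be brute-force checked on a small example. The sketch below (the variable names and the margins $a = (2,2)$, $b = (2,1,1)$ are my own choices, following the colouring and drawing conventions above) computes the table produced by every permutation in $S_4$ and compares the number of distinct tables with a direct count of contingency tables:

```python
from itertools import permutations

a = (2, 2)     # row sums: persons p1 and p2 each receive 2 balls
b = (2, 1, 1)  # column sums: 2 balls of colour c1, 1 of c2, 1 of c3

# colour[ball] and person[draw position], following the conventions above
colour = [j for j, bj in enumerate(b) for _ in range(bj)]
person = [i for i, ai in enumerate(a) for _ in range(ai)]

def table(sigma):
    """Count how many balls of each colour each person receives under sigma."""
    N = [[0] * len(b) for _ in range(len(a))]
    for pos, ball in enumerate(sigma):
        N[person[pos]][colour[ball]] += 1
    return tuple(map(tuple, N))

# distinct tables over all of S_4 = number of double cosets
tables = {table(sigma) for sigma in permutations(range(sum(a)))}

# direct enumeration of contingency tables with these margins
count = 0
for n11 in range(a[0] + 1):
    for n12 in range(a[0] + 1):
        for n13 in range(a[0] + 1):
            row2 = (b[0] - n11, b[1] - n12, b[2] - n13)
            if (n11 + n12 + n13 == a[0]
                    and all(x >= 0 for x in row2) and sum(row2) == a[1]):
                count += 1

assert len(tables) == count  # both equal 4
```

Both counts come out to 4, matching the claimed bijection for these margins.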

### The hypergeometric distribution

I would like to end this blog post with a little bit of probability and relate the contingency tables above to the hypergeometric distribution. If $a = (m, n-m)$ for some $m \in \{0,1,\ldots,n\}$, then the contingency tables described above have two rows and are determined by the values $N_{11}, N_{12},\ldots, N_{1J}$ in the first row. The numbers $N_{1j}$ are the number of balls of colour $c_j$ the first person receives. Since the balls are drawn without replacement, this means that if we put the uniform distribution on $S_n$, then the vector $Y=(N_{11}, N_{12},\ldots, N_{1J})$ follows the multivariate hypergeometric distribution. Thus, if we have a random walk on $S_n$ that quickly converges to the uniform distribution on $S_n$, then we could use the double cosets to get a random walk that converges to the multivariate hypergeometric distribution (although there are smarter ways to do such sampling).
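This can be verified exactly on a tiny example. The sketch below (my own toy parameters: $n = 5$ balls, colours $b = (3,2)$, and the first person receiving $m = 2$ balls) enumerates all permutations and checks that $N_{11}$ follows the hypergeometric distribution:

```python
from itertools import permutations
from math import comb

n, m = 5, 2               # five balls; the first person receives m = 2 of them
colour = [0, 0, 0, 1, 1]  # b = (3, 2): three balls of colour 0, two of colour 1

# distribution of N_11 = number of colour-0 balls among the first m draws,
# under a uniformly random permutation
counts = {}
for sigma in permutations(range(n)):
    k = sum(1 for ball in sigma[:m] if colour[ball] == 0)
    counts[k] = counts.get(k, 0) + 1
total = sum(counts.values())   # 5! = 120

# compare with the hypergeometric pmf
for k in range(m + 1):
    empirical = counts[k] / total
    exact = comb(3, k) * comb(2, m - k) / comb(5, m)
    assert abs(empirical - exact) < 1e-12
```
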

## The field with one element in a probability seminar

Something very exciting happened this afternoon. Professor Persi Diaconis was presenting at the Stanford probability seminar and the field with one element made an appearance. The talk was about joint work with Mackenzie Simper and Arun Ram. They had developed a way of “collapsing” a random walk on a group to a random walk on the set of double cosets. As an illustrative example, Persi discussed a random walk on $GL_n(\mathbb{F}_q)$ given by multiplication by a random transvection (a map of the form $v \mapsto v + ab^Tv$, where $a^Tb = 0$).

The Bruhat decomposition can be used to match double cosets of $GL_n(\mathbb{F}_q)$ with elements of the symmetric group $S_n$. So by collapsing the random walk on $GL_n(\mathbb{F}_q)$ we get a random walk on $S_n$ for all prime powers $q$. As Professor Diaconis said, you can’t stop him from taking $q \to 1$ and asking what the resulting random walk on $S_n$ is. The answer? Multiplication by a random transposition. As pointed sets are vector spaces over the field with one element and the symmetric groups are the corresponding matrix groups, this all fits with what’s expected of the field with one element.

This was just one small part of a very enjoyable seminar. There was plenty of group theory, probability, some general theory and engaging examples.

Update: I have written another post about some of the group theory from the seminar! You can read it here: Double cosets and contingency tables.

## Working with ABC Radio’s API in R

This post is about using ABC Radio’s API to create a record of all the songs played on Triple J in a given time period. An API (application programming interface) is a connection between two computer programs. Many websites have APIs that allow users to access data from that website. ABC Radio has an API that lets users access a record of the songs played on the station Triple J since around May 2014. Below I’ll show how to access this information and how to transform it into a format that’s easier to work with. All code can be found on GitHub here.

### Packages

To access the API in R I used the packages “httr” and “jsonlite”. To transform the data from the API I used the packages “tidyverse” and “lubridate”.

library(httr)
library(jsonlite)
library(tidyverse)
library(lubridate)

The URL for ABC Radio’s API is:

https://music.abcradio.net.au/api/v1/plays/search.json

If you type this into a web browser, you can see a bunch of text which lists information about the ten most recently played songs on ABC’s different radio stations. To see only the songs played on Triple J, you can go to:

https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej

The extra text “station=triplej” is called a parameter. We can add more parameters such as “from” and “to”. The link:

https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&from=2021-01-16T00:00:00.000Z&to=2021-01-16T00:30:00.000Z

shows information about the songs played in the 30 minutes after midnight UTC on the 16th of January last year. The last parameter that we will use is “limit”, which specifies the number of songs to include. The link:

https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=1

includes information about the most recently played song. The default for limit is 10 and the largest possible value is 100. We’ll see later that this cut off at 100 makes downloading a lot of songs a little tricky. But first let’s see how we can access the information from ABC Radio’s API in R.

### Accessing an API in R

We now know how to use ABC Radio’s API to get information about the songs played on Triple J, but how do we use this information in R and how is the information stored? The function GET from the package “httr” enables us to access the API in R. The input to GET is simply the same sort of URL that we’ve been using to view the API online. The below code stores information about the 5 songs played on Triple J just before 5 am on the 5th of May 2015.

res <- GET("https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=5&to=2015-05-05T05:00:00.000Z")
res
# Date: 2022-03-02 23:25
# Status: 200
# Content-Type: application/json
# Size: 62.9 kB


The line “Status: 200” tells us that we have successfully grabbed some data from the API. There are many other status codes that mean different things but for now we’ll be happy remembering that “Status: 200” means everything has worked.

The information from the API is stored within the object res. To change it from a JSON file to a list we can use the function “fromJSON” from the library “jsonlite”. This is done below:

data <- fromJSON(rawToChar(res$content))
names(data)
# "total"  "offset" "items"

The information about the individual songs is stored under “items”. There is a lot of information about each song including song length, copyright information and a link to an image of the album cover. For now, we’ll just try to find the name of the song, the name of the artist and the time the song was played. Finding the song title and the played time is pretty straightforward:

data$items$recording$title
# "First Light" "Dog"         "Begin Again" "Rot"         "Back To You"
data$items$played_time
# "2015-05-05T04:55:56+00:00" "2015-05-05T04:52:21+00:00" "2015-05-05T04:47:52+00:00" "2015-05-05T04:41:00+00:00" "2015-05-05T04:38:36+00:00"

Finding the artist name is a bit trickier.

data$items$recording$artists
# [[1]]
#   entity       arid          name artwork
# 1 Artist maQeOWJQe1 Django Django    NULL
#   links
# 1 Link, mlD5AM2R5J, http://musicbrainz.org/artist/4bfce038-b1a0-4bc4-abe1-b679ab900f03, 4bfce038-b1a0-4bc4-abe1-b679ab900f03, MusicBrainz artist, NA, NA, NA, service, musicbrainz, TRUE
#   is_australian    type role
# 1            NA primary   NA
#
# [[2]]
#   entity       arid      name artwork
# 1 Artist maXE59XXz7 Andy Bull    NULL
#   links
# 1 Link, mlByoXMP5L, http://musicbrainz.org/artist/3a837db9-0602-4957-8340-05ae82bc39ef, 3a837db9-0602-4957-8340-05ae82bc39ef, MusicBrainz artist, NA, NA, NA, service, musicbrainz, TRUE
#   is_australian    type role
# 1            NA primary   NA
# ....

We can see that each song actually has a lot of information about each artist but we’re only interested in the artist name. By using “map_chr()” from the library “tidyverse” we can grab each song’s artist name.

map_chr(data$items$recording$artists, ~.$name[1])
# [1] "Django Django" "Andy Bull"     "Purity Ring"   "Northlane"     "Twerps"

Using name[1] means that if there are multiple artists, then we only select the first one. With all of these in place we can create a tibble with the information about these five songs.

list <- data$items
tb <- tibble(
  song_title = list$recording$title,
  artist = map_chr(list$recording$artists, ~.$name[1]),
  played_time = ymd_hms(list$played_time)
) %>%
  arrange(played_time)
tb
# # A tibble: 5 x 3
# song_title  artist        played_time
# <chr>       <chr>         <dttm>
# 1 Back To You Twerps        2015-05-05 04:38:36
# 2 Rot         Northlane     2015-05-05 04:41:00
# 3 Begin Again Purity Ring   2015-05-05 04:47:52
# 4 Dog         Andy Bull     2015-05-05 04:52:21
# 5 First Light Django Django 2015-05-05 04:55:56

The maximum number of songs we can access from the API at one time is 100. This means that if we want to download a lot of songs we’ll need to use some sort of loop. Below is some code which takes in two dates and times in UTC and fetches all the songs played between the two times. The idea is simply to grab all the songs played in a five hour interval and then move on to the next five hour interval. I found including an optional argument “progress” useful for debugging. This code is from the file get_songs.r on my GitHub.

# the base of the API query; the "from" and "to" values are appended below
base <- "https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=100&from="

download <- function(from, to){
  from_char <- format(from, "%Y-%m-%dT%H:%M:%S.000Z")
  to_char <- format(to, "%Y-%m-%dT%H:%M:%S.000Z")
  url <- paste0(base, from_char, "&to=", to_char)
  res <- GET(url)
  if (res$status == 200) {
    data <- fromJSON(rawToChar(res$content))
    list <- data$items
    tb <- tibble(
      song_title = list$recording$title,
      artist = map_chr(list$recording$artists, ~.$name[1]),
      played_time = ymd_hms(list$played_time)
    ) %>%
      arrange(played_time)
    return(tb)
  }
}

get_songs <- function(start, end, progress = FALSE){
  from <- ymd_hms(start)
  to <- from + dhours(5)
  end <- ymd_hms(end)

  songs <- NULL

  while (to < end) {
    if (progress) {
      print(from)
      print(to)
    }
    tb <- download(from, to)
    songs <- bind_rows(songs, tb)
    from <- to + dseconds(1)
    to <- to + dhours(5)
  }
  # fetch the final (possibly shorter) interval
  tb <- download(from, end)
  songs <- bind_rows(songs, tb)
  return(songs)
}

### An example

Using the functions defined above, we can download all the songs played in the year 2021 (measured in AEDT).

songs <- get_songs(ymd_hms("2020-12-31 13:00:01 UTC"),
                   ymd_hms("2021-12-31 12:59:59 UTC"),
                   progress = TRUE)

This code takes a little while to run so I have saved the output as a .csv file. There are lots of things I hope to do with this data such as looking at the popularity of different songs over time. For example, here’s a plot showing the cumulative plays of Genesis Owusu’s top six songs.

I’ve also looked at using the data from last year to get a list of Triple J’s most played artists. Unfortunately, it doesn’t line up with the official list. The issue is that I’m only recording the primary artist on each song and not any of the secondary artists. Hopefully I’ll address this in a later blog post! I would also like to do an analysis of how plays on Triple J correlate with song ranking in the Hottest 100.

## Maximum likelihood estimation and the method of moments

Maximum likelihood and the method of moments are two ways to estimate parameters from data. In general, the two methods can differ but for one-dimensional exponential families they produce the same estimates.

Suppose that $\{P_\theta\}_{\theta \in \Omega}$ is a one-dimensional exponential family written in canonical form. That is, $\Omega \subseteq \mathbb{R}$ and there exists a reference measure $\mu$ such that each distribution $P_\theta$ has a density $p_\theta$ with respect to $\mu$ and,

$p_\theta(x) = h(x)\exp\left(\theta T(x)-A(\theta)\right).$

The random variable $T(X)$ is a sufficient statistic for the model $X \sim P_\theta$. The function $A(\theta)$ is the log-partition function for the family $\{P_\theta\}_{\theta \in \Omega}$. The condition, $\int p_\theta(x)\mu(dx)=1$ implies that

$A(\theta) = \log\left(\int h(x)\exp(\theta T(x))\mu(dx) \right).$

It turns out that the function $A(\theta)$ is differentiable and that differentiation and integration can be interchanged. This implies that

$A'(\theta) = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\int h(x)\exp(\theta T(x))\mu(dx)} = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\exp(A(\theta))}$

Note that $\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx) = \int T(x)h(x)\exp(\theta T(x))\mu(dx)$. Thus,

$A'(\theta) = \int T(x)h(x) \exp(\theta T(x)-A(\theta))\mu(dx) = \int T(x)p_\theta(x)\mu(dx).$

This means that $A'(\theta) = \mathbb{E}_\theta[T(X)]$, the expectation of $T(X)$ under $X \sim P_\theta$.
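As a concrete example (a sketch using the Poisson family, for which $h(x) = 1/x!$, $T(x) = x$ and $A(\theta) = e^\theta$ with $\theta = \log\lambda$), we can check $A'(\theta) = \mathbb{E}_\theta[T(X)]$ numerically:

```python
import math

# Poisson family in canonical form: h(x) = 1/x!, T(x) = x, A(theta) = exp(theta)
theta = 0.7
lam = math.exp(theta)   # the usual Poisson mean parameter

# E_theta[T(X)] = sum over x of x * pmf(x); build the pmf recursively
expectation, pmf = 0.0, math.exp(-lam)
for x in range(1, 200):
    pmf *= lam / x      # pmf now equals exp(-lam) * lam^x / x!
    expectation += x * pmf

# agrees with A'(theta) = exp(theta)
assert abs(expectation - math.exp(theta)) < 1e-9
```
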

Now suppose that we have an i.i.d. sample $X_1,\ldots, X_n \sim P_\theta$ and we want to use this sample to estimate $\theta$. One way to estimate $\theta$ is by maximum likelihood. That is, we choose the value of $\theta$ that maximises the likelihood,

$L(\theta|X_1,\ldots,X_n) = \prod_{i=1}^n p_\theta(X_i).$

When using the maximum likelihood estimator, it is often easier to work with the log-likelihood. The log-likelihood is,

$\log L(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \log\left(p_\theta(X_i)\right) = \sum_{i=1}^n \left(\log(h(X_i))+\theta T(X_i) - A(\theta)\right).$

Maximising the likelihood is equivalent to maximising the log-likelihood. For exponential families, the log-likelihood is a concave function of $\theta$. Thus the maximisers can be found by differentiating and solving the first order equations. Note that,

$\frac{d}{d\theta} \log L(\theta|X_1,\ldots,X_n) =\sum_{i=1}^n \left(T(X_i)-A'(\theta)\right) = -nA'(\theta) + \sum_{i=1}^n T(X_i).$

Thus the maximum likelihood estimate (MLE) $\widehat{\theta}$ solves the equation,

$-nA'(\widehat{\theta}) + \sum_{i=1}^n T(X_i) = 0.$

But recall that $A'(\widehat{\theta}) = \mathbb{E}_{\widehat{\theta}}[T(X)]$. Thus the MLE is the solution to the equation,

$\mathbb{E}_{\widehat{\theta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(X_i)$.

Thus the MLE is the value of $\theta$ for which the expectation of $T(X)$ matches the empirical average from our sample. That is, the maximum likelihood estimator for an exponential family is a method of moments estimator. Specifically, the maximum likelihood estimator matches the moments of the sufficient statistic $T(X)$.
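To see this concretely, here is a numerical sketch (toy data; the Poisson family in canonical form, where $T(x) = x$ and $A(\theta) = e^\theta$): grid-maximising the log-likelihood recovers the $\theta$ whose mean $e^\theta$ matches the sample average of $T$:

```python
import math

data = [3, 1, 4, 1, 5, 9, 2, 6]   # a toy i.i.d. sample
mean = sum(data) / len(data)

# canonical Poisson log-likelihood, dropping the theta-free log h(x) terms
def loglik(theta):
    return sum(theta * x - math.exp(theta) for x in data)

# maximise over a fine grid of theta values
grid = [i / 10000 for i in range(-20000, 30000)]
theta_hat = max(grid, key=loglik)

# moment matching: E_theta_hat[T(X)] = exp(theta_hat) equals the sample mean
assert abs(math.exp(theta_hat) - mean) < 1e-2
```
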

## A counter example

It is a special property of exponential families that the MLE is a method of moments estimator for the sufficient statistic. When we leave the nice world of exponential families, the two estimators may differ.

Suppose that we have data $X_1,\ldots,X_n \sim P_\theta$ where $P_\theta$ is the uniform distribution on $[0,\theta]$. A minimal sufficient statistic for this model is $X_{(n)}$ – the maximum of $X_1,\ldots, X_n$. Given what we saw before, we might imagine that the MLE for this model would be a method of moments estimator for $X_{(n)}$ but this isn’t the case.

The likelihood for $X_1,\ldots,X_n$ is,

$L(\theta|X_1,\ldots,X_n) = \begin{cases} \frac{1}{\theta^n} & \text{if } X_{(n)} \le \theta,\\ 0 & \text{if } X_{(n)} > \theta. \end{cases}$

Thus the MLE is $\widehat{\theta} = X_{(n)}$. However, under $P_\theta$, $\frac{1}{\theta}X_{(n)}$ has a $\text{Beta}(n,1)$ distribution. Thus, $\mathbb{E}_\theta[X_{(n)}] = \theta \frac{n}{n+1}$ so the method of moments estimator would be $\widehat{\theta}' = \frac{n+1}{n}X_{(n)} \neq X_{(n)} = \widehat{\theta}$.
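A quick simulation (a sketch with the arbitrary choices $\theta = 2$ and $n = 10$) illustrates the difference between the two estimators:

```python
import random

random.seed(0)
theta, n, reps = 2.0, 10, 100_000   # true parameter, sample size, replications

# average of the MLE X_(n) over many simulated samples
mle_avg = 0.0
for _ in range(reps):
    sample_max = max(random.uniform(0, theta) for _ in range(n))
    mle_avg += sample_max / reps

# E[X_(n)] = theta * n/(n+1), so the MLE is biased downwards
assert abs(mle_avg - theta * n / (n + 1)) < 0.01

# the method of moments estimator (n+1)/n * X_(n) corrects the bias
assert abs((n + 1) / n * mle_avg - theta) < 0.01
```
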

## A not so simple conditional expectation

It is winter 2022 and my PhD cohort has moved on to the second quarter of our first year statistics courses. This means we’ll be learning about generalised linear models in our applied course, asymptotic statistics in our theory course and conditional expectations and martingales in our probability course.

In the first week of our probability course we’ve been busy defining and proving the existence of the conditional expectation. Our approach has been similar to how we constructed the Lebesgue integral in the previous course. Last quarter, we first defined the Lebesgue integral for simple functions, then we used a limiting argument to define the Lebesgue integral for non-negative functions and then finally we defined the Lebesgue integral for arbitrary functions by considering their positive and negative parts.

Our approach to the conditional expectation has been similar but the journey has been different. We again started with simple random variables, then progressed to non-negative random variables and then proved the existence of the conditional expectation of any arbitrary integrable random variable. Unlike the Lebesgue integral, the hardest step was proving the existence of the conditional expectation of a simple random variable. Progressing from simple random variables to arbitrary random variables was a straightforward application of the monotone convergence theorem and linearity of expectation. But to prove the existence of the conditional expectation of a simple random variable we needed to work with projections in the Hilbert space $L^2(\Omega, \mathbb{P},\mathcal{F})$.

Unlike the Lebesgue integral, defining the conditional expectation of a simple random variable is not straightforward. One reason for this is that the conditional expectation of a simple random variable need not itself be a simple random variable. This comment was made offhand by our Professor and sparked my curiosity. The following example is what I came up with. Below I first go over some definitions and then we dive into the example.

## A simple random variable with a conditional expectation that is not simple

Let $(\Omega, \mathbb{P}, \mathcal{F})$ be a probability space and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. The conditional expectation of an integrable random variable $X$ is a random variable $\mathbb{E}(X|\mathcal{G})$ that satisfies the following two conditions:

1. The random variable $\mathbb{E}(X|\mathcal{G})$ is $\mathcal{G}$-measurable.
2. For all $B \in \mathcal{G}$, $\mathbb{E}[X1_B] = \mathbb{E}[\mathbb{E}(X|\mathcal{G})1_B]$, where $1_B$ is the indicator function of $B$.

The conditional expectation of an integrable random variable is unique and always exists. One can think of $\mathbb{E}(X|\mathcal{G})$ as the expected value of $X$ given the information in $\mathcal{G}$.

A simple random variable is a random variable $X$ that takes only finitely many values. Simple random variables are always integrable and so $\mathbb{E}(X|\mathcal{G})$ always exists but we will see that $\mathbb{E}(X|\mathcal{G})$ need not be simple.

Consider a random vector $(U,V)$ uniformly distributed on the square $[-1,1]^2 \subseteq \mathbb{R}^2$. Let $D$ be the unit disc $D = \{(u,v) \in \mathbb{R}^2:u^2+v^2 \le 1\}$. The random variable $X = 1_D(U,V)$ is a simple random variable since $X$ equals $1$ if $(U,V) \in D$ and $X$ equals $0$ otherwise. Let $\mathcal{G} = \sigma(U)$ be the $\sigma$-algebra generated by $U$. It turns out that

$\mathbb{E}(X|\mathcal{G}) = \sqrt{1-U^2}$.

Thus $\mathbb{E}(X|\mathcal{G})$ is not a simple random variable. Let $Y = \sqrt{1-U^2}$. Since $Y$ is a continuous function of $U$, the random variable $Y$ is $\mathcal{G}$-measurable. Thus $Y$ satisfies condition 1. Furthermore, if $B \in \mathcal{G}$, then $B = \{U \in A\}$ for some measurable set $A\subseteq [-1,1]$. Thus $X1_B$ equals $1$ if and only if $U \in A$ and $V \in [-\sqrt{1-U^2}, \sqrt{1-U^2}]$. Since $(U,V)$ is uniformly distributed we thus have

$\mathbb{E}[X1_B] = \int_A \int_{-\sqrt{1-u^2}}^{\sqrt{1-u^2}} \frac{1}{4}dvdu = \int_A \frac{1}{2}\sqrt{1-u^2}du$.

The random variable $U$ is uniformly distributed on $[-1,1]$ and thus has density $\frac{1}{2}1_{[-1,1]}$. Therefore,

$\mathbb{E}[Y1_B] = \mathbb{E}[\sqrt{1-U^2}1_{\{U \in A\}}] = \int_A \frac{1}{2}\sqrt{1-u^2}du$.

Thus $\mathbb{E}[X1_B] = \mathbb{E}[Y1_B]$ and therefore $Y = \sqrt{1-U^2}$ equals $\mathbb{E}(X|\mathcal{G})$. Intuitively we can see this because given $U=u$, we know that $X$ is $1$ when $V \in [-\sqrt{1-u^2},\sqrt{1-u^2}]$ and that $X$ is $0$ otherwise. Since $V$ is uniformly distributed on $[-1,1]$, the probability that $V$ is in $[-\sqrt{1-u^2},\sqrt{1-u^2}]$ is $\sqrt{1-u^2}$. Thus given $U=u$, the expected value of $X$ is $\sqrt{1-u^2}$.
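The defining identity $\mathbb{E}[X1_B] = \mathbb{E}[Y1_B]$ can also be sanity-checked numerically. The sketch below picks the arbitrary set $A = [0, 1/2]$ and approximates both sides with a simple midpoint rule:

```python
import math

# midpoint rule grid on a 1-D interval
def midpoints(lo, hi, steps):
    h = (hi - lo) / steps
    return [lo + (i + 0.5) * h for i in range(steps)], h

us, du = midpoints(0.0, 0.5, 500)     # A = [0, 1/2]
vs, dv = midpoints(-1.0, 1.0, 2000)

# E[X 1_B]: integrate the indicator of the disc over A x [-1, 1] (joint density 1/4)
lhs = sum(0.25 * du * dv for u in us for v in vs if u * u + v * v <= 1)

# E[Y 1_B]: integrate sqrt(1 - u^2) against the density 1/2 of U over A
rhs = sum(0.5 * math.sqrt(1 - u * u) * du for u in us)

assert abs(lhs - rhs) < 1e-3   # the two sides agree, as condition 2 requires
```
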

## An extension

The previous example suggests an extension that shows just how “complicated” the conditional expectation of a simple random variable can be. I’ll state the extension as an exercise:

Let $f:[-1,1]\to \mathbb{R}$ be any continuous function with $f(x) \in [0,1]$. With $(U,V)$ and $\mathcal{G}$ as above show that there exists a measurable set $A \subseteq [-1,1]^2$ such that $\mathbb{E}(1_A|\mathcal{G}) = f(U)$.

## Sports climbing at the 2020 Olympics

Last year sports climbing made its Olympic debut. My partner Claire is an avid climber and we watched some of the competition together. The method of ranking the competitors had some surprising consequences which are the subject of this blog.

In the finals, the climbers had to compete in three disciplines – speed climbing, bouldering and lead climbing. Each climber was given a rank for each discipline. For speed climbing, the rank was given based on how quickly the climbers scaled a set route. For bouldering and lead climbing, the rank was based on how far they climbed on routes none of them had seen before.

To get a final score, the ranks from the three disciplines are multiplied together. The best climber is the one with the lowest final score. For example, if a climber was 5th in speed climbing, 2nd in bouldering and 3rd in lead climbing, then they would have a final score of $5\times 2 \times 3 = 30$. A climber who came 8th in speed, 1st in bouldering and 2nd in lead climbing would have a lower (and thus better) final score of $8 \times 1 \times 2 = 16$.
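The scoring rule itself is a one-liner; here is a minimal sketch (the function name is made up for illustration) reproducing the two examples:

```python
from math import prod

# final score = product of the three discipline ranks (lower is better);
# the function name is mine, not part of the official rules
def final_score(speed, boulder, lead):
    return prod((speed, boulder, lead))

assert final_score(5, 2, 3) == 30   # the first climber in the example
assert final_score(8, 1, 2) == 16   # the second climber scores lower, so ranks higher
```
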

One thing that makes this scoring system interesting is that your final score is very dependent on how well your competitors perform. This was most evident during Jakob Schubert’s lead climb in the men’s final.

Lead climbing is the final discipline and Schubert was the last climber. This meant that ranks for speed climbing and bouldering were already decided. However, the final rankings of the climbers fluctuated hugely as Schubert scaled the 15 metre lead wall. This was because if a climber had done really well in the boulder and speed rounds, being overtaken by Schubert didn’t increase their final score by that much. However, if a climber did poorly in the previous rounds, then being overtaken by Schubert meant their score increased by a lot and they plummeted down the rankings. This can be visualised in the following plot:

Along the x-axis we have how far up the lead wall Schubert climbed (measured by the “hold” he had reached). On the y-axis we have the finalists’ ranking at different stages of Schubert’s climb. We can see that as Schubert climbed and overtook the other climbers the whole ranking fluctuated. Here is a similar plot which also shows when Schubert overtook each competitor on the lead wall:

The volatility of the rankings can really be seen by following Adam Ondra’s ranking (in purple). When Schubert started his climb, Ondra was ranked second. After Schubert passed Alberto Ginés López, Ondra was ranked first. But then Schubert passed Ondra; Ondra finished the event in 6th place and Ginés López came first. If you go to the 4 hour 27 minute mark here you can watch Schubert’s climb and hear the commentator explain how both Ginés López and Ondra are in the running to win gold.
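
The arithmetic behind these swings can be sketched in a few lines of Python (the ranks here are invented for illustration, not the real Tokyo results): when a climber is overtaken, their lead rank slips from $l$ to $l+1$, so their final score rises by exactly the product of their speed and boulder ranks.

```python
# Invented ranks to illustrate the point: slipping one place in the
# lead ranking raises a climber's final score by speed_rank * boulder_rank.
def score(speed, boulder, lead):
    return speed * boulder * lead

strong = {"speed": 2, "boulder": 1}  # did well in the earlier rounds
weak = {"speed": 7, "boulder": 6}    # did poorly in the earlier rounds

for ranks in (strong, weak):
    before = score(**ranks, lead=3)
    after = score(**ranks, lead=4)   # overtaken: lead rank slips by one
    print(after - before)            # prints 2, then 42
```

So a strong speed and boulder record makes being overtaken almost painless, while a weak one makes it costly, which is why the rankings fluctuated so much during Schubert’s climb.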

Similar things happened in the women’s final, where Janja Garnbret was the last lead climber. Here is the same plot, showing the climbers’ final rankings and lead ranking.

Garnbret was a favourite at the Olympics and had come fifth in the speed climbing and first in bouldering. This meant that as long as she didn’t come last in the lead climbing she would take home at least silver, and otherwise she would get the gold. Garnbret ended up coming first in the lead climbing and we can see that as she overtook the last few climbers, their rankings fluctuated wildly.

Here is one more plot which shows each competitor’s final score at different points of Garnbret’s climb.

In the plot you can really see that, depending on how they performed in the previous two events, each climber’s score changed by a different amount once they were overtaken. It also shows just how far ahead of the competition Garnbret was – especially when you compare the men’s and women’s finals. Here is the same plot for the men. You can see that the men’s final scores were a lot closer together.

Before I end, I would like to make one comment about the culture of sport climbing. In this post I wanted to highlight how tumultuous and complicated the sport climbing rankings were, but if you watched the athletes you’d have no idea the stakes were so high. The climbers celebrated their personal bests as if no one was watching, traded ideas on how to tackle the lead wall and, the day after the final, returned to the bouldering wall to try the routes together. Sport climbing is such a friendly and welcoming sport and I would hate for my analysis to give anyone the wrong idea.