Double cosets and contingency tables

Like my previous post, this post is also motivated by a comment made by Professor Persi Diaconis in his recent Stanford probability seminar. The seminar was about a way of “collapsing” a random walk on a group to a random walk on the set of double cosets. In this post, I’ll first define double cosets and then go over the example Professor Diaconis used to make us probabilists and statisticians more comfortable with all the group theory he was discussing.

Double cosets

Let G be a group and let H and K be two subgroups of G. For each g \in G, the (H,K)-double coset containing g is defined to be the set

HgK = \{hgk : h \in H, k \in K\}

To simplify notation, we will simply write double coset instead of (H,K)-double coset. The double coset of g can also be defined as the equivalence class of g under the relation

g \sim g' \Longleftrightarrow g' = hgk for some h \in H and k \in K

Like regular cosets, the above relation is indeed an equivalence relation. Thus, the group G can be written as a disjoint union of double cosets. The set of all double cosets of G is denoted by H\backslash G / K. That is,

H\backslash G /K = \{HgK : g \in G\}

Note that if we take H = \{e\}, the trivial subgroup, then the double cosets are simply the left cosets of K, G / K. Likewise if K = \{e\}, then the double cosets are the right cosets of H, H \backslash G. Thus, double cosets generalise both left and right cosets.

Double cosets in S_n

Fix a natural number n > 0. A partition of n is a finite sequence a = (a_1,a_2,\ldots, a_I) such that a_1,a_2,\ldots, a_I \in \mathbb{N}, a_1 \ge a_2 \ge \ldots \ge a_I > 0 and \sum_{i=1}^I a_i = n. For each partition a of n, we can form a subgroup S_a of the symmetric group S_n. The subgroup S_a contains all permutations \sigma \in S_n that fix the sets A_1 = \{1,\ldots, a_1\}, A_2 = \{a_1+1,\ldots, a_1 + a_2\},\ldots, A_I = \{n - a_I +1,\ldots, n\}, meaning that \sigma(A_i) = A_i for all 1 \le i \le I. Thus, a permutation \sigma \in S_a must individually permute the elements of A_1, the elements of A_2 and so on. This means that, in a natural way,

S_a \cong S_{a_1} \times S_{a_2} \times \ldots \times S_{a_I}

If we have two partitions a = (a_1,a_2,\ldots, a_I) and b = (b_1,b_2,\ldots,b_J), then we can form two subgroups H = S_a and K = S_b and consider the double cosets H \backslash G / K = S_a \backslash S_n / S_b. The claim made in the seminar was that the double cosets are in one-to-one correspondence with I \times J contingency tables with row sums equal to a and column sums equal to b. Before we explain this correspondence and properly define contingency tables, let’s first consider the cases when either H or K is the trivial subgroup.

Left cosets in S_n

Note that if a = (1,1,\ldots,1), then S_a is the trivial subgroup and, as noted above, S_a \backslash S_n / S_b is simply equal to S_n / S_b. We will see that the cosets in S_n/S_b can be described by forgetting something about the permutations in S_n.

We can think of the permutations in S_n as all the ways of drawing n balls labelled 1,2,\ldots, n without replacement. We can think of the partition b = (b_1,b_2,\ldots,b_J) as a colouring of the n balls with J colours. We colour balls 1,2,\ldots, b_1 with the first colour c_1, then we colour b_1+1,b_1+2,\ldots, b_1+b_2 with the second colour c_2 and so on until we colour n-b_J+1, n-b_J+2,\ldots, n with the final colour c_J. Below is an example when n is equal to 6 and b = (3,2,1).

The first three balls are coloured green, the next two are coloured red and the last ball is coloured blue.

Note that a permutation \sigma \in S_n is in S_b if and only if we draw the balls by colour groups, i.e. we first draw all the balls with colour c_1, then we draw all the balls with colour c_2 and so on. Thus, continuing with the previous example, the permutation \sigma_1 below is in S_b but \sigma_2 is not in S_b.

The permutation \sigma_1 is in S_b because the colours are in their original order but \sigma_2 is not in S_b because the colours are rearranged.

It turns out that we can think of the cosets in S_n / S_b as what happens when we “forget” the labels 1,2,\ldots,n and only remember the colours of the balls. By “forgetting” the labels we mean only paying attention to the list of colours. That is, for all \sigma_1,\sigma_2 \in S_n, \sigma_1 S_b = \sigma_2 S_b if and only if the list of colours from the draw \sigma_1 is the same as the list of colours from the draw \sigma_2. Thus, the below two permutations define the same coset of S_b

When we forget the labels and only remember the colours, the permutations \sigma_1 and \sigma_2 look the same and thus are in the same left coset of S_b.

To see why this is true, note that \sigma_1 S_b = \sigma_2 S_b if and only if \sigma_2 = \sigma_1 \circ \tau for some \tau \in S_b. Furthermore, \tau \in S_b if and only if \tau maps each colour group to itself. Recall that function composition is read right to left. Thus, the equation \sigma_2 = \sigma_1 \circ \tau means that if we first relabel the balls according to \tau and then draw the balls according to \sigma_1, then we get the same result as just drawing by \sigma_2. That is, \sigma_2 = \sigma_1 \circ \tau for some \tau \in S_b if and only if drawing by \sigma_2 is the same as first relabelling the balls within each colour group and then drawing the balls according to \sigma_1. Thus, \sigma_1 S_b = \sigma_2 S_b if and only if, when we forget the labels of the balls and only look at the colours, \sigma_1 and \sigma_2 give the same list of colours. This is illustrated with our running example below.

If we permute the balls according to \tau \in S_b and then draw according to \sigma_1, then the resulting draw is the same as if we had not permuted and drawn according to \sigma_2. That is, \sigma_2 = \sigma_1 \circ \tau.

Right cosets of S_n

Typically, the subgroup S_a is not a normal subgroup of S_n. This means that the right coset S_a \sigma need not equal the left coset \sigma S_a. Thus, colouring the balls and forgetting the labelling won’t describe the right cosets S_a \backslash S_n. We’ll see that a different type of forgetting can be used to describe S_a \backslash S_n = \{S_a\sigma : \sigma \in S_n\}.

Fix a partition a = (a_1,a_2,\ldots,a_I) and now, instead of considering I colours, think of I different people p_1,p_2,\ldots,p_I. As before, a permutation \sigma \in S_n can be thought of as drawing n balls labelled 1,\ldots,n without replacement. We can imagine giving the first a_1 balls drawn to person p_1, then giving the next a_2 balls to person p_2 and so on until we give the last a_I balls to person p_I. An example with n = 6 and a = (2,2,2) is drawn below.

Person p_1 receives the ball labelled by 6 followed by the ball labelled 3, person p_2 receives ball 2 and then ball 1 and finally person p_3 receives ball 4 followed by ball 5.

Note that \sigma \in S_a if and only if person p_i receives the balls with labels \sum_{k=1}^{i-1}a_k+1,\sum_{k=1}^{i-1}a_k+2,\ldots, \sum_{k=1}^i a_k in any order. Thus, in the below example \sigma_1 \in S_a but \sigma_2 \notin S_a.

When the balls are drawn according to \sigma_1, person p_i receives the balls with labels 2i-1 and 2i, and thus \sigma_1 \in S_a. On the other hand, if the balls are drawn according to \sigma_2, the people receive different balls and thus \sigma_2 \notin S_a.

It turns out that the cosets in S_a \backslash S_n are exactly determined by “forgetting” the order in which each person p_i received their balls and only remembering which balls they received. Thus, the two permutations below belong to the same coset in S_a \backslash S_n.

When we forget the order in which each person receives their balls, the permutations \sigma_1 and \sigma_2 become the same and thus S_a \sigma_1 = S_a \sigma_2. Note that if we coloured the balls according to the partition a = (a_1,\ldots,a_I), then we could see that \sigma_1 S_a \neq \sigma_2 S_a.

To see why this is true in general, consider two permutations \sigma_1,\sigma_2. The permutations \sigma_1,\sigma_2 result in each person p_i receiving the same balls if and only if after \sigma_1 we can apply a permutation \tau that fixes each subset A_i = \left\{\sum_{k=1}^{i-1}a_k+1,\ldots, \sum_{k=1}^i a_k\right\} and get \sigma_2. That is, \sigma_1 and \sigma_2 result in each person p_i receiving the same balls if and only if \sigma_2 = \tau \circ \sigma_1 for some \tau \in S_a. Thus, \sigma_1,\sigma_2 are the same after forgetting the order in which each person received their balls if and only if S_a \sigma_1 = S_a \sigma_2. This is illustrated below.

If we first draw the balls according to \sigma_1 and then permute the balls according to \tau, then the resulting draw is the same as if we had drawn according to \sigma_2 and not permuted afterwards. That is, \sigma_2 = \tau \circ \sigma_1 .

We can thus see why S_a \backslash S_n \neq S_n / S_a. A left coset \sigma S_a corresponds to pre-composing \sigma with elements of S_a and a right coset S_a\sigma corresponds to post-composing \sigma with elements of S_a.

Contingency tables

With the last two sections under our belts, describing the double cosets S_a \backslash S_n / S_b is straightforward. We simply have to combine our two types of forgetting. That is, we first colour the n balls with J colours according to b = (b_1,b_2,\ldots,b_J). We then draw the balls without replacement and give the balls to I different people according to a = (a_1,a_2,\ldots,a_I). We then forget both the original labels and the order in which each person received their balls. That is, we only remember the number of balls of each colour each person receives. Describing the double cosets by “double forgetting” is illustrated below with a = (2,2,2) and b = (3,2,1).

The permutations \sigma_1 and \sigma_2 both result in person p_1 receiving one green ball and one blue ball. The two permutations also result in p_2 and p_3 both receiving one green ball and one red ball. Thus, \sigma_1 and \sigma_2 are both in the same (S_a,S_b)-double coset. Note however that \sigma_1 S_b \neq \sigma_2 S_b and S_a\sigma_1 \neq S_a \sigma_2.

The proof that double forgetting does indeed describe the double cosets is simply a combination of the two arguments given above. After double forgetting, the number of balls given to each person can be recorded in an I \times J table. The N_{ij} entry of the table is simply the number of balls person p_i receives of colour c_j. Two permutations are the same after double forgetting if and only if they produce the same table. For example, \sigma_1 and \sigma_2 above both produce the following table

             Green (c_1)   Red (c_2)   Blue (c_3)   Total
Person p_1        1            0            1          2
Person p_2        1            1            0          2
Person p_3        1            1            0          2
Total             3            2            1          6

By the definition of how the balls are coloured and distributed to each person we must have for all 1 \le i \le I and 1 \le j \le J

\sum_{j=1}^J N_{ij} = a_i and \sum_{i=1}^I N_{ij} = b_j

An I \times J table with entries N_{ij} \in \{0,1,2,\ldots\} satisfying the above conditions is called a contingency table. Given such a contingency table with entries N_{ij} where the rows sum to a and the columns sum to b, there always exists at least one permutation \sigma such that N_{ij} is the number of balls of colour c_j received by person p_i. We have already seen that two permutations produce the same table if and only if they are in the same double coset. Thus, the double cosets S_a \backslash S_n /S_b are in one-to-one correspondence with such contingency tables.
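
To make the correspondence concrete, here is a small brute-force check in R. This is just a rough sketch of mine rather than anything from the seminar, and the helper functions perms_of, subgroup, double_coset and table_of are purely illustrative names. For n = 4 and a = b = (2,2), the code counts the double cosets directly (as sets of compositions h \circ \sigma \circ k) and separately counts the distinct contingency tables; both counts come out to three.

n <- 4
a <- c(2, 2)
b <- c(2, 2)

# All permutations of 1:n, one per row.
perms_of <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v),
                        function(i) cbind(v[i], perms_of(v[-i]))))
}
all_perms <- perms_of(1:n)

# block_of(part) records which block A_i each of 1,...,n belongs to.
block_of <- function(part) rep(seq_along(part), times = part)

# The subgroup S_part consists of the permutations fixing each block A_i.
subgroup <- function(part) {
  blocks <- block_of(part)
  keep <- apply(all_perms, 1, function(p) all(blocks[p] == blocks))
  all_perms[keep, , drop = FALSE]
}
S_a <- subgroup(a)
S_b <- subgroup(b)

# The double coset of sigma is the set of all compositions h o sigma o k.
double_coset <- function(sigma) {
  elements <- character(0)
  for (i in seq_len(nrow(S_a))) {
    for (j in seq_len(nrow(S_b))) {
      h <- S_a[i, ]
      k <- S_b[j, ]
      elements <- c(elements, paste(h[sigma[k]], collapse = " "))
    }
  }
  sort(unique(elements))
}

# The contingency table of sigma: entry (i, j) counts the balls of colour
# c_j given to person p_i, flattened to a string so tables can be compared.
table_of <- function(sigma) {
  person <- factor(block_of(a)[sigma], levels = seq_along(a))
  colour <- factor(block_of(b), levels = seq_along(b))
  paste(table(person, colour), collapse = " ")
}

sigmas <- lapply(seq_len(nrow(all_perms)), function(i) all_perms[i, ])
length(unique(lapply(sigmas, double_coset)))  # 3 double cosets
length(unique(sapply(sigmas, table_of)))      # 3 contingency tables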

The hypergeometric distribution

I would like to end this blog post with a little bit of probability and relate the contingency tables above to the hypergeometric distribution. If a = (m, n-m) for some m \in \{0,1,\ldots,n\}, then the contingency tables described above have two rows and are determined by the values N_{11}, N_{12},\ldots, N_{1J} in the first row. The numbers N_{1j} are the number of balls of colour c_j the first person receives. Since the balls are drawn without replacement, this means that if we put the uniform distribution on S_n, then the vector Y=(N_{11}, N_{12},\ldots, N_{1J}) follows the multivariate hypergeometric distribution. Thus, if we have a random walk on S_n that quickly converges to the uniform distribution on S_n, then we could use the double cosets to get a random walk that converges to the multivariate hypergeometric distribution (although there are smarter ways to do such sampling).
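
Here is a quick simulation sketch of this collapsing (the values n = 6, b = (3,2,1) and m = 2 are chosen purely for illustration): draw uniformly random permutations, record how many balls of each colour the first person receives, and compare the empirical means with the means m b_j / n of the multivariate hypergeometric distribution.

set.seed(1)
n <- 6
b <- c(3, 2, 1)                          # colour group sizes (the column sums)
m <- 2                                   # a = (m, n - m): person p_1 gets m balls
colour <- rep(seq_along(b), times = b)   # colour of each labelled ball

counts <- replicate(10000, {
  draw <- sample(n)                      # a uniformly random draw of the balls
  table(factor(colour[draw[1:m]], levels = seq_along(b)))
})

rowMeans(counts)  # empirical means of (N_11, N_12, N_13)
m * b / n         # means of the multivariate hypergeometric distribution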

The field with one element in a probability seminar

Something very exciting happened this afternoon. Professor Persi Diaconis was presenting at the Stanford probability seminar and the field with one element made an appearance. The talk was about joint work with Mackenzie Simper and Arun Ram. They had developed a way of “collapsing” a random walk on a group to a random walk on the set of double cosets. As an illustrative example, Persi discussed a random walk on GL_n(\mathbb{F}_q) given by multiplication by a random transvection (a map of the form v \mapsto v + ab^Tv, where a^Tb = 0).

The Bruhat decomposition can be used to match double cosets of GL_n(\mathbb{F}_q) with elements of the symmetric group S_n. So by collapsing the random walk on GL_n(\mathbb{F}_q) we get a random walk on S_n for all prime powers q. As Professor Diaconis said, you can’t stop him from taking q \to 1 and asking what the resulting random walk on S_n is. The answer? Multiplication by a random transposition. As pointed sets are vector spaces over the field with one element and the symmetric groups are the matrix groups, this all fits with what’s expected of the field with one element.

This was just one small part of a very enjoyable seminar. There was plenty of group theory, probability, some general theory and engaging examples.

Update: I have written another post about some of the group theory from the seminar! You can read it here: Double cosets and contingency tables.

Working with ABC Radio’s API in R

This post is about using ABC Radio’s API to create a record of all the songs played on Triple J in a given time period. An API (application programming interface) is a connection between two computer programs. Many websites have APIs that allow users to access data from that website. ABC Radio has an API that lets users access a record of the songs played on the station Triple J since around May 2014. Below I’ll show how to access this information and how to transform it into a format that’s easier to work with. All code can be found on GitHub here.

Packages

To access the API in R I used the packages “httr” and “jsonlite”. To transform the data from the API I used the packages “tidyverse” and “lubridate”.

library(httr)
library(jsonlite)
library(tidyverse)
library(lubridate)

ABC Radio’s API

The URL for the ABC radio’s API is:

https://music.abcradio.net.au/api/v1/plays/search.json.

If you type this into a web browser, you can see a bunch of text which actually lists information about the ten most recently played songs on ABC’s different radio stations. To see only the songs played on Triple J, you can go to:

https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej.

The extra text “station=triplej” is called a parameter. We can add more parameters such as “from” and “to”. The link:

https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&from=2021-01-16T00:00:00.000Z&to=2021-01-16T00:30:00.000Z

shows information about the songs played in the 30 minutes after midnight UTC on the 16th of January last year. The last parameter that we will use is limit which specifies the number of songs to include. The link:

https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=1

includes information about the most recently played song. The default for limit is 10 and the largest possible value is 100. We’ll see later that this cut-off at 100 makes downloading a lot of songs a little tricky. But first let’s see how we can access the information from ABC Radio’s API in R.

Accessing an API in R

We now know how to use ABC Radio’s API to get information about the songs played on Triple J, but how do we use this information in R and how is it stored? The function GET from the package “httr” enables us to access the API in R. The input to GET is simply the same sort of URL that we’ve been using to view the API online. The below code stores information about the 5 songs played on Triple J just before 5 am on the 5th of May 2015:

res <- GET("https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=5&to=2015-05-05T05:00:00.000Z")
res
# Response [https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=5&to=2015-05-05T05:00:00.000Z]
# Date: 2022-03-02 23:25
# Status: 200
# Content-Type: application/json
# Size: 62.9 kB

The line “Status: 200” tells us that we have successfully grabbed some data from the API. There are many other status codes that mean different things but for now we’ll be happy remembering that “Status: 200” means everything has worked.
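
As an aside, “httr” also has helper functions for working with status codes, so we don’t have to read them off the printed response:

status_code(res)      # extracts the numeric status code (200 for the request above)
stop_for_status(res)  # throws an error if the request failed (status 400 or above)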

The information from the API is stored within the object res. To change it from a JSON file to a list we can use the function “fromJSON” from the library jsonlite. This is done below

data <- fromJSON(rawToChar(res$content))
names(data)
# "total"  "offset" "items"

The information about the individual songs is stored under items. There is a lot of information about each song including song length, copyright information and a link to an image of the album cover. For now, we’ll just try to find the name of the song, the name of the artist and the time the song played. We can find each of these under “items”. Finding the song title and the played time is pretty straightforward:

data$items$recording$title
# "First Light" "Dog"         "Begin Again" "Rot"         "Back To You"
data$items$played_time
# "2015-05-05T04:55:56+00:00" "2015-05-05T04:52:21+00:00" "2015-05-05T04:47:52+00:00" "2015-05-05T04:41:00+00:00" "2015-05-05T04:38:36+00:00"

Finding the artist name is a bit trickier.

data$items$recording$artists
# [[1]]
# entity       arid          name artwork
# 1 Artist maQeOWJQe1 Django Django    NULL
# links
# 1 Link, mlD5AM2R5J, http://musicbrainz.org/artist/4bfce038-b1a0-4bc4-abe1-b679ab900f03, 4bfce038-b1a0-4bc4-abe1-b679ab900f03, MusicBrainz artist, NA, NA, NA, service, musicbrainz, TRUE
# is_australian    type role
# 1            NA primary   NA
# 
# [[2]]
# entity       arid      name artwork
# 1 Artist maXE59XXz7 Andy Bull    NULL
# links
# 1 Link, mlByoXMP5L, http://musicbrainz.org/artist/3a837db9-0602-4957-8340-05ae82bc39ef, 3a837db9-0602-4957-8340-05ae82bc39ef, MusicBrainz artist, NA, NA, NA, service, musicbrainz, TRUE
# is_australian    type role
# 1            NA primary   NA
# ....
# ....

We can see that each song actually has a lot of information about each artist but we’re only interested in the artist name. By using “map_chr()” from the package “purrr” (loaded as part of the “tidyverse”) we can grab each song’s artist name.

map_chr(data$items$recording$artists, ~.$name[1])
# [1] "Django Django" "Andy Bull"     "Purity Ring"   "Northlane"     "Twerps"

Using name[1] means that if there are multiple artists, then we only select the first one. With all of these in place we can create a tibble with the information about these five songs.

list <- data$items
tb <- tibble(
     song_title = list$recording$title,
     artist = map_chr(list$recording$artists, ~.$name[1]),
     played_time = ymd_hms(list$played_time)
) %>% 
  arrange(played_time)
tb
# # A tibble: 5 x 3
# song_title  artist        played_time        
# <chr>       <chr>         <dttm>             
# 1 Back To You Twerps        2015-05-05 04:38:36
# 2 Rot         Northlane     2015-05-05 04:41:00
# 3 Begin Again Purity Ring   2015-05-05 04:47:52
# 4 Dog         Andy Bull     2015-05-05 04:52:21
# 5 First Light Django Django 2015-05-05 04:55:56

Downloading lots of songs

The maximum number of songs we can access from the API at one time is 100. This means that if we want to download a lot of songs we’ll need to use some sort of loop. Below is some code which takes in two dates and times in UTC and fetches all the songs played between the two times. The idea is simply to grab all the songs played in a five-hour interval and then move on to the next five-hour interval. I found including an optional argument “progress” useful for debugging. This code is from the file get_songs.r on my GitHub.

download <- function(from, to){
  base <- "https://music.abcradio.net.au/api/v1/plays/search.json?limit=100&offset=0&page=0&station=triplej&from="
  from_char <- format(from, "%Y-%m-%dT%H:%M:%S.000Z")
  to_char <- format(to, "%Y-%m-%dT%H:%M:%S.000Z")
  url <- paste0(base,
                from_char,
                "&to=",
                to_char)
  res <- GET(url)
  if (res$status == 200) {
    data <- fromJSON(rawToChar(res$content))
    list <- data$items
    tb <- tibble(
      song_title = list$recording$title,
      artist = map_chr(list$recording$artists, ~.$name[1]),
      played_time = ymd_hms(list$played_time)
    ) %>% 
      arrange(played_time)
    return(tb)
  }
}

get_songs <- function(start, end, progress = FALSE){
  from <- ymd_hms(start)
  to <- from + dhours(5)
  end <- ymd_hms(end)
  
  songs <- NULL
  
  while (to < end) {
    if (progress) {
      print(from)
      print(to)
    }
    tb <- download(from, to)
    songs <- bind_rows(songs, tb)
    from <- to + dseconds(1)
    to <- to + dhours(5)
  }
  tb <- download(from, end)
  songs <- bind_rows(songs, tb)
  return(songs)
}

An example

Using the functions defined above, we can download all the songs played in the year 2021 (measured in AEDT).

songs <- get_songs(ymd_hms("2020-12-31 13:00:01 UTC"),
                   ymd_hms("2021-12-31 12:59:59 UTC",
                   progress = TRUE)

This code takes a little while to run so I have saved the output as a .csv file. There are lots of things I hope to do with this data such as looking at the popularity of different songs over time. For example, here’s a plot showing the cumulative plays of Genesis Owusu’s top six songs.
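
For anyone curious, here is a rough sketch of how such a cumulative-plays plot could be made from the songs tibble (this isn’t necessarily the exact code behind the plot above, and the artist name is just the one from the example):

top_six <- songs %>%
  filter(artist == "Genesis Owusu") %>%
  count(song_title, sort = TRUE) %>%
  slice_head(n = 6) %>%
  pull(song_title)

songs %>%
  filter(artist == "Genesis Owusu", song_title %in% top_six) %>%
  group_by(song_title) %>%
  arrange(played_time, .by_group = TRUE) %>%
  mutate(cumulative_plays = row_number()) %>%   # running count of plays per song
  ungroup() %>%
  ggplot(aes(x = played_time, y = cumulative_plays, colour = song_title)) +
  geom_step() +
  theme_bw()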

I’ve also looked at using the data from last year to get a list of Triple J’s most played artists. Unfortunately, it doesn’t line up with the official list. The issue is that I’m only recording the primary artist on each song and not any of the secondary artists. Hopefully I’ll address this in a later blog post! I would also like to do an analysis of how plays on Triple J correlate with song rankings in the Hottest 100.

Maximum likelihood estimation and the method of moments

Maximum likelihood and the method of moments are two ways to estimate parameters from data. In general, the two methods can differ but for one-dimensional exponential families they produce the same estimates.

Suppose that \{P_\theta\}_{\theta \in \Omega} is a one-dimensional exponential family written in canonical form. That is, \Omega \subseteq \mathbb{R} and there exists a reference measure \mu such that each distribution P_\theta has a density p_\theta with respect to \mu and,

p_\theta(x) = h(x)\exp\left(\theta T(x)-A(\theta)\right).

The random variable T(X) is a sufficient statistic for the model X \sim P_\theta. The function A(\theta) is the log-partition function for the family \{P_\theta\}_{\theta \in \Omega}. The condition \int p_\theta(x)\mu(dx)=1 implies that

A(\theta) = \log\left(\int h(x)\exp(\theta T(x))\mu(dx) \right).

It turns out that the function A(\theta) is differentiable and that differentiation and integration are exchangeable. This implies that

A'(\theta) = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\int h(x)\exp(\theta T(x))\mu(dx)} = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\exp(A(\theta))}

Note that \int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx) = \int T(x)h(x)\exp(\theta T(x))\mu(dx). Thus,

A'(\theta) = \int T(x)h(x) \exp(\theta T(x)-A(\theta))\mu(dx) = \int T(x)p_\theta(x)\mu(dx).

This means that A'(\theta) = \mathbb{E}_\theta[T(X)], the expectation of T(X) under X \sim P_\theta.

Now suppose that we have an i.i.d. sample X_1,\ldots, X_n \sim P_\theta and we want to use this sample to estimate \theta. One way to estimate \theta is by maximum likelihood. That is, we choose the value of \theta that maximises the likelihood,

L(\theta|X_1,\ldots,X_n) = \prod_{i=1}^n p_\theta(X_i).

When using the maximum likelihood estimator, it is often easier to work with the log-likelihood. The log-likelihood is,

\log L(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \log\left(p_\theta(X_i)\right) = \sum_{i=1}^n \left(\log(h(X_i))+\theta T(X_i) - A(\theta)\right).

Maximising the likelihood is equivalent to maximising the log-likelihood. For exponential families, the log-likelihood is a concave function of \theta. Thus the maximisers can be found by differentiation and solving the first order equations. Note that,

\frac{d}{d\theta} \log L(\theta|X_1,\ldots,X_n) =\sum_{i=1}^n \left(T(X_i)-A'(\theta)\right) = -nA'(\theta) + \sum_{i=1}^n T(X_i).

Thus the maximum likelihood estimate (MLE) \widehat{\theta} solves the equation,

-nA'(\widehat{\theta}) + \sum_{i=1}^n T(X_i) = 0.

But recall that A'(\widehat{\theta}) = \mathbb{E}_{\widehat{\theta}}[T(X)]. Thus the MLE is the solution to the equation,

\mathbb{E}_{\widehat{\theta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(X_i).

Thus the MLE is the value of \theta for which the expectation of T(X) matches the empirical average from our sample. That is, the maximum likelihood estimator for an exponential family is a method of moments estimator. Specifically, the maximum likelihood estimator matches the moments of the sufficient statistic T(X).
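
As a small numerical sanity check, here is a sketch using the Poisson family, which in canonical form has T(x) = x and A(\theta) = e^\theta (the family and the sample size are chosen purely for illustration). Maximising the log-likelihood numerically gives the same answer as the method of moments equation A'(\widehat{\theta}) = \frac{1}{n}\sum_{i=1}^n T(X_i), i.e. \widehat{\theta} = \log \bar{X}.

set.seed(1)
x <- rpois(500, lambda = 3)

# Log-likelihood of the Poisson family in canonical form, up to the
# constant sum(log(h(x_i))), which does not involve theta.
neg_loglik <- function(theta) -(theta * sum(x) - length(x) * exp(theta))
theta_mle <- optimize(neg_loglik, interval = c(-10, 10))$minimum

# Method of moments for T(X) = X: solve A'(theta) = exp(theta) = mean(x).
theta_mom <- log(mean(x))

c(mle = theta_mle, mom = theta_mom)  # the two estimates agree up to numerical error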

A counterexample

It is a special property of exponential families that the MLE is a method of moments estimator for the sufficient statistic. When we leave the nice world of exponential families, the two estimators may differ.

Suppose that we have data X_1,\ldots,X_n \sim P_\theta where P_\theta is the uniform distribution on [0,\theta]. A minimal sufficient statistic for this model is X_{(n)} – the maximum of X_1,\ldots, X_n. Given what we saw before, we might imagine that the MLE for this model would be a method of moments estimator for X_{(n)} but this isn’t the case.

The likelihood for X_1,\ldots,X_n is,

L(\theta|X_1,\ldots,X_n) = \begin{cases} \frac{1}{\theta^n} & \text{if } X_{(n)} \le \theta,\\ 0 & \text{if } X_{(n)} > \theta. \end{cases}

Thus the MLE is \widehat{\theta} = X_{(n)}. However, under P_\theta, \frac{1}{\theta}X_{(n)} has a \text{Beta}(n,1) distribution. Thus, \mathbb{E}_\theta[X_{(n)}]  =   \theta \frac{n}{n+1} so the method of moments estimator would be \widehat{\theta}' = \frac{n+1}{n}X_{(n)} \neq X_{(n)} = \widehat{\theta}.
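
A quick simulation makes the difference visible (the values \theta = 1 and n = 10 are arbitrary): the MLE X_{(n)} has mean roughly \theta n/(n+1), while the method of moments estimator \frac{n+1}{n}X_{(n)} is unbiased.

set.seed(1)
n <- 10
theta <- 1

estimates <- replicate(10000, {
  x <- runif(n, min = 0, max = theta)
  c(mle = max(x), mom = (n + 1) / n * max(x))
})

rowMeans(estimates)  # approx. 0.909 for the MLE and 1.000 for the method of moments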

A not so simple conditional expectation

It is winter 2022 and my PhD cohort has moved on to the second quarter of our first-year statistics courses. This means we’ll be learning about generalised linear models in our applied course, asymptotic statistics in our theory course and conditional expectations and martingales in our probability course.

In the first week of our probability course we’ve been busy defining and proving the existence of the conditional expectation. Our approach has been similar to how we constructed the Lebesgue integral in the previous course. Last quarter, we first defined the Lebesgue integral for simple functions, then we used a limiting argument to define the Lebesgue integral for non-negative functions and then finally we defined the Lebesgue integral for arbitrary functions by considering their positive and negative parts.

Our approach to the conditional expectation has been similar but the journey has been different. We again started with simple random variables, then progressed to non-negative random variables and then proved the existence of the conditional expectation of any arbitrary integrable random variable. Unlike for the Lebesgue integral, the hardest step was proving the existence of the conditional expectation of a simple random variable. Progressing from simple random variables to arbitrary random variables was a straightforward application of the monotone convergence theorem and linearity of expectation. But to prove the existence of the conditional expectation of a simple random variable we needed to work with projections in the Hilbert space L^2(\Omega, \mathbb{P},\mathcal{F}).

Unlike the Lebesgue integral, defining the conditional expectation of a simple random variable is not straightforward. One reason for this is that the conditional expectation of a simple random variable need not itself be a simple random variable. This comment was made offhand by our professor and sparked my curiosity. The following example is what I came up with. Below I first go over some definitions and then we dive into the example.

A simple random variable with a conditional expectation that is not simple

Let (\Omega, \mathbb{P}, \mathcal{F}) be a probability space and let \mathcal{G} \subseteq \mathcal{F} be a sub-\sigma-algebra. The conditional expectation of an integrable random variable X is a random variable \mathbb{E}(X|\mathcal{G}) that satisfies the following two conditions:

  1. The random variable \mathbb{E}(X|\mathcal{G}) is \mathcal{G}-measurable.
  2. For all B \in \mathcal{G}, \mathbb{E}[X1_B] = \mathbb{E}[\mathbb{E}(X|\mathcal{G})1_B], where 1_B is the indicator function of B.

The conditional expectation of an integrable random variable always exists and is unique up to almost sure equality. One can think of \mathbb{E}(X|\mathcal{G}) as the expected value of X given the information in \mathcal{G}.

A simple random variable is a random variable X that takes only finitely many values. Simple random variables are always integrable and so \mathbb{E}(X|\mathcal{G}) always exists, but we will see that \mathbb{E}(X|\mathcal{G}) need not be simple.

Consider a random vector (U,V) uniformly distributed on the square [-1,1]^2 \subseteq \mathbb{R}^2. Let D be the unit disc D = \{(u,v) \in \mathbb{R}^2:u^2+v^2 \le 1\}. The random variable X = 1_D(U,V) is a simple random variable since X equals 1 if (U,V) \in D and X equals 0 otherwise. Let \mathcal{G} = \sigma(U) the \sigma-algebra generated by U. It turns out that

\mathbb{E}(X|\mathcal{G}) = \sqrt{1-U^2}.

Thus \mathbb{E}(X|\mathcal{G}) is not a simple random variable. Let Y = \sqrt{1-U^2}. Since Y is a continuous function of U, the random variable is \mathcal{G}-measurable. Thus Y satisfies condition 1. Furthermore if B \in \mathcal{G}, then B = \{U \in A\} for some measurable set A\subseteq [-1,1]. Thus X1_B equals 1 if and only if U \in A and V \in [-\sqrt{1-U^2}, \sqrt{1-U^2}]. Since (U,V) is uniformly distributed we thus have

\mathbb{E}[X1_B] = \int_A \int_{-\sqrt{1-u^2}}^{\sqrt{1-u^2}} \frac{1}{4}dvdu = \int_A \frac{1}{2}\sqrt{1-u^2}du.

The random variable U is uniformly distributed on [-1,1] and thus has density \frac{1}{2}1_{[-1,1]}. Therefore,

\mathbb{E}[Y1_B] = \mathbb{E}[\sqrt{1-U^2}1_{\{U \in A\}}] = \int_A \frac{1}{2}\sqrt{1-u^2}du.

Thus \mathbb{E}[X1_B] = \mathbb{E}[Y1_B] and therefore Y = \sqrt{1-U^2} equals \mathbb{E}(X|\mathcal{G}). Intuitively we can see this because given U=u, we know that X is 1 when V \in [-\sqrt{1-u^2},\sqrt{1-u^2}] and that X is 0 otherwise. Since V is uniformly distributed on [-1,1], the probability that V is in [-\sqrt{1-u^2},\sqrt{1-u^2}] is \sqrt{1-u^2}. Thus given U=u, the expected value of X is \sqrt{1-u^2}.
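
Here is a quick Monte Carlo check of this intuition (the point u = 0.5 and the window width are arbitrary choices): conditioning on U lying in a small window around u, the average of X should be close to \sqrt{1-u^2}.

set.seed(1)
U <- runif(1e6, min = -1, max = 1)
V <- runif(1e6, min = -1, max = 1)
X <- as.numeric(U^2 + V^2 <= 1)  # X = 1_D(U, V)

u <- 0.5
near_u <- abs(U - u) < 0.01      # a small window around u
c(estimate = mean(X[near_u]), target = sqrt(1 - u^2))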

An extension

The previous example suggests an extension that shows just how “complicated” the conditional expectation of a simple random variable can be. I’ll state the extension as an exercise:

Let f:[-1,1]\to \mathbb{R} be any continuous function with f(x) \in [0,1]. With (U,V) and \mathcal{G} as above, show that there exists a measurable set A \subseteq [-1,1]^2 such that \mathbb{E}(1_A(U,V)|\mathcal{G}) = f(U).

Sports climbing at the 2020 Olympics

Last year sports climbing made its Olympic debut. My partner Claire is an avid climber and we watched some of the competition together. The method of ranking the competitors had some surprising consequences which are the subject of this blog.

In the finals, the climbers had to compete in three disciplines – speed climbing, bouldering and lead climbing. Each climber was given a rank for each discipline. For speed climbing, the rank was given based on how quickly the climbers scaled a set route. For bouldering and lead climbing, the rank was based on how far they climbed on routes none of them had seen before.

To get a final score, the ranks from the three disciplines are multiplied together. The best climber is the one with the lowest final score. For example, if a climber was 5th in speed climbing, 2nd in bouldering and 3rd in lead climbing, then they would have a final score of 5\times 2 \times 3 = 30. A climber who came 8th in speed, 1st in bouldering and 2nd in lead climbing would have a lower (and thus better) final score of 8 \times 1 \times 2 = 16.
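
Here is a tiny sketch of the scoring rule applied to the two hypothetical climbers from the example above:

ranks <- data.frame(
  climber = c("climber A", "climber B"),
  speed   = c(5, 8),
  boulder = c(2, 1),
  lead    = c(3, 2)
)
ranks$final_score <- ranks$speed * ranks$boulder * ranks$lead
ranks[order(ranks$final_score), ]  # the lower score (16) ranks higher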

One thing that makes this scoring system interesting is that your final score is very dependent on how well your competitors perform. This was most evident during Jakob Schubert’s lead climb in the men’s final.

Lead climbing is the final discipline and Schubert was the last climber. This meant that ranks for speed climbing and bouldering were already decided. However, the final rankings of the climbers fluctuated hugely as Schubert scaled the 15 metre lead wall. This was because if a climber had done really well in the boulder and speed rounds, being overtaken by Schubert didn’t increase their final score by that much. However, if a climber did poorly in the previous rounds, then being overtaken by Schubert meant their score increased by a lot and they plummeted down the rankings. This can be visualised in the following plot:

Along the x-axis we have how far up the lead wall Schubert climbed (measured by the “hold” he had reached). On the y-axis we have the finalists’ ranking at different stages of Schubert’s climb. We can see that as Schubert climbed and overtook the other climbers the whole ranking fluctuated. Here is a similar plot which also shows when Schubert overtook each competitor on the lead wall:

The volatility of the rankings can really be seen by following Adam Ondra’s ranking (in purple). When Schubert started his climb, Ondra was ranked second. After Schubert passed Alberto Gines Lopez, Ondra was ranked first. But then Schubert passed Ondra, Ondra finished the event in 6th place and Gines Lopez came first. If you go to the 4 hour and 27 minute mark here you can watch Schubert’s climb and hear the commentator explain how both Gines Lopez and Ondra are in the running to win gold.

Similar things happened in the women’s finals. Janja Garnbret was the last lead climber in the women’s final. Here is the same plot which shows the climbers’ final rankings and lead ranking.

Garnbret was a favourite at the Olympics and had come fifth in the speed climbing and first in bouldering. This meant that as long as she didn’t come last in the lead climbing she would at least take home silver, and a strong lead climb would give her the gold. Garnbret ended up coming first in the lead climbing and we can see that as she overtook the last few climbers, their ranking fluctuated wildly.

Here is one more plot which shows each competitor’s final score at different points of Garnbret’s climb.

In the plot you can really see that, depending on how they performed in the previous two events, each climber’s score changed by a different amount once they were overtaken. It also shows how far ahead of the competition Garnbret is – especially when you compare the men’s and women’s finals. Here is the same plot for the men. You can see that the men’s final scores were a lot closer together.

Before I end this post, I would like to make one comment about the culture of sport climbing. In this post I wanted to highlight how tumultuous and complicated the sport climbing rankings were, but if you watched the athletes you’d have no idea the stakes were so high. The climbers celebrated their personal bests as if no one were watching, they traded ideas on how to tackle the lead wall and, the day after the final, they returned to the bouldering wall to try the routes together. Sports climbing is such a friendly and welcoming sport and I would hate for my analysis to give anyone the wrong idea.

Visualising Strava data with R

I’ve recently had some fun downloading and displaying my running data from Strava. I’ve been tracking my runs on Strava for the last five years and I thought it would be interesting to make a map showing where I run. Here is one of the plots I made. Each circle is a place in south east Australia where I ran in the given year. The size of the circle corresponds to how many hours I ran there.

I think the plot is a nice visual diary. Looking at the plot, you can see that most of my running took place in my hometown Canberra and that the time I spend running has been increasing. The plot also shows that most years I’ve spent some time running on the south coast and on my grandfather’s farm near Kempsey. You can also see my trips in 2019 and 2020 to the AMSI Summer School. In 2019, the summer school was hosted by UNSW in Sydney and in 2020 it was hosted by La Trobe University in Melbourne. You can also see a circle from this year at Mt Kosciuszko where I ran the Australian Alpine Ascent with my good friend Sarah.

I also made a plot of all my runs on a world map which shows my recent move to California. In this plot all the circles are the same size and I grouped all the runs across the five different years.

I learnt a few things creating these plots and so I thought I would document how I made them.


Creating the plots

  • Strava lets you download all your data by doing a bulk export. The export includes a zipped folder with all your activities in their original file format.
  • My activities were saved as .gpx files and I used this handy Python library to convert them to .csv files which I could read into R. For the R code I used the packages “tidyverse”, “maps” and “ozmaps”.
  • Now I had a .csv file for each run. In these files each row corresponded to my location at a given second during the run. What I wanted was a single data frame where each row corresponded to a different run. I found the following way to read in and edit each .csv file:
files <- list.files(path = ".", pattern = "*.csv")
listcsv <- lapply(files, function(x) read_csv(paste0(x)) %>% 
                    select(lat, lon, time) %>% 
                    mutate(hours = n()/3600) %>% 
                    filter(row_number() == 1)
                  )

The first line creates a vector with the names of all the .csv files in the working directory. The second line then goes through the list of file names and converts each .csv file into a tibble. I then selected the columns with my location and the time and added a new column with the duration of the run in hours. Finally, I removed all the rows except the first row, which contains the information about where my run started.

  • Next I combined these separate tibbles into a single tibble using rbind(). I then added some new columns for grouping the runs. I added a column with the year and columns with the longitude and latitude rounded to the nearest whole number.
runs <- do.call(rbind, listcsv) %>% 
  mutate(year = format(time,"%Y"),
         approx_lat = round(lat),
         approx_lon = round(lon))
  • To create the plot where you can see where I ran each year, I grouped the runs by the approximate location and by year. I then calculated the total time spent running at each location each year and calculated the average longitude and latitude. I also removed the runs in the USA by only keeping the runs with a negative latitude.
run_counts_year <- runs %>% 
  group_by(approx_lat, approx_lon, year) %>% 
  summarise(hours = sum(hours),
            lat = mean(lat),
            lon = mean(lon),
            .groups = "drop") %>% 
  select(!approx_lat & !approx_lon)

oz_counts_year <- run_counts_year %>% 
  filter(lat < 0)
  • I then used the package “ozmaps” to plot my running locations on a map of the ACT, New South Wales and Victoria.
oz_states <- ozmaps::ozmap_states %>% 
  filter(NAME == "New South Wales" |
         NAME == "Victoria" |
         NAME == "Australian Capital Territory")

ggplot() + 
  geom_sf(data = oz_states) +
  coord_sf() +
  geom_point(data = oz_counts_year,
             mapping = aes(x = lon, 
                           y = lat,
                           size = hours),
            color = "blue",
            shape = 21) +
  facet_wrap(~ year) +
  theme_bw() 
  • Creating the world map was similar except I didn’t group by year and I kept the runs with positive latitude.
run_counts <- runs %>% 
  group_by(approx_lat, approx_lon) %>% 
  summarise(hours = sum(hours),
            lat = mean(lat),
            lon = mean(lon),
            .groups = "drop") %>% 
  select(!approx_lat & !approx_lon)

world <- map_data("world")
ggplot() +
  geom_polygon(data = world,
               mapping = aes(x = long, y = lat, group = group),
               fill = "lightgrey",
               color = "black",
               size = 0.1) +
  geom_point(data = run_counts,
             mapping = aes(x = lon, 
                           y = lat),
             size = 2,
             shape = 1,
             color = "blue") +
  theme_bw()

Extremal couplings

This post is inspired by an assignment question I had to answer for STATS 310A – a probability course at Stanford for first year students in the statistics PhD program. In the question we had to derive a few results about couplings. I found myself thinking and talking about the question long after submitting the assignment and decided to put my thoughts on paper. I would like to thank our lecturer Prof. Diaconis for answering my questions and pointing me in the right direction.

What are couplings?

Given two distribution functions F and G on \mathbb{R}, a coupling of F and G is a distribution function H on \mathbb{R}^2 such that the marginals of H are F and G. Couplings can be used to give probabilistic proofs of analytic statements about F and G (see here). Couplings are also studied in their own right in the theory of optimal transport.

We can think of F and G as being the cumulative distribution functions of some random variables X and Y. A coupling H of F and G thus corresponds to a random vector (\widetilde{X},\widetilde{Y}) where \widetilde{X} has the same distribution as X, \widetilde{Y} has the same distribution as Y and (\widetilde{X},\widetilde{Y})  \sim H.

The independent coupling

For two given distribution functions F and G there exist many possible couplings. For example we could take H = H_I where H_I(x,y) = F(x)G(y). This coupling corresponds to a random vector (\widetilde{X}_I,\widetilde{Y}_I) where \widetilde{X}_I and \widetilde{Y}_I are independent and (as is required for all couplings) \widetilde{X}_I \stackrel{\text{dist}}{=} X, \widetilde{Y}_I \stackrel{\text{dist}}{=} Y.

In some sense the coupling H_I is in the “middle” of all couplings. This is because \widetilde{X} and \widetilde{Y} are independent and so \widetilde{X} doesn’t carry any information about \widetilde{Y}. As the title of the post suggests, there are couplings where this isn’t the case and \widetilde{X} carries “as much information as possible” about \widetilde{Y}.

The two extremal couplings

Define two functions H_L, H_U :\mathbb{R}^2 \to [0,1] by

H_U(x,y) = \min\{F(x), G(y)\} and H_L(x,y) = \max\{F(x)+G(y) - 1, 0\}.

With some work, one can show that H_L and H_U are distribution functions on \mathbb{R}^2 and that they have the correct marginals. In this post I would like to talk about how to construct random vectors (\widetilde{X}_U, \widetilde{Y}_U) \sim H_U and (\widetilde{X}_L, \widetilde{Y}_L) \sim H_L.

Let F^{-1} and G^{-1} be the quantile functions of F and G. That is,

F^{-1}(c) = \inf\{ x \in \mathbb{R} : F(x) \ge c\} and G^{-1}(c) = \inf\{ x \in \mathbb{R} : G(x) \ge c\}.

Now let V be a random variable that is uniformly distributed on [0,1] and define

\widetilde{X}_U = F^{-1}(V) and \widetilde{Y}_U = G^{-1}(V).

Since F^{-1}(V) \le x if and only if V \le F(x), we have \widetilde{X}_U \stackrel{\text{dist}}{=} X and likewise \widetilde{Y}_U \stackrel{\text{dist}}{=} Y. Furthermore \widetilde{X}_U \le x, \widetilde{Y}_U \le y occurs if and only if V \le F(x), V \le G(y) which is equivalent to V \le \min\{F(x),G(y)\}. Thus

\mathbb{P}(\widetilde{X}_U \le x, \widetilde{Y}_U \le y) = \mathbb{P}(V \le \min\{F(x),G(y)\})= \min\{F(x),G(y)\}.

Thus (\widetilde{X}_U,\widetilde{Y}_U) is distributed according to H_U. We see that under the coupling H_U, \widetilde{X}_U and \widetilde{Y}_U are closely related as they are both increasing functions of a common random variable V.

We can follow a similar construction for H_L. Define

\widetilde{X}_L = F^{-1}(V) and \widetilde{Y}_L = G^{-1}(1-V).

Thus \widetilde{X}_L and \widetilde{Y}_L are again functions of a common random variable V but \widetilde{X}_L is an increasing function of V and \widetilde{Y}_L is a decreasing function of V. Note that 1-V is also uniformly distributed on [0,1]. Thus \widetilde{X}_L \stackrel{\text{dist}}{=} X and \widetilde{Y}_L \stackrel{\text{dist}}{=} Y.

Now \widetilde{X}_L \le x, \widetilde{Y}_L \le y occurs if and only if V \le F(x) and 1-V \le G(y) which occurs if and only if 1-G(y) \le V \le F(x). If 1-G(y) \le F(x), then F(x)+G(y)-1 \ge 0 and \mathbb{P}(1-G(y) \le V \le F(x)) =F(x)+G(y)-1. On the other hand, if 1 - G(y) > F(x), then F(x)+G(y)-1< 0 and \mathbb{P}(1-G(y) \le V \le F(x))=0. Thus

\mathbb{P}(\widetilde{X}_L \le x, \widetilde{Y}_L \le y) = \mathbb{P}(1-G(y) \le V \le F(x)) = \max\{F(x)+G(y)-1,0\},

and so (\widetilde{X}_L,\widetilde{Y}_L) is distributed according to H_L.
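
Here is a small simulation sketch of the three couplings, using Exp(1) and N(0,1) marginals purely for illustration. Each pair is built from uniform random variables via the quantile functions, and we can check empirically that the joint distribution functions match H_U and H_L at a point.

set.seed(1)
V <- runif(1e5)
x_u <- qexp(V);  y_u <- qnorm(V)           # (X_U, Y_U) ~ H_U
x_l <- qexp(V);  y_l <- qnorm(1 - V)       # (X_L, Y_L) ~ H_L
x_i <- qexp(V);  y_i <- qnorm(runif(1e5))  # the independent coupling H_I

# Same marginals, very different dependence.
c(upper = cor(x_u, y_u), independent = cor(x_i, y_i), lower = cor(x_l, y_l))

# The joint distribution functions at the point (x, y) = (1, 0).
c(empirical = mean(x_u <= 1 & y_u <= 0), H_U = min(pexp(1), pnorm(0)))
c(empirical = mean(x_l <= 1 & y_l <= 0), H_L = max(pexp(1) + pnorm(0) - 1, 0))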

What makes H_U and H_L extreme?

Now that we know that H_U and H_L are indeed couplings, it is natural to ask what makes them “extreme”. What we would like to say is that \widetilde{Y}_U is an increasing function of \widetilde{X}_U and \widetilde{Y}_L is a decreasing function of \widetilde{X}_L. Unfortunately this isn’t always the case as can be seen by taking X to be constant and Y to be continuous.

However the intuition that \widetilde{Y}_U is increasing in \widetilde{X}_U and \widetilde{Y}_L is decreasing in \widetilde{X}_L is close to correct. Given a coupling (\widetilde{X},\widetilde{Y}) \sim H, we can look at the quantity

C(x,y) = \mathbb{P}(\widetilde{Y} \le y | \widetilde{X} \le x) -\mathbb{P}(\widetilde{Y} \le y) = \frac{H(x,y)}{F(x)}-G(y)

This quantity tells us something about how \widetilde{Y} changes with \widetilde{X}. For instance if \widetilde{X} and \widetilde{Y} were positively correlated, then C(x,y) would be positive and if \widetilde{X} and \widetilde{Y} were negatively correlated, then C(x,y) would be negative.

For the independent coupling (\widetilde{X}_I,\widetilde{Y}_I) \sim H_I, the quantity C(x,y) is constantly 0. It turns out that this quantity is maximised by the coupling (\widetilde{X}_U, \widetilde{Y}_U) \sim H_U and minimised by (\widetilde{X}_L,\widetilde{Y}_L) \sim H_L, and it is in this sense that they are extremal. This final claim is the two-dimensional version of the Fréchet-Hoeffding Theorem and checking it is a good exercise.

You could have proved the Neyman-Pearson lemma

The Neyman-Pearson lemma is a foundational and important result in the theory of hypothesis testing. When it was presented in class the proof seemed magical and I had no idea where the ideas came from. I even drew a face like this 😲 next to the usual \square in my book when the proof was finished. Later in class we learnt the method of undetermined multipliers and suddenly I saw where the Neyman-Pearson lemma came from.

In this blog post, I’ll first give some background and set up notation for the Neyman-Pearson lemma. Then I’ll talk about the method of undetermined multipliers and show how it can be used to derive and prove the Neyman-Pearson lemma. Finally, I’ll write about why I still think the Neyman-Pearson lemma is magical despite the demystified proof.

Background

In the set up of the Neyman-Pearson lemma we have data x which is a realisation of some random variable X. We wish to conclude something about the distribution of X from our data x by doing a hypothesis test.

In the Neyman-Pearson lemma we have simple hypotheses. That is our data either comes from the distribution \mathbb{P}_0 or from the distribution \mathbb{P}_1. Thus, our null hypothesis is H_0 : X \sim \mathbb{P}_0 and our alternative hypothesis is H_1 : X \sim \mathbb{P}_1.

A test of H_0 against H_1 is a function \phi that takes in data x and returns a number \phi(x) \in [0,1]. The value \phi(x) is the probability under the test \phi of rejecting H_0 given the observed data x. That is, if \phi(x) = 1, we always reject H_0 and if \phi(x)=0 we never reject H_0. For in-between values \phi(x) = \gamma \in (0,1), we reject H_0 with probability \gamma.

An ideal test would have two desirable properties. We would like a test that rejects H_0 with a low probability when H_0 is true but we would also like the test to reject H_0 with a high probability when H_1 is true. To state this more formally, let \mathbb{E}_0[\phi(X)] and \mathbb{E}_1[\phi(X)] be the expectation of \phi(X) under H_0 and H_1 respectively. The quantity \mathbb{E}_0[\phi(X)] is the probability we reject H_0 when H_0 is true. Likewise, the quantity \mathbb{E}_1[\phi(X)] is the probability we reject H_0 when H_1 is true. An optimal test would be one that minimises \mathbb{E}_0[\phi(X)] and maximises \mathbb{E}_1[\phi(X)].

Unfortunately the goals of minimising \mathbb{E}_0[\phi(X)] and maximising \mathbb{E}_1[\phi(X)] are at odds with one another. This is because we want \phi to be small in order to minimise \mathbb{E}_0[\phi(X)] and we want \phi to be large to maximise \mathbb{E}_1[\phi(X)]. In nearly all cases we have to trade off between these two goals and there is no single test that simultaneously achieves both.

To work around this, a standard approach is to focus on maximising \mathbb{E}_1[\phi(X)] while requiring that \mathbb{E}_0[\phi(X)] remains below some threshold. The quantity \mathbb{E}_1[\phi(X)] is called the power of the test \phi. If \alpha is a number between 0 and 1, we will say that \phi has level \alpha if \mathbb{E}_0[\phi(X)] \le \alpha. A test \phi is said to be most powerful at level \alpha if

  • The test \phi is level-\alpha.
  • For all level-\alpha tests \phi', the test \phi is at least as powerful as \phi'. That is,

\mathbb{E}_1[\phi'(X)] \le \mathbb{E}_1[\phi(X)].

Thus we can see that finding a most powerful level-\alpha test is a constrained optimisation problem. We wish to maximise the quantity

\mathbb{E}_1[\phi(X)]

subject to the constraint

\mathbb{E}_0[\phi(X)] \le \alpha

With this in mind, we turn to the method of undetermined multipliers.

The method of undetermined multipliers

The method of undetermined multipliers (also called the method of Lagrange multipliers) is a very general optimisation tool. Suppose that we have a set U and two functions f,g : U \to \mathbb{R} and we wish to maximise f(u) subject to the constraint g(u) \le 0.

In the context of hypothesis testing, the set U is the set of all tests \phi. The objective function f is given by f(\phi) = \mathbb{E}_1[\phi(X)]. That is, f(\phi) is the power of the test \phi. The constraint function g is given by g(\phi)=\mathbb{E}_0[\phi(X)]-\alpha so that g(\phi) \le 0 if and only if \phi has level \alpha.

The method of undetermined multipliers allows us to reduce this constrained optimisation problem to a (hopefully easier) unconstrained optimisation problem. More specifically we have the following result:

Proposition: Suppose that u^* \in U is such that:

  • g(u^*) = 0,
  • There exists k \ge 0, such that u^* maximises f(u)-kg(u) over all u \in U.

Then u^* maximises f(u) under the constraint g(u) \le 0.

Proof: Suppose that u^* satisfies the above two dot points. We need to show that for all u \in U, if g(u) \le 0, then f(u) \le f(u^*). By assumption we know that g(u^*)=0 and u^* maximises f(u)-kg(u). Thus, for all u \in U,

f(u^*)=f(u^*)-kg(u^*) \ge f(u)-kg(u).

Now suppose g(u) \le 0. Then, kg(u) \le 0 and so f(u)-kg(u)\ge f(u) and so f(u^*) \ge f(u) as needed. \square

The constant k is the undetermined multiplier. The way to use the method of undetermined multipliers is to find values u_k^* that maximise h_k(u) = f(u)-kg(u) for each k \ge 0. The multiplier k is then varied so that the constraint g(u^*_k) = 0 is satisfied.

Proving the Neyman-Pearson lemma

Now let’s use the method of undetermined multipliers to find most powerful tests. Recall the set U which we are optimising over is the set of all tests \phi. Recall also that we wish to optimise f(\phi) = \mathbb{E}_1[\phi(X)] subject to the constraint g(\phi) = \mathbb{E}_0[\phi(X)] - \alpha \le 0. The method of undetermined multipliers says that we should consider maximising the function

h_k(\phi) = \mathbb{E}_1[\phi(X)] - k\mathbb{E}_0[\phi(X)],

where k \ge 0. Suppose that both \mathbb{P}_0 and \mathbb{P}_1 have densities1 p_0 and p_1 with respect to some measure \mu. We can then write the above expectations as integrals. That is,

\mathbb{E}_0[\phi(X)] = \int \phi(x)p_0(x)\mu(dx) and \mathbb{E}_1[\phi(X)] = \int \phi(x)p_1(x)\mu(dx).

Thus the function h_k is equal to

h_k(\phi) =  \int \phi(x)p_1(x)\mu(dx)- k\int \phi(x)p_0(x)\mu(dx)=\int \phi(x)(p_1(x)-kp_0(x))\mu(dx).

We now wish to maximise the function h_k(\phi). Recall that \phi is a function that take values in [0,1]. Thus, the integral

\int \phi(x)(p_1(x)-kp_0(x))\mu(dx),

is maximised if and only if \phi(x)=1 when p_1(x)-kp_0(x) > 0 and \phi(x)=0 when p_1(x)-kp_0(x) < 0. Note that p_1(x)-kp_0(x) > 0 if and only if \frac{p_1(x)}{p_0(x)} > k. Thus for each k \ge 0, a test \phi_k maximises h_k(\phi) if and only if

\phi_k(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > k,\\ 0 &\text{if } \frac{p_1(x)}{p_0(x)} < k. \end{cases}

The method of undetermined multipliers says that if we can find k so that the above is satisfied and g(\phi_k) = 0, then \phi_k is a most powerful test. Recall that g(\phi_k)=0 is equivalent to \mathbb{E}_0[\phi_k(X)]=\alpha. By summarising the above argument, we arrive at the Neyman-Pearson lemma,

Neyman-Pearson Lemma2: Suppose that \phi is a test such that

  • \mathbb{E}_0[\phi(X)] = \alpha, and
  • For some k \ge 0, \phi(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > k,\\ 0 & \text{if } \frac{p_1(x)}{p_0(x)}< k.\end{cases}

then \phi is most powerful at level-\alpha for testing H_0 : X \sim \mathbb{P}_0 against H_1 : X \sim \mathbb{P}_1.
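
As a small illustration (this isn’t part of the proof, and the distributions are chosen arbitrarily), consider a single observation and the test of H_0 : X \sim N(0,1) against H_1 : X \sim N(1,1) at level \alpha = 0.05. The likelihood ratio p_1(x)/p_0(x) = e^{x - 1/2} is increasing in x, so the most powerful test rejects when x exceeds a cut-off chosen so that the size is \alpha.

alpha <- 0.05
cutoff <- qnorm(1 - alpha)       # chosen so that E_0[phi(X)] = alpha
phi <- function(x) as.numeric(x > cutoff)

set.seed(1)
mean(phi(rnorm(1e6, mean = 0)))  # size: approximately 0.05
mean(phi(rnorm(1e6, mean = 1)))  # power: approximately 1 - pnorm(cutoff - 1) = 0.26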

The magic of Neyman-Pearson

By learning about undetermined multipliers I’ve been able to better understand the proof of the Neyman-Pearson lemma. I now view it as a clever solution to a constrained optimisation problem rather than something that comes out of nowhere.

There is, however, a different aspect of Neyman-Pearson that continues to surprise me. This aspect is the usefulness of the lemma. At first glance the Neyman-Pearson lemma seems to be a very specialised result because it is about simple hypothesis testing. In reality most interesting hypothesis tests have composite nulls or composite alternatives or both. It turns out that Neyman-Pearson is still incredibly useful for composite testing. Through ideas like monotone likelihood ratios, least favourable distributions and unbiasedness, the Neyman-Pearson lemma or similar ideas can be used to find optimal tests in a variety of settings.

Thus I must admit that the title of this blog post is a little inaccurate and deceptive. I do believe that, given the tools of undetermined multipliers and the set up of simple hypothesis testing, one is naturally led to the Neyman-Pearson lemma. However, I don’t believe that many could have realised how useful and interesting simple hypothesis testing would be.

Footnotes

  1. The assumption that \mathbb{P}_0 and \mathbb{P}_1 have densities with respect to a common measure \mu is not a restrictive assumption since one can always take \mu = \mathbb{P}_0+\mathbb{P}_1 and then apply the Radon-Nikodym theorem. However, there is often a more natural choice of \mu, such as Lebesgue measure on \mathbb{R}^d or the counting measure on \mathbb{N}.
  2. What I call the Neyman-Pearson lemma is really only a third of the Neyman-Pearson lemma. There are two other parts. One that guarantees the existence of a most powerful test and one that is a partial converse to the statement in this post.

An art and maths collaboration

Over the past year I have had the pleasure of working with the artist Sanne Carroll on her honours project at the Australian National University. I was one of two mathematics students who collaborated with Sanne, the other being Ciaran. Over the course of the project Sanne drew patterns and would ask Ciaran and me to recreate them using some mathematical or algorithmic ideas. You can see the final version of the project here: https://www.sannecarroll.com/ (best viewed on a computer).

I always loved the patterns Sanne drew and the final project is so well put together. Sanne does a great job of incorporating her drawings, the mathematical descriptions and the communication between her, Ciaran and me. Her website building skills also far surpass anything I’ve done on this blog!

It was also a lot of fun to work with Sanne. Hearing about her patterns and talking about maths with her was always fun. I also learnt a few things about GeoGebra which made the animations in my previous post a lot quicker to make. Sanne has told me that she’ll be starting a PhD soon and I’m looking forward to any future collaborations that might arise.