## Log-sum-exp trick with negative numbers

Suppose you want to calculate an expression of the form

$\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right)},$

where $\sum_{i=1}^n \exp(a_i) > \sum_{j=1}^m \exp(b_j)$. Such expressions can be difficult to evaluate directly since the exponentials can easily cause overflow errors. In this post, I’ll talk about a clever way to avoid such errors.

If there were no terms in the second sum we could use the log-sum-exp trick. That is, to calculate

$\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) \right)},$

we set $a=\max\{a_i : 1 \le i \le n\}$ and use the identity

$\displaystyle{a + \log\left(\sum_{i=1}^n \exp(a_i-a)\right) = \log\left(\sum_{i=1}^n \exp(a_i)\right)}.$

Since $a_i \le a$ for all $i =1,\ldots,n$, the left-hand side of the above equation can be computed without the risk of overflow. To calculate

$\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j)\right)},$

we can use the above method to separately calculate

$\displaystyle{A = \log\left(\sum_{i=1}^n \exp(a_i) \right)}$ and $\displaystyle{B = \log\left(\sum_{j=1}^m \exp(b_j)\right)}.$

The final result we want is

$\displaystyle{\log\left(\sum_{i=1}^n \exp(a_i) - \sum_{j=1}^m \exp(b_j) \right) = \log(\exp(A) - \exp(B)) = A + \log(1-\exp(B-A))}.$

Since $A > B$, the right hand side of the above expression can be evaluated safely and we will have our final answer.
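As a quick numeric check of both identities (the inputs here are arbitrary):

```r
# Naive evaluation overflows: exp(1000) is Inf in double precision
a_i <- c(1000, 1001)
log(sum(exp(a_i)))               # Inf

# The log-sum-exp trick recovers the correct value
a <- max(a_i)
A <- a + log(sum(exp(a_i - a)))  # about 1001.3133

# Subtracting a second, smaller sum: rewrite log(exp(A) - exp(B))
# as A + log(1 - exp(B - A)), which never exponentiates a large number
B <- 999
A + log1p(-exp(B - A))           # about 1001.2091
```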

### R code

The R code below defines a function that performs the above procedure.

```r
# Safely compute log(sum(exp(pos)) - sum(exp(neg))).
# The default value for neg is an empty vector.
logSumExp <- function(pos, neg = c()){
  max_pos <- max(pos)
  A <- max_pos + log(sum(exp(pos - max_pos)))

  # If neg is empty, the calculation is done
  if (length(neg) == 0){
    return(A)
  }

  # If neg is non-empty, calculate B
  max_neg <- max(neg)
  B <- max_neg + log(sum(exp(neg - max_neg)))

  # Check that A is bigger than B
  if (A <= B) {
    stop("sum(exp(pos)) must be larger than sum(exp(neg))")
  }

  # log1p() is a built-in function that accurately calculates log(1+x) for |x| << 1
  return(A + log1p(-exp(B - A)))
}
```

### An example

The above procedure can be used to evaluate

$\displaystyle{\log\left(\sum_{i=1}^{200} i! - \sum_{i=1}^{500} \binom{500}{i}^2\right)}.$

Evaluating this directly would quickly lead to errors since R (and most other programming languages) cannot compute $200!$. However, R has the functions lfactorial() and lchoose() which can compute $\log(i!)$ and $\log\binom{n}{i}$ for large values of $i$ and $n$. We can thus put this expression in the general form at the start of this post

$\displaystyle{\log\left(\sum_{i=1}^{200} \exp(\log(i!)) - \sum_{i=1}^{500} \exp\left(2\log\binom{500}{i}\right)\right)}.$

The following R code thus gives us exactly what we want:

```r
pos <- sapply(1:200, lfactorial)
neg <- sapply(1:500, function(i){2*lchoose(500, i)})

logSumExp(pos, neg)
# 863.237
```

## Finding Australia’s youngest electorates with R

My partner recently wrote an article for Changing Times, a grassroots newspaper that focuses on social change. Her article, Who’s not voting? Engaging with First Nations voters and young voters, is about voter turn-out in Australia and is an important read.

While doing research for the article, she wanted to know which electorates in Australia had the highest proportion of young voters. Fortunately the Australian Electoral Commission (AEC) keeps a detailed record of the number of electors in each electorate available here. The records list the number of voters of a given sex and age bracket in each of Australia’s 153 electorates. To calculate and sort the proportion of young voters (18-24) in each electorate using the most recent records, I wrote the below R code for her:

```r
library(tidyverse)

# NB: "elector_count.csv" is a placeholder file name for the
# records downloaded from the AEC website.
voters <- read_csv("elector_count.csv",
                   skip = 8,
                   col_names = c("Description",
                                 "18-19", "20-24", "25-29",
                                 "30-34", "35-39", "40-44",
                                 "45-49", "50-54", "55-59",
                                 "60-64", "65-69", "70+",
                                 "Total"),
                   col_select = (1:14),
                   col_types = cols(Description = "c",
                                    .default = "n"))
```

```r
not_electorates <- c("NSW", "VIC", "ACT", "WA", "SA", "TAS",
                     "QLD", "NT", "Grand Total", "Female",
                     "Male", "Indeterminate/Unknown")

electorates <- voters %>%
  filter(!(Description %in% not_electorates),
         !is.na(Description))

young_voters <- electorates %>%
  mutate(Young = `18-19` + `20-24`,
         Proportion = Young/Total,
         rank = min_rank(desc(Proportion))) %>%
  arrange(rank) %>%
  select(Electorate = Description, Total, Young, Proportion, rank)

young_voters
```

To explain what I did, I’ll first describe the format of the data set from the AEC. The first column contained a description of each row. All the other columns contained the number of voters in a given age bracket. Each row corresponded to either an electorate or a state, and contained counts for either the whole electorate or a single sex.

For the article my partner wanted to know which electorate had the highest proportion of young voters (18-24) in total, so we removed the rows for each state and the rows that selected a specific sex. The next step was to calculate the number of young voters across the age brackets 18-19 and 20-24 and then calculate the proportion of such voters. Once ranked, we could read off the five electorates with the highest proportion of young voters.

In Who’s not Voting?, my partner points out that the three seats with the highest proportion of young voters all swung to the Greens at the most recent election. To view the proportion of young voters across every electorate, you can download this spreadsheet.

## Working with ABC Radio’s API in R

This post is about using ABC Radio’s API to create a record of all the songs played on Triple J in a given time period. An API (application programming interface) is a connection between two computer programs. Many websites have APIs that allow users to access data from that website. ABC Radio has an API that lets users access a record of the songs played on the station Triple J since around May 2014. Below I’ll show how to access this information and how to transform it into a format that’s easier to work with. All code can be found on GitHub here.

### Packages

To access the API in R I used the packages “httr” and “jsonlite”. To transform the data from the API I used the packages “tidyverse” and “lubridate”.

```r
library(httr)
library(jsonlite)
library(tidyverse)
library(lubridate)
```

The URL for the ABC radio’s API is:

```
https://music.abcradio.net.au/api/v1/plays/search.json
```

If you type this into a web browser, you can see a bunch of text which actually lists information about the ten most recently played songs on ABC’s different radio stations. To see only the songs played on Triple J, you can go to:

```
https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej
```

The extra text “station=triplej” is called a parameter. We can add more parameters such as “from” and “to”. The link:

```
https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&from=2021-01-16T00:00:00.000Z&to=2021-01-16T00:30:00.000Z
```

shows information about the songs played in the 30 minutes after midnight UTC on the 16th of January last year. The last parameter that we will use is limit, which specifies the number of songs to include. The link:

```
https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=1
```

includes information about the most recently played song. The default for limit is 10 and the largest possible value is 100. We’ll see later that this cut-off at 100 makes downloading a lot of songs a little tricky. But first let’s see how we can access the information from ABC Radio’s API in R.
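As an aside, these query strings are easy to assemble programmatically. A small sketch in R using paste0() (the parameter values here are just examples):

```r
base_url <- "https://music.abcradio.net.au/api/v1/plays/search.json"

# Build a query for songs played in a given half-hour window
query <- paste0(base_url,
                "?station=triplej",
                "&from=2015-05-05T04:30:00.000Z",
                "&to=2015-05-05T05:00:00.000Z",
                "&limit=100")
query
```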

### Accessing an API in R

We now know how to use ABC Radio’s API to get information about the songs played on Triple J, but how do we use this information in R, and how is the information stored? The function GET from the package “httr” enables us to access the API in R. The input to GET is simply the same sort of URL that we’ve been using to view the API online. The below code stores information about the 5 songs played on Triple J just before 5 am on the 5th of May 2015:

```r
res <- GET("https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=5&to=2015-05-05T05:00:00.000Z")
res
# Date: 2022-03-02 23:25
# Status: 200
# Content-Type: application/json
# Size: 62.9 kB
```

The line “Status: 200” tells us that we have successfully grabbed some data from the API. There are many other status codes that mean different things, but for now we’ll be happy remembering that “Status: 200” means everything has worked.

The information from the API is stored within the object res. To change it from a JSON file to a list we can use the function “fromJSON” from the library jsonlite. This is done below:

```r
data <- fromJSON(rawToChar(res$content))
names(data)
# "total" "offset" "items"
```

The information about the individual songs is stored under items. There is a lot of information about each song including song length, copyright information and a link to an image of the album cover. For now, we’ll just try to find the name of the song, the name of the artist and the time the song played. We can find each of these under “items”. Finding the song title and the played time is pretty straightforward:

```r
data$items$recording$title
# "First Light" "Dog"         "Begin Again" "Rot"         "Back To You"
data$items$played_time
# "2015-05-05T04:55:56+00:00" "2015-05-05T04:52:21+00:00" "2015-05-05T04:47:52+00:00" "2015-05-05T04:41:00+00:00" "2015-05-05T04:38:36+00:00"
```

Finding the artist name is a bit trickier.

```r
data$items$recording$artists
# [[1]]
#   entity       arid          name artwork
# 1 Artist maQeOWJQe1 Django Django    NULL
#   links
# 1 Link, mlD5AM2R5J, http://musicbrainz.org/artist/4bfce038-b1a0-4bc4-abe1-b679ab900f03, 4bfce038-b1a0-4bc4-abe1-b679ab900f03, MusicBrainz artist, NA, NA, NA, service, musicbrainz, TRUE
#   is_australian    type role
# 1            NA primary   NA
#
# [[2]]
#   entity       arid      name artwork
# 1 Artist maXE59XXz7 Andy Bull    NULL
#   links
# 1 Link, mlByoXMP5L, http://musicbrainz.org/artist/3a837db9-0602-4957-8340-05ae82bc39ef, 3a837db9-0602-4957-8340-05ae82bc39ef, MusicBrainz artist, NA, NA, NA, service, musicbrainz, TRUE
#   is_australian    type role
# 1            NA primary   NA
# ....
# ....
```

We can see that each song actually has a lot of information about each artist but we’re only interested in the artist name. By using “map_chr()” from the library “tidyverse” we can grab each song’s artist name.

```r
map_chr(data$items$recording$artists, ~.$name[1])
# [1] "Django Django" "Andy Bull"     "Purity Ring"   "Northlane"     "Twerps"
```

Using name[1] means that if there are multiple artists, then we only select the first one. With all of these in place we can create a tibble with the information about these five songs.

```r
list <- data$items
tb <- tibble(
  song_title = list$recording$title,
  artist = map_chr(list$recording$artists, ~.$name[1]),
  played_time = ymd_hms(list$played_time)
) %>%
  arrange(played_time)
tb
# # A tibble: 5 x 3
#   song_title  artist        played_time
#   <chr>       <chr>         <dttm>
# 1 Back To You Twerps        2015-05-05 04:38:36
# 2 Rot         Northlane     2015-05-05 04:41:00
# 3 Begin Again Purity Ring   2015-05-05 04:47:52
# 4 Dog         Andy Bull     2015-05-05 04:52:21
# 5 First Light Django Django 2015-05-05 04:55:56
```

The maximum number of songs we can access from the API at one time is 100. This means that if we want to download a lot of songs we’ll need to use some sort of loop. Below is some code which takes in two dates and times in UTC and fetches all the songs played between the two times. The idea is simply to grab all the songs played in a five-hour interval and then move on to the next five-hour interval. I found including an optional argument “progress” useful for debugging. This code is from the file get_songs.r on my GitHub.

```r
# Query prefix (reconstructed here): station and limit are fixed,
# the from/to timestamps are appended per request
base <- "https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=100&from="

download <- function(from, to){
  from_char <- format(from, "%Y-%m-%dT%H:%M:%S.000Z")
  to_char <- format(to, "%Y-%m-%dT%H:%M:%S.000Z")
  url <- paste0(base,
                from_char,
                "&to=",
                to_char)
  res <- GET(url)
  if (res$status == 200) {
    data <- fromJSON(rawToChar(res$content))
    list <- data$items
    tb <- tibble(
      song_title = list$recording$title,
      artist = map_chr(list$recording$artists, ~.$name[1]),
      played_time = ymd_hms(list$played_time)
    ) %>%
      arrange(played_time)
    return(tb)
  }
}
```

```r
get_songs <- function(start, end, progress = FALSE){
  from <- ymd_hms(start)
  to <- from + dhours(5)
  end <- ymd_hms(end)

  songs <- NULL

  while (to < end) {
    if (progress) {
      print(from)
      print(to)
    }
    songs <- bind_rows(songs, download(from, to))
    from <- to + dseconds(1)
    to <- to + dhours(5)
  }
  songs <- bind_rows(songs, download(from, end))
  return(songs)
}
```

### An example

Using the functions defined above, we can download all the songs played in the year 2021 (measured in AEDT).

```r
songs <- get_songs(ymd_hms("2020-12-31 13:00:01 UTC"),
                   ymd_hms("2021-12-31 12:59:59 UTC"),
                   progress = TRUE)
```


This code takes a little while to run so I have saved the output as a .csv file. There are lots of things I hope to do with this data such as looking at the popularity of different songs over time. For example, here’s a plot showing the cumulative plays of Genesis Owusu’s top six songs.

I’ve also looked at using the data from last year to get a list of Triple J’s most played artists. Unfortunately, it doesn’t line up with the official list. The issue is that I’m only recording the primary artist on each song and not any of the secondary artists. Hopefully I’ll address this in a later blog post! I would also like to do an analysis of how plays on Triple J correlate with song rankings in the Hottest 100.

## Sports climbing at the 2020 Olympics

Last year sports climbing made its Olympic debut. My partner Claire is an avid climber and we watched some of the competition together. The method of ranking the competitors had some surprising consequences which are the subject of this blog.

In the finals, the climbers had to compete in three disciplines – speed climbing, bouldering and lead climbing. Each climber was given a rank for each discipline. For speed climbing, the rank was given based on how quickly the climbers scaled a set route. For bouldering and lead climbing, the rank was based on how far they climbed on routes none of them had seen before.

To get a final score, the ranks from the three disciplines are multiplied together. The best climber is the one with the lowest final score. For example, if a climber was 5th in speed climbing, 2nd in bouldering and 3rd in lead climbing, then they would have a final score of $5\times 2 \times 3 = 30$. A climber who came 8th in speed, 1st in bouldering and 2nd in lead climbing would have a lower (and thus better) final score of $8 \times 1 \times 2 = 16$.
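The arithmetic in this worked example can be checked in a couple of lines of R:

```r
# Ranks in (speed, boulder, lead) for the two hypothetical climbers
climber_1 <- c(5, 2, 3)
climber_2 <- c(8, 1, 2)
prod(climber_1)  # 30
prod(climber_2)  # 16
```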

One thing that makes this scoring system interesting is that your final score is very dependent on how well your competitors perform. This was most evident during Jakob Schubert’s lead climb in the men’s final.

Lead climbing is the final discipline and Schubert was the last climber. This meant that ranks for speed climbing and bouldering were already decided. However, the final rankings of the climbers fluctuated hugely as Schubert scaled the 15 metre lead wall. This was because if a climber had done really well in the boulder and speed rounds, being overtaken by Schubert didn’t increase their final score by that much. However, if a climber did poorly in the previous rounds, then being overtaken by Schubert meant their score increased by a lot and they plummeted down the rankings. This can be visualised in the following plot:

Along the x-axis we have how far up the lead wall Schubert climbed (measured by the “hold” he had reached). On the y-axis we have the finalists’ ranking at different stages of Schubert’s climb. We can see that as Schubert climbed and overtook the other climbers the whole ranking fluctuated. Here is a similar plot which also shows when Schubert overtook each competitor on the lead wall:

The volatility of the rankings can really be seen by following Adam Ondra’s ranking (in purple). When Schubert started his climb, Ondra was ranked second. After Schubert passed Alberto Gines Lopez, Ondra was ranked first. But then Schubert passed Ondra; Ondra finished the event in 6th place and Gines Lopez came first. If you go to the 4 hour and 27 minute mark here you can watch Schubert’s climb and hear the commentator explain how both Gines Lopez and Ondra are in the running to win gold.

Similar things happened in the women’s finals. Janja Garnbret was the last lead climber in the women’s final. Here is the same plot which shows the climbers’ final rankings and lead ranking.

Garnbret was a favourite at the Olympics and had come fifth in the speed climbing and first in bouldering. This meant that as long as she didn’t come last in the lead climbing she would at least take home silver, and a strong enough lead climb would give her the gold. Garnbret ended up coming first in the lead climbing and we can see that as she overtook the last few climbers, their rankings fluctuated wildly.

Here is one more plot which shows each competitor’s final score at different points of Garnbret’s climb.

In the plot you can really see that, depending on how they performed in the previous two events, each climber’s score changed by a different amount once they were overtaken. It also shows how far ahead of the competition Garnbret is – especially when you compare the men’s and women’s finals. Here is the same plot for the men. You can see that the men’s final scores were a lot closer together.

Before I end this post, I would like to make one comment about the culture of sport climbing. In this post I wanted to highlight how tumultuous and complicated the sport climbing rankings were but if you watched the athletes you’d have no idea the stakes were so high. The climbers celebrate their personal bests as if no one was watching, they trade ideas on how to tackle the lead wall and the day after the final, they returned to the bouldering wall to try the routes together. Sports climbing is such a friendly and welcoming sport and I would hate for my analysis to give anyone the wrong idea.

## Visualising Strava data with R

I’ve recently had some fun downloading and displaying my running data from Strava. I’ve been tracking my runs on Strava for the last five years and I thought it would be interesting to make a map showing where I run. Here is one of the plots I made. Each circle is a place in south east Australia where I ran in the given year. The size of the circle corresponds to how many hours I ran there.

I think the plot is a nice visual diary. Looking at the plot, you can see that most of my running took place in my hometown Canberra and that the time I spend running has been increasing. The plot also shows that most years I’ve spent some time running on the south coast and on my grandfather’s farm near Kempsey. You can also see my trips in 2019 and 2020 to the AMSI Summer School. In 2019, the summer school was hosted by UNSW in Sydney and in 2020 it was hosted by La Trobe University in Melbourne. You can also see a circle from this year at Mt Kosciuszko where I ran the Australian Alpine Ascent with my good friend Sarah.

I also made a plot of all my runs on a world map which shows my recent move to California. In this plot all the circles are the same size and I grouped all the runs across the five different years.

I learnt a few things creating these plots and so I thought I would document how I made them.

### Creating the plots

- Strava lets you download all your data by doing a bulk export. The export includes a zipped folder with all your activities in their original file format.
- My activities were saved as .gpx files and I used this handy python library to convert them to .csv files which I could read into R. For the R code I used the packages “tidyverse”, “maps” and “ozmaps”.
- Now I had a .csv file for each run. In these files each row corresponded to my location at a given second during the run. What I wanted was a single data frame where each row corresponded to a different run. I found the following way to read in and edit each .csv file:

```r
files <- list.files(path = ".", pattern = "*.csv")
listcsv <- lapply(files, function(x) read_csv(x) %>%
                    select(lat, lon, time) %>%
                    mutate(hours = n()/3600) %>%
                    filter(row_number() == 1)
                  )
```


The first line creates a list of all the .csv files in the working directory. The second line then goes through the list of file names and converts each .csv file into a tibble. I then selected the columns with the time and my location and added a new column with the duration of the run in hours. Finally I removed all the rows except the first row, which contains the information about where my run started.

- Next I combined these separate tibbles into a single tibble using rbind(). I then added some new columns for grouping the runs: a column with the year, and columns with the longitude and latitude rounded to the nearest whole number.

```r
runs <- do.call(rbind, listcsv) %>%
  mutate(year = format(time, "%Y"),
         approx_lat = round(lat),
         approx_lon = round(lon))
```
- To create the plot where you can see where I ran each year, I grouped the runs by the approximate location and by year. I then calculated the total time spent running at each location each year and calculated the average longitude and latitude. I also removed the runs in the USA by only keeping the runs with a negative latitude.

```r
run_counts_year <- runs %>%
  group_by(approx_lat, approx_lon, year) %>%
  summarise(hours = sum(hours),
            lat = mean(lat),
            lon = mean(lon),
            .groups = "drop") %>%
  select(!approx_lat & !approx_lon)

oz_counts_year <- run_counts_year %>%
  filter(lat < 0)
```
- I then used the package “ozmaps” to plot my running locations on a map of the ACT, New South Wales and Victoria.

```r
oz_states <- ozmaps::ozmap_states %>%
  filter(NAME == "New South Wales" |
           NAME == "Victoria" |
           NAME == "Australian Capital Territory")

ggplot() +
  geom_sf(data = oz_states) +
  coord_sf() +
  geom_point(data = oz_counts_year,
             mapping = aes(x = lon,
                           y = lat,
                           size = hours),
             color = "blue",
             shape = 21) +
  facet_wrap(~ year) +
  theme_bw()
```
- Creating the world map was similar except I didn’t group by year and I kept the runs with positive latitude.

```r
run_counts <- runs %>%
  group_by(approx_lat, approx_lon) %>%
  summarise(hours = sum(hours),
            lat = mean(lat),
            lon = mean(lon),
            .groups = "drop") %>%
  select(!approx_lat & !approx_lon)

world <- map_data("world")
ggplot() +
  geom_polygon(data = world,
               mapping = aes(x = long, y = lat, group = group),
               fill = "lightgrey",
               color = "black",
               size = 0.1) +
  geom_point(data = run_counts,
             mapping = aes(x = lon,
                           y = lat),
             size = 2,
             shape = 1,
             color = "blue") +
  theme_bw()
```