Research Plumbing II

More things you never worry about until they break.

Francis J. DiTraglia

University of Oxford

If Research Plumbing were a Trilogy:

Merging Data

Joins - A Simple Example

Common task: merging or joining multiple datasets.
gradebook contains student id, name, and grades
emails contains student id and email addresses
Merge them together by the common student id variable
dplyr mutating joins

`gradebook`

library(tidyverse)
set.seed(92815)

gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561), 
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 
           'Ethelburga', 'Felix'),
  quiz1 = round(rnorm(6, 65, 15)), quiz2 = round(rnorm(6, 88, 5)),
  quiz3 = round(rnorm(6, 75, 10)), midterm1 = round(rnorm(6, 75, 10)),
  midterm2 = round(rnorm(6, 80, 8)), final = round(rnorm(6, 78, 11)))

gradebook

# A tibble: 6 × 8
  student_id name       quiz1 quiz2 quiz3 midterm1 midterm2 final
       <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>
1     192297 Alice         64    96    68       81       90    99
2     291857 Bob           58    91    91       75       75    79
3     500286 Charlotte     70    94    71       81       70    74
4     449192 Dante         57    85    84       83       94    83
5     372152 Ethelburga    74    91    70       63       73    96
6     627561 Felix         77    86    68       78       83    75

`emails`

emails <- tibble(
  student_id = c(101198, 192297, 372152, 918276, 291857), 
  email = c('unclejoe@whitehouse.gov', 'alice.liddell@chch.ox.ac.uk',
            'ethelburga@lyminge.org', 'mzuckerberg@gmail.com',
            'microsoftbob@hotmail.com'))
emails

# A tibble: 5 × 2
  student_id email                      
       <dbl> <chr>                      
1     101198 unclejoe@whitehouse.gov    
2     192297 alice.liddell@chch.ox.ac.uk
3     372152 ethelburga@lyminge.org     
4     918276 mzuckerberg@gmail.com      
5     291857 microsoftbob@hotmail.com

`dplyr` Mutating Joins

dplyr features several different kinds of joins
The right one depends on what you’re trying to accomplish
All join functions have two required arguments:
- x and y are the tibbles we will merge
left_join(x, y) returns tibble with same # of rows as x
You will learn about the other joins as an exercise.

`left_join()`

Merge gradebook and emails on the common student_id column

left_join(gradebook, emails)

# A tibble: 6 × 9
  student_id name       quiz1 quiz2 quiz3 midterm1 midterm2 final email         
       <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl> <chr>         
1     192297 Alice         64    96    68       81       90    99 alice.liddell…
2     291857 Bob           58    91    91       75       75    79 microsoftbob@…
3     500286 Charlotte     70    94    71       81       70    74 <NA>          
4     449192 Dante         57    85    84       83       94    83 <NA>          
5     372152 Ethelburga    74    91    70       63       73    96 ethelburga@ly…
6     627561 Felix         77    86    68       78       83    75 <NA>

Returns tibble with same number of rows as gradebook
Retains students whose id appears in gradebook but not email.
Drops students whose id appears in emails but not gradebook.
Unknown email addresses become NA.

💪 Exercise A - (10 min)

Answer the following, consulting the dplyr help files as needed. First copy, paste and run my code to generate gradebook four slides previous.

Run right_join(gradebook, emails). What happens? Explain.
Run full_join(gradebook, emails). What happens? Explain.
Run inner_join(gradebook, emails). What happens? Explain.
Above I ran left_join(gradebook, emails). How could I have used the pipe?
Add a column called name to the emails tibble, containing the following names in order: c('Joe', 'Alice', 'Ethelburga', 'Mark', 'Bob'). Then use a left join to merge gradebook with emails. What happens? Now try setting the parameter by = 'student_id'. What changes?

Reshaping Data

Pivoting: From Wide to Long and Back

Sometimes need to reshape data before plotting or analysis.
pivot_longer() and pivot_wider() from tidyr.
tidyr is part of the tidyverse: already installed
For more details see this article on pivoting.

`pivot_longer()`

How to make a time series plot of quiz scores for each student in gradebook?
Need the dataset to be arranged like a panel–sequence of observations for same individual made over time–but that’s not how gradebook is set up.
Need to increase the number of rows to each student appears multiple times, once for each quiz.

`pivot_longer()`

gradebook |> 
  pivot_longer(starts_with('quiz'), names_to = 'quiz', 
               values_to = 'score') |> 
  select(student_id, name, quiz, score)

# A tibble: 18 × 4
   student_id name       quiz  score
        <dbl> <chr>      <chr> <dbl>
 1     192297 Alice      quiz1    64
 2     192297 Alice      quiz2    96
 3     192297 Alice      quiz3    68
 4     291857 Bob        quiz1    58
 5     291857 Bob        quiz2    91
 6     291857 Bob        quiz3    91
 7     500286 Charlotte  quiz1    70
 8     500286 Charlotte  quiz2    94
 9     500286 Charlotte  quiz3    71
10     449192 Dante      quiz1    57
11     449192 Dante      quiz2    85
12     449192 Dante      quiz3    84
13     372152 Ethelburga quiz1    74
14     372152 Ethelburga quiz2    91
15     372152 Ethelburga quiz3    70
16     627561 Felix      quiz1    77
17     627561 Felix      quiz2    86
18     627561 Felix      quiz3    68

`pivot_longer()`

gradebook |> 
  pivot_longer(starts_with('quiz'), names_to = 'quiz', 
               values_to = 'score') |> 
  select(student_id, name, quiz, score)

Arguments:
- data is the tibble we will pivot; here supplied by |>
- cols chooses the columns to pivot (tidyselect!)
- names_to tells how to handle column names from cols
- values_to is the column name in which to store values from cols
What’s going on here?
- Start with three separate columns: quiz1, quiz2, quiz3
- End up with two columns quiz and score but three times as many rows
- The column names from cols are like the “time periods” in the panel

`names_prefix`

gradebook |> 
  pivot_longer(starts_with('quiz'), 
               names_to = 'quiz', 
               names_prefix = 'quiz',
               values_to = 'score') |> 
  select(student_id, name, quiz, score)

# A tibble: 18 × 4
   student_id name       quiz  score
        <dbl> <chr>      <chr> <dbl>
 1     192297 Alice      1        64
 2     192297 Alice      2        96
 3     192297 Alice      3        68
 4     291857 Bob        1        58
 5     291857 Bob        2        91
 6     291857 Bob        3        91
 7     500286 Charlotte  1        70
 8     500286 Charlotte  2        94
 9     500286 Charlotte  3        71
10     449192 Dante      1        57
11     449192 Dante      2        85
12     449192 Dante      3        84
13     372152 Ethelburga 1        74
14     372152 Ethelburga 2        91
15     372152 Ethelburga 3        70
16     627561 Felix      1        77
17     627561 Felix      2        86
18     627561 Felix      3        68

`names_transform`

gradebook |> 
  pivot_longer(starts_with('quiz'), 
               names_to = 'quiz', 
               names_prefix = 'quiz',
               names_transform = list(quiz = as.numeric),
               values_to = 'score') |> 
  select(student_id, name, quiz, score)

# A tibble: 18 × 4
   student_id name        quiz score
        <dbl> <chr>      <dbl> <dbl>
 1     192297 Alice          1    64
 2     192297 Alice          2    96
 3     192297 Alice          3    68
 4     291857 Bob            1    58
 5     291857 Bob            2    91
 6     291857 Bob            3    91
 7     500286 Charlotte      1    70
 8     500286 Charlotte      2    94
 9     500286 Charlotte      3    71
10     449192 Dante          1    57
11     449192 Dante          2    85
12     449192 Dante          3    84
13     372152 Ethelburga     1    74
14     372152 Ethelburga     2    91
15     372152 Ethelburga     3    70
16     627561 Felix          1    77
17     627561 Felix          2    86
18     627561 Felix          3    68

Making the plot

gradebook |> 
  pivot_longer(starts_with('quiz'), 
               names_to = 'quiz', 
               names_prefix = 'quiz',
               names_transform = list(quiz = as.numeric),
               values_to = 'score') |> 
  select(student_id, name, quiz, score) |> 
  ggplot(aes(x = quiz, y = score)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ name) +
  theme_minimal()

Making the plot

`pivot_wider()`

pivot_wider() and pivot_longer() are inverses
- pivot_longer() creates a tibble with more rows
- pivot_wider() creates a tibble with fewer rows

First store and display our results from above:

quiz_scores <- gradebook |> 
  pivot_longer(starts_with('quiz'), 
               names_to = 'quiz', 
               names_prefix = 'quiz',
               names_transform = list(quiz = as.numeric),
               values_to = 'score') |>  
  select(student_id, name, quiz, score)

quiz_scores

`pivot_wider()`

# A tibble: 18 × 4
   student_id name        quiz score
        <dbl> <chr>      <dbl> <dbl>
 1     192297 Alice          1    64
 2     192297 Alice          2    96
 3     192297 Alice          3    68
 4     291857 Bob            1    58
 5     291857 Bob            2    91
 6     291857 Bob            3    91
 7     500286 Charlotte      1    70
 8     500286 Charlotte      2    94
 9     500286 Charlotte      3    71
10     449192 Dante          1    57
11     449192 Dante          2    85
12     449192 Dante          3    84
13     372152 Ethelburga     1    74
14     372152 Ethelburga     2    91
15     372152 Ethelburga     3    70
16     627561 Felix          1    77
17     627561 Felix          2    86
18     627561 Felix          3    68

`pivot_wider()`

quiz_scores |> 
  pivot_wider(names_from = quiz, names_prefix = 'quiz', 
              values_from = score)

# A tibble: 6 × 5
  student_id name       quiz1 quiz2 quiz3
       <dbl> <chr>      <dbl> <dbl> <dbl>
1     192297 Alice         64    96    68
2     291857 Bob           58    91    91
3     500286 Charlotte     70    94    71
4     449192 Dante         57    85    84
5     372152 Ethelburga    74    91    70
6     627561 Felix         77    86    68

names_from and values_from work in the reverse way in pivot_wider() compared to pivot_longer()
names_prefix is removed by pivot_longer() but added by pivot_wider()

💪 Exercise B - (15 min)

Try dropping names_prefix in the preceding example. What happens and why?
Read the help file for billboard from tidyr. Then use pivot_longer() to convert it to a “panel data layout,” as we did with gradebook above.
Use pivot_wider() to reverse your pivot_longer() transformation from part 2.
Use pivoting to plot kernel density estimates of kid.score and mom.iq from kids in a single graph.
Add a column to gradebook called quiz_avg that equals a student’s average across the three quizzes dropping the lowest score. Hint: pivot twice.

Parallel computing with `furrr`

😳 Embarassingly Parallel Problems

Serial: one step at a time; Parallel: many steps at once.
The CPU in your laptop almost certainly has multiple cores, smaller processors within the CPU.
If so, your CPU can execute multiple instructions at the same time, i.e. in parallel, by spreading them across cores.
A computing problem is called embarassingly parallel if it is really easy to convert from serial to parallel.
Typically: many independent tasks and little or no need to exchange information between the tasks.
Leading example: Monte Carlo simulations.

Many hands make light work. 👐 👐

Two common approaches:
1. Fix parameters; spread sim reps across cores. Repeat.
2. Spread parameters across cores; each core runs all reps.
To get speedup from (1), each sim rep must be slow.
Can always get gains from (2); use a finer parameter grid!
Doubling # of cores generally doesn’t halve the run time:
- Amdahl’s Law: only part of your code is parallel.
- Load Balancing: if a core has less work, it finishes early
- Communication overhead, hard/software limitations

`future` and `furrr`

future provides asynchronous evaluation of R expressions:
- multisession: background R sessions on current machine
- multicore: forked R processes on current machine
- cluster: external R sessions on current/ local / remote machines
furrr uses future to run purrr commands in parallel.
- Simply prefix purrr functions with future_
- E.g. future_map() or future_pmap()

The Simplest Example of `furrr`

library(tictoc) # for easy timing of code chunks
library(furrr)
wait_2_seconds <- function() {
  Sys.sleep(2) 
  'Done waiting!'
}
# Run in serial
tic()
map_chr(1:4, \(i) wait_2_seconds())

[1] "Done waiting!" "Done waiting!" "Done waiting!" "Done waiting!"

toc()

8.012 sec elapsed

# Run in parallel
plan(multisession, workers = 4)
tic()
future_map_chr(1:4, \(i) wait_2_seconds())

[1] "Done waiting!" "Done waiting!" "Done waiting!" "Done waiting!"

toc()

2.869 sec elapsed

Recall: MLE Bias Simulation

set.seed(1066)

# Function to simulate data
draw_sim_data <- function(n, s_sq) rnorm(n, sd = sqrt(s_sq))

# Function to compute the "usual" and MLE estimates
get_estimates <- function(x) {
  c('usual' = var(x), 'MLE' = mean((x - mean(x))^2)) 
}

# Function to run simulation at specified parameter values: n, s_sq
run_sim <- function(n, s_sq, nreps = 5000) { 
  sim_datasets <- map(1:nreps, \(i) draw_sim_data(n, s_sq))  
  map_dfr(sim_datasets, get_estimates)
}

Recall: MLE Bias Simulation

# Run the simulation over a grid of parameter values
sim_params <- expand_grid(n = 3:5, s_sq = 1:3)
sim_results <- pmap(sim_params, run_sim)

# Function to summarize results of simulation at a given grid point
get_summary_stats <- function(df) {
  df |> 
    summarize(usual_mean = mean(usual), 
              MLE_mean = mean(MLE), 
              usual_var = var(usual), 
              MLE_var = var(MLE))
}

# Create table with summary results and parameter values 
summary_stats <- sim_results |>
  map_dfr(get_summary_stats) |>
  bind_cols(sim_params)

summary_stats

# A tibble: 9 × 6
  usual_mean MLE_mean usual_var MLE_var     n  s_sq
       <dbl>    <dbl>     <dbl>   <dbl> <int> <int>
1      1.01     0.677     1.05    0.468     3     1
2      2.00     1.33      4.08    1.81      3     2
3      3.01     2.01      9.17    4.08      3     3
4      0.989    0.742     0.639   0.359     4     1
5      2.00     1.50      2.61    1.47      4     2
6      2.99     2.25      6.34    3.57      4     3
7      0.991    0.793     0.503   0.322     5     1
8      2.02     1.61      2.10    1.34      5     2
9      3.02     2.41      4.56    2.92      5     3

How will we parallelize this?

The simulation for each point in our parameter grid, n, s_sq, is independent of the others.
Run them in parallel using future_pmap()

`set.seed()` – Serial versus Parallel

The Mersenne Twister can only run in serial
Parallel simulations require parallel pseudo-RNG
Such things exist, but use different algorithms.
Can still set the seed, but serial and parallel will not agree
More discussion here

Timing Comparison: Serial vs 4 Workers

sim_params <- expand_grid(n = 3:12, s_sq = 1:10)
tic()
sim_results <- pmap(sim_params, run_sim)
toc()

12.34 sec elapsed

plan(multisession, workers = 4)
set.seed(1066)
my_options <- furrr_options(seed = TRUE)
tic()
sim_results <- future_pmap(sim_params, run_sim, .options = my_options)
toc()

4.972 sec elapsed

💪 Exercise C - (\(\infty\) minutes)

Use rmvnorm() to write a function that generates n draws from a bivariate standard normal distribution with correlation coefficient r. Check you work by generating a large number of simulations and calculating the sample variance-covariance matrix.
The function cov() calculates the sample covariance between \(X\) and \(Y\) as \(S_{xy} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})\). In contrast, the MLE \(\widehat{\sigma}_{xy}\) for jointly normal \((X_i, Y_i)\) divides by \(n\) rather than \((n - 1)\). Write a function that takes a matrix with two columns and n rows as its input and calculates \(\widehat{\sigma}_{xy}\).
Use the functions you wrote in the preceding two parts to carry out a simulation study investigating the bias of \(\widehat{\sigma}_{xy}\). Use 5000 replications and a parameter grid of n\(\in \{5, 10, 15, 20, 25\}\), r\(\in \{-0.5, 0.25, 0, 0.25, 0.5\}\). Try to run it in parallel. Summarize your findings.

Research Plumbing II

If Research Plumbing were a Trilogy:

Merging Data

Joins - A Simple Example

gradebook

emails

dplyr Mutating Joins

left_join()

💪 Exercise A - (10 min)

Reshaping Data

Pivoting: From Wide to Long and Back

pivot_longer()

pivot_longer()

pivot_longer()

names_prefix

names_transform

Making the plot

Making the plot

pivot_wider()

pivot_wider()

pivot_wider()

💪 Exercise B - (15 min)

Parallel computing with furrr

😳 Embarassingly Parallel Problems

Many hands make light work. 👐 👐

future and furrr

The Simplest Example of furrr

Recall: MLE Bias Simulation

Recall: MLE Bias Simulation

How will we parallelize this?

set.seed() – Serial versus Parallel

Timing Comparison: Serial vs 4 Workers

💪 Exercise C - (\(\infty\) minutes)

`gradebook`

`emails`

`dplyr` Mutating Joins

`left_join()`

`pivot_longer()`

`pivot_longer()`

`pivot_longer()`

`names_prefix`

`names_transform`

`pivot_wider()`

`pivot_wider()`

`pivot_wider()`

Parallel computing with `furrr`

`future` and `furrr`

The Simplest Example of `furrr`

`set.seed()` – Serial versus Parallel