Research Plumbing II

More things you never worry about until they break.

Francis J. DiTraglia

University of Oxford

If Research Plumbing were a Trilogy:

Merging Data

Joins - A Simple Example

  • Common task: merging or joining multiple datasets.
  • gradebook contains student id, name, and grades
  • emails contains student id and email addresses
  • Merge them together by the common student id variable
  • dplyr mutating joins

gradebook

library(tidyverse)
set.seed(92815)

gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561), 
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 
           'Ethelburga', 'Felix'),
  quiz1 = round(rnorm(6, 65, 15)), quiz2 = round(rnorm(6, 88, 5)),
  quiz3 = round(rnorm(6, 75, 10)), midterm1 = round(rnorm(6, 75, 10)),
  midterm2 = round(rnorm(6, 80, 8)), final = round(rnorm(6, 78, 11)))

gradebook
# A tibble: 6 × 8
  student_id name       quiz1 quiz2 quiz3 midterm1 midterm2 final
       <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>
1     192297 Alice         64    96    68       81       90    99
2     291857 Bob           58    91    91       75       75    79
3     500286 Charlotte     70    94    71       81       70    74
4     449192 Dante         57    85    84       83       94    83
5     372152 Ethelburga    74    91    70       63       73    96
6     627561 Felix         77    86    68       78       83    75

emails

emails <- tibble(
  student_id = c(101198, 192297, 372152, 918276, 291857), 
  email = c('unclejoe@whitehouse.gov', 'alice.liddell@chch.ox.ac.uk',
            'ethelburga@lyminge.org', 'mzuckerberg@gmail.com',
            'microsoftbob@hotmail.com'))
emails
# A tibble: 5 × 2
  student_id email                      
       <dbl> <chr>                      
1     101198 unclejoe@whitehouse.gov    
2     192297 alice.liddell@chch.ox.ac.uk
3     372152 ethelburga@lyminge.org     
4     918276 mzuckerberg@gmail.com      
5     291857 microsoftbob@hotmail.com   

dplyr Mutating Joins

  • dplyr features several different kinds of joins
  • The right one depends on what you’re trying to accomplish
  • All join functions have two required arguments:
    • x and y are the tibbles we will merge
  • left_join(x, y) returns tibble with same # of rows as x
  • You will learn about the other joins as an exercise.

left_join()

Merge gradebook and emails on the common student_id column

left_join(gradebook, emails)
# A tibble: 6 × 9
  student_id name       quiz1 quiz2 quiz3 midterm1 midterm2 final email         
       <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl> <chr>         
1     192297 Alice         64    96    68       81       90    99 alice.liddell…
2     291857 Bob           58    91    91       75       75    79 microsoftbob@…
3     500286 Charlotte     70    94    71       81       70    74 <NA>          
4     449192 Dante         57    85    84       83       94    83 <NA>          
5     372152 Ethelburga    74    91    70       63       73    96 ethelburga@ly…
6     627561 Felix         77    86    68       78       83    75 <NA>          
  • Returns tibble with same number of rows as gradebook
  • Retains students whose id appears in gradebook but not email.
  • Drops students whose id appears in emails but not gradebook.
  • Unknown email addresses become NA.

💪 Exercise A - (10 min)

Answer the following, consulting the dplyr help files as needed.

  1. Run right_join(gradebook, emails). What happens? Explain.
  2. Run full_join(gradebook, emails). What happens? Explain.
  3. Run inner_join(gradebook, emails). What happens? Explain.
  4. Above I ran left_join(gradebook, emails). How could I have used the pipe?
  5. Add a column called name to the emails tibble, containing the following names in order: c('Joe', 'Alice', 'Ethelburga', 'Mark', 'Bob'). Then use a left join to merge gradebook with emails. What happens? Now try setting the parameter by = 'student_id'. What changes?

Reshaping Data

Pivoting: From Wide to Long and Back

  • Sometimes need to reshape data before plotting or analysis.
  • pivot_longer() and pivot_wider() from tidyr.
  • tidyr is part of the tidyverse: already installed
  • For more details see this article on pivoting.

pivot_longer()

  • How to make a time series plot of quiz scores for each student in gradebook?
  • Need the dataset to be arranged like a panel–sequence of observations for same individual made over time–but that’s not how gradebook is set up.
  • Need to increase the number of rows to each student appears multiple times, once for each quiz.

pivot_longer()

gradebook |> 
  pivot_longer(starts_with('quiz'), names_to = 'quiz', 
               values_to = 'score') |> 
  select(student_id, name, quiz, score)
# A tibble: 18 × 4
   student_id name       quiz  score
        <dbl> <chr>      <chr> <dbl>
 1     192297 Alice      quiz1    64
 2     192297 Alice      quiz2    96
 3     192297 Alice      quiz3    68
 4     291857 Bob        quiz1    58
 5     291857 Bob        quiz2    91
 6     291857 Bob        quiz3    91
 7     500286 Charlotte  quiz1    70
 8     500286 Charlotte  quiz2    94
 9     500286 Charlotte  quiz3    71
10     449192 Dante      quiz1    57
11     449192 Dante      quiz2    85
12     449192 Dante      quiz3    84
13     372152 Ethelburga quiz1    74
14     372152 Ethelburga quiz2    91
15     372152 Ethelburga quiz3    70
16     627561 Felix      quiz1    77
17     627561 Felix      quiz2    86
18     627561 Felix      quiz3    68

pivot_longer()

gradebook |> 
  pivot_longer(starts_with('quiz'), names_to = 'quiz', 
               values_to = 'score') |> 
  select(student_id, name, quiz, score)
  • Arguments:
    • data is the tibble we will pivot; here supplied by |>
    • cols chooses the columns to pivot (tidyselect!)
    • names_to tells how to handle column names from cols
    • values_to is the column name in which to store values from cols
  • What’s going on here?
    • Start with three separate columns: quiz1, quiz2, quiz3
    • End up with two columns quiz and score but three times as many rows
    • The column names from cols are like the “time periods” in the panel

names_prefix

gradebook |> 
  pivot_longer(starts_with('quiz'), 
               names_to = 'quiz', 
               names_prefix = 'quiz',
               values_to = 'score') |> 
  select(student_id, name, quiz, score)
# A tibble: 18 × 4
   student_id name       quiz  score
        <dbl> <chr>      <chr> <dbl>
 1     192297 Alice      1        64
 2     192297 Alice      2        96
 3     192297 Alice      3        68
 4     291857 Bob        1        58
 5     291857 Bob        2        91
 6     291857 Bob        3        91
 7     500286 Charlotte  1        70
 8     500286 Charlotte  2        94
 9     500286 Charlotte  3        71
10     449192 Dante      1        57
11     449192 Dante      2        85
12     449192 Dante      3        84
13     372152 Ethelburga 1        74
14     372152 Ethelburga 2        91
15     372152 Ethelburga 3        70
16     627561 Felix      1        77
17     627561 Felix      2        86
18     627561 Felix      3        68

names_transform

gradebook |> 
  pivot_longer(starts_with('quiz'), 
               names_to = 'quiz', 
               names_prefix = 'quiz',
               names_transform = list(quiz = as.numeric),
               values_to = 'score') |> 
  select(student_id, name, quiz, score)
# A tibble: 18 × 4
   student_id name        quiz score
        <dbl> <chr>      <dbl> <dbl>
 1     192297 Alice          1    64
 2     192297 Alice          2    96
 3     192297 Alice          3    68
 4     291857 Bob            1    58
 5     291857 Bob            2    91
 6     291857 Bob            3    91
 7     500286 Charlotte      1    70
 8     500286 Charlotte      2    94
 9     500286 Charlotte      3    71
10     449192 Dante          1    57
11     449192 Dante          2    85
12     449192 Dante          3    84
13     372152 Ethelburga     1    74
14     372152 Ethelburga     2    91
15     372152 Ethelburga     3    70
16     627561 Felix          1    77
17     627561 Felix          2    86
18     627561 Felix          3    68

Making the plot

gradebook |> 
  pivot_longer(starts_with('quiz'), 
               names_to = 'quiz', 
               names_prefix = 'quiz',
               names_transform = list(quiz = as.numeric),
               values_to = 'score') |> 
  select(student_id, name, quiz, score) |> 
  ggplot(aes(x = quiz, y = score)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ name) +
  theme_minimal()

Making the plot

pivot_wider()

  • pivot_wider() and pivot_longer() are inverses
    • pivot_longer() creates a tibble with more rows
    • pivot_wider() creates a tibble with fewer rows

First store and display our results from above:

quiz_scores <- gradebook |> 
  pivot_longer(starts_with('quiz'), 
               names_to = 'quiz', 
               names_prefix = 'quiz',
               names_transform = list(quiz = as.numeric),
               values_to = 'score') |>  
  select(student_id, name, quiz, score)

quiz_scores

pivot_wider()

# A tibble: 18 × 4
   student_id name        quiz score
        <dbl> <chr>      <dbl> <dbl>
 1     192297 Alice          1    64
 2     192297 Alice          2    96
 3     192297 Alice          3    68
 4     291857 Bob            1    58
 5     291857 Bob            2    91
 6     291857 Bob            3    91
 7     500286 Charlotte      1    70
 8     500286 Charlotte      2    94
 9     500286 Charlotte      3    71
10     449192 Dante          1    57
11     449192 Dante          2    85
12     449192 Dante          3    84
13     372152 Ethelburga     1    74
14     372152 Ethelburga     2    91
15     372152 Ethelburga     3    70
16     627561 Felix          1    77
17     627561 Felix          2    86
18     627561 Felix          3    68

pivot_wider()

quiz_scores |> 
  pivot_wider(names_from = quiz, names_prefix = 'quiz', 
              values_from = score)
# A tibble: 6 × 5
  student_id name       quiz1 quiz2 quiz3
       <dbl> <chr>      <dbl> <dbl> <dbl>
1     192297 Alice         64    96    68
2     291857 Bob           58    91    91
3     500286 Charlotte     70    94    71
4     449192 Dante         57    85    84
5     372152 Ethelburga    74    91    70
6     627561 Felix         77    86    68
  • names_from and values_from work in the reverse way in pivot_wider() compared to pivot_longer()
  • names_prefix is removed by pivot_longer() but added by pivot_wider()

💪 Exercise B - (15 min)

  1. Try dropping names_prefix in the preceding example. What happens and why?
  2. Read the help file for billboard from tidyr. Then use pivot_longer() to convert it to a “panel data layout,” as we did with gradebook above.
  3. Use pivot_wider() to reverse your pivot_longer() transformation from part 2.
  4. Use pivoting to plot kernel density estimates of kid.score and mom.iq from kids in a single graph.
  5. Add a column to gradebook called quiz_avg that equals a student’s average across the three quizzes dropping the lowest score. Hint: pivot twice.

Parallel computing with furrr

😳 Embarassingly Parallel Problems

  • Serial: one step at a time; Parallel: many steps at once.
  • The CPU in your laptop almost certainly has multiple cores, smaller processors within the CPU.
  • If so, your CPU can execute multiple instructions at the same time, i.e. in parallel, by spreading them across cores.
  • A computing problem is called embarassingly parallel if it is really easy to convert from serial to parallel.
  • Typically: many independent tasks and little or no need to exchange information between the tasks.
  • Leading example: Monte Carlo simulations.

Many hands make light work. 👐 👐

  • Two common approaches:
    1. Fix parameters; spread sim reps across cores. Repeat.
    2. Spread parameters across cores; each core runs all reps.
  • To get speedup from (1), each sim rep must be slow.
  • Can always get gains from (2); use a finer parameter grid!
  • Doubling # of cores generally doesn’t halve the run time:
    • Amdahl’s Law: only part of your code is parallel.
    • Load Balancing: if a core has less work, it finishes early
    • Communication overhead, hard/software limitations

future and furrr

  • future provides asynchronous evaluation of R expressions:
    • multisession: background R sessions on current machine
    • multicore: forked R processes on current machine
    • cluster: external R sessions on current/ local / remote machines
  • furrr uses future to run purrr commands in parallel.
    • Simply prefix purrr functions with future_
    • E.g. future_map() or future_pmap()

The Simplest Example of furrr

library(tictoc) # for easy timing of code chunks
library(furrr)
wait_2_seconds <- function() {
  Sys.sleep(2) 
  'Done waiting!'
}
# Run in serial
tic()
map_chr(1:4, \(i) wait_2_seconds())
[1] "Done waiting!" "Done waiting!" "Done waiting!" "Done waiting!"
toc()
8.017 sec elapsed
# Run in parallel
plan(multisession, workers = 4)
tic()
future_map_chr(1:4, \(i) wait_2_seconds())
[1] "Done waiting!" "Done waiting!" "Done waiting!" "Done waiting!"
toc()
3.208 sec elapsed

Recall: MLE Bias Simulation

set.seed(1066)

draw_sim_data <- \(n, s_sq) rnorm(n, sd = sqrt(s_sq))

get_estimates <- \(x) c('usual' = var(x), 'MLE' = mean((x - mean(x))^2)) 

run_sim <- \(n, s_sq, nreps = 5000) { 
  map(1:nreps, \(i) draw_sim_data(n, s_sq)) |> 
    map_dfr(get_estimates)
}

get_summary_stats <- function(df) {
  c('usual_mean' = mean(df$usual), 'MLE_mean' = mean(df$MLE), 
    'usual_var' = var(df$usual),'MLE_var' = var(df$MLE))
}

sim_params <- expand_grid(n = 3:5, s_sq = 1:3)
sim_results <- pmap(sim_params, run_sim)
summary_stats <- map_dfr(sim_results, get_summary_stats)
summary_stats <- bind_cols(sim_params, summary_stats)

summary_stats
# A tibble: 9 × 6
      n  s_sq usual_mean MLE_mean usual_var MLE_var
  <int> <int>      <dbl>    <dbl>     <dbl>   <dbl>
1     3     1      1.01     0.677     1.05    0.468
2     3     2      2.00     1.33      4.08    1.81 
3     3     3      3.01     2.01      9.17    4.08 
4     4     1      0.989    0.742     0.639   0.359
5     4     2      2.00     1.50      2.61    1.47 
6     4     3      2.99     2.25      6.34    3.57 
7     5     1      0.991    0.793     0.503   0.322
8     5     2      2.02     1.61      2.10    1.34 
9     5     3      3.02     2.41      4.56    2.92 
flowchart TD
  A[sim_params] -->|"Iterate over each param combo with pmap()"| B["run_sim(n, s_sq)"]
  B -->|"Repeat nreps times with map()"| C["draw_sim_data(n, s_sq)"]
  C -->|"Iterate with map_dfr()"| D["get_estimates()"]
  D -->|tibble of nreps Usual and MLE estimates| B
  B -->|List of sim results for all param combos| E[sim_results]
E -->|"Iterate over list with map_dfr()"| F["get_summary_stats()"]
  F -->|tibble of means / vars for Usual and MLE| G[summary_stats]
  G -->|Column bind stats and params| H[Final results]
  %% Size of final output = size of sim_params
  H -.->|#rows matches sim_params| A
flowchart TD
  A[sim_params] -->|Distribute params: n, s_sq to Workers| B["future_pmap()"]
  B -->|Worker 1: n_1, s_sq_1| C["run_sim(n_1, s_sq_1)"]
  B -->|Worker 2: n_2, s_sq_2| D["run_sim(n_2, s_sq_2)"]
  B -->|Worker 3: n_3, s_sq_3| E["run_sim(n_3, s_sq_3)"]
  B -->|...| Z["run_sim(n_N, s_sq_N)"]
  C -->|Collect sim results| I[sim_results]
  D -->|Collect sim results| I
  E -->|Collect sim results| I
  Z -->|Collect sim results| I
  I -->|"Iterate over list with map_dfr()"| F["get_summary_stats()"]
  F -->|tibble of means / vars for Usual and MLE| G[summary_stats]
  G -->|Column bind stats and parameters| H[Final results]

set.seed() – Serial versus Parallel

  • The Mersenne Twister can only run in serial
  • Parallel simulations require parallel pseudo-RNG
  • Such things exist, but use different algorithms.
  • Can still set the seed, but serial and parallel will not agree
  • More discussion here

Timing Comparison: Serial vs 4 Workers

sim_params <- expand_grid(n = 3:12, s_sq = 1:10)
tic()
sim_results <- pmap(sim_params, run_sim)
toc()
18.275 sec elapsed
plan(multisession, workers = 4)
set.seed(1066)
my_options <- furrr_options(seed = TRUE)
tic()
sim_results <- future_pmap(sim_params, run_sim, .options = my_options)
toc()
6.511 sec elapsed

💪 Exercise C - (\(\infty\) minutes)

  1. Use rmvnorm() to write a function that generates n draws from a bivariate standard normal distribution with correlation coefficient r. Check you work by generating a large number of simulations and calculating the sample variance-covariance matrix.

  2. The function cov() calculates the sample covariance between \(X\) and \(Y\) as \(S_{xy} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})\). In contrast, the MLE \(\widehat{\sigma}_{xy}\) for jointly normal \((X_i, Y_i)\) divides by \(n\) rather than \((n - 1)\). Write a function that takes a matrix with two columns and n rows as its input and calculates \(\widehat{\sigma}_{xy}\).

  3. Use the functions you wrote in the preceding two parts to carry out a simulation study investigating the bias of \(\widehat{\sigma}_{xy}\). Use 5000 replications and a parameter grid of n\(\in \{5, 10, 15, 20, 25\}\), `r``\(\in \{-0.5, 0.25, 0, 0.25, 0.5\}\). Try to run it in parallel. Summarize your findings.