Returns tibble with same number of rows as gradebook
Retains students whose id appears in gradebook but not emails.
Drops students whose id appears in emails but not gradebook.
Unknown email addresses become NA.
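To make the points above concrete, here is a minimal sketch of a left join on two small tibbles. The values in `gradebook` and `emails` below are hypothetical stand-ins invented for illustration, not the data used in the lecture.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: student 1 has no email on file,
# and student 5 has an email but no gradebook entry
gradebook <- tibble(
  student_id = c(1, 2, 3, 4),
  quiz1 = c(55, 90, 71, 38)
)
emails <- tibble(
  student_id = c(2, 3, 4, 5),
  email = c('a@example.com', 'b@example.com', 'c@example.com', 'd@example.com')
)

# Keep every row of gradebook; attach email where student_id matches
joined <- left_join(gradebook, emails, by = 'student_id')
joined
```

In this toy case `joined` has the same four rows as `gradebook`: student 1 is kept with `email` set to NA, and student 5 does not appear.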
💪 Exercise A - (10 min)
Answer the following, consulting the dplyr help files as needed.
Run right_join(gradebook, emails). What happens? Explain.
Run full_join(gradebook, emails). What happens? Explain.
Run inner_join(gradebook, emails). What happens? Explain.
Above I ran left_join(gradebook, emails). How could I have used the pipe?
Add a column called name to the emails tibble, containing the following names in order: c('Joe', 'Alice', 'Ethelburga', 'Mark', 'Bob'). Then use a left join to merge gradebook with emails. What happens? Now try setting the parameter by = 'student_id'. What changes?
Reshaping Data
Pivoting: From Wide to Long and Back
Sometimes need to reshape data before plotting or analysis.
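As a rough illustration of going from wide to long and back, the sketch below uses `pivot_longer()` and `pivot_wider()` from tidyr. The `wide` tibble and its column names are hypothetical, chosen only for this example.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per student, one column per quiz
wide <- tibble(
  student_id = c(1, 2),
  quiz1 = c(55, 90),
  quiz2 = c(63, 85)
)

# Wide to long: one row per student-quiz pair
long <- pivot_longer(
  wide,
  cols = c(quiz1, quiz2),
  names_to = 'quiz',
  values_to = 'score'
)

# Long back to wide: one column per distinct value of quiz
wide_again <- pivot_wider(long, names_from = quiz, values_from = score)
```

The long form is usually the more convenient shape for plotting, since the quiz label becomes an ordinary column that can be mapped to an aesthetic or used for grouping.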
Load Balancing: if a core has less work, it finishes early
Communication overhead, hardware/software limitations
future and furrr
future provides asynchronous evaluation of R expressions:
multisession: background R sessions on current machine
multicore: forked R processes on current machine
cluster: external R sessions on current, local, or remote machines
furrr uses future to run purrr commands in parallel.
Simply prefix purrr functions with future_
E.g. future_map() or future_pmap()
The Simplest Example of furrr
library(tictoc) # for easy timing of code chunks
library(purrr)  # for map_chr()
library(furrr)

wait_2_seconds <- function() {
  Sys.sleep(2)
  'Done waiting!'
}

# Run in serial
tic()
map_chr(1:4, \(i) wait_2_seconds())
toc()
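A parallel counterpart to the serial run might look like the following sketch. The `multisession` plan with two workers is an assumption made for illustration; the appropriate plan and worker count depend on the machine. With two workers sharing four 2-second calls, the elapsed time should be noticeably shorter than the serial run, although starting the background sessions adds some overhead.

```r
library(tictoc)
library(future) # provides plan()
library(furrr)

wait_2_seconds <- function() {
  Sys.sleep(2)
  'Done waiting!'
}

# Assumption: 2 background R sessions; adjust to the cores available
plan(multisession, workers = 2)

# Run in parallel: same call as map_chr(), prefixed with future_
tic()
results <- future_map_chr(1:4, \(i) wait_2_seconds())
toc()

# Return to sequential evaluation when finished
plan(sequential)
```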
flowchart TD
A[sim_params] -->|"Iterate over each param combo with pmap()"| B["run_sim(n, s_sq)"]
B -->|"Repeat nreps times with map()"| C["draw_sim_data(n, s_sq)"]
C -->|"Iterate with map_dfr()"| D["get_estimates()"]
D -->|tibble of nreps Usual and MLE estimates| B
B -->|List of sim results for all param combos| E[sim_results]
E -->|"Iterate over list with map_dfr()"| F["get_summary_stats()"]
F -->|tibble of means / vars for Usual and MLE| G[summary_stats]
G -->|Column bind stats and params| H[Final results]
%% Size of final output = size of sim_params
H -.->|#rows matches sim_params| A
flowchart TD
A[sim_params] -->|Distribute params: n, s_sq to Workers| B["future_pmap()"]
B -->|Worker 1: n_1, s_sq_1| C["run_sim(n_1, s_sq_1)"]
B -->|Worker 2: n_2, s_sq_2| D["run_sim(n_2, s_sq_2)"]
B -->|Worker 3: n_3, s_sq_3| E["run_sim(n_3, s_sq_3)"]
B -->|...| Z["run_sim(n_N, s_sq_N)"]
C -->|Collect sim results| I[sim_results]
D -->|Collect sim results| I
E -->|Collect sim results| I
Z -->|Collect sim results| I
I -->|"Iterate over list with map_dfr()"| F["get_summary_stats()"]
F -->|tibble of means / vars for Usual and MLE| G[summary_stats]
G -->|Column bind stats and parameters| H[Final results]
set.seed() – Serial versus Parallel
The Mersenne Twister can only run in serial
Parallel simulations require parallel pseudo-RNG
Such things exist, but use different algorithms.
Can still set the seed, but serial and parallel will not agree
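One way to get reproducible parallel draws is to pass a seed through `furrr_options()`, as in the sketch below. The seed value 123, the two-worker `multisession` plan, and the use of `rnorm()` are all arbitrary choices for illustration.

```r
library(future) # provides plan()
library(furrr)

plan(multisession, workers = 2)

# seed = 123 asks furrr to set up parallel-safe RNG streams from this seed
opts <- furrr_options(seed = 123)

# Two parallel runs with the same seed
draws_1 <- future_map_dbl(1:4, \(i) rnorm(1), .options = opts)
draws_2 <- future_map_dbl(1:4, \(i) rnorm(1), .options = opts)

# A serial run with the same seed under R's default RNG
set.seed(123)
serial <- purrr::map_dbl(1:4, \(i) rnorm(1))

plan(sequential)

identical(draws_1, draws_2) # compare the two parallel runs
```

The two parallel runs should agree with each other. The serial draws are not expected to match them, because they come from a different generator.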
Use rmvnorm() to write a function that generates n draws from a bivariate standard normal distribution with correlation coefficient r. Check your work by generating a large number of simulations and calculating the sample variance-covariance matrix.
The function cov() calculates the sample covariance between \(X\) and \(Y\) as \(S_{xy} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})\). In contrast, the MLE \(\widehat{\sigma}_{xy}\) for jointly normal \((X_i, Y_i)\) divides by \(n\) rather than \((n - 1)\). Write a function that takes a matrix with two columns and n rows as its input and calculates \(\widehat{\sigma}_{xy}\).
Use the functions you wrote in the preceding two parts to carry out a simulation study investigating the bias of \(\widehat{\sigma}_{xy}\). Use 5000 replications and a parameter grid of n \(\in \{5, 10, 15, 20, 25\}\), r \(\in \{-0.5, -0.25, 0, 0.25, 0.5\}\). Try to run it in parallel. Summarize your findings.