# A tibble: 18 × 4
   student_id name        quiz score
        <dbl> <chr>      <dbl> <dbl>
 1     192297 Alice          1    64
 2     192297 Alice          2    96
 3     192297 Alice          3    68
 4     291857 Bob            1    58
 5     291857 Bob            2    91
 6     291857 Bob            3    91
 7     500286 Charlotte      1    70
 8     500286 Charlotte      2    94
 9     500286 Charlotte      3    71
10     449192 Dante          1    57
11     449192 Dante          2    85
12     449192 Dante          3    84
13     372152 Ethelburga     1    74
14     372152 Ethelburga     2    91
15     372152 Ethelburga     3    70
16     627561 Felix          1    77
17     627561 Felix          2    86
18     627561 Felix          3    68
Exercise A - (10 min)
Answer the following, consulting the dplyr help files as needed.
Run right_join(gradebook, emails). What happens? Explain.
Run full_join(gradebook, emails). What happens? Explain.
Run inner_join(gradebook, emails). What happens? Explain.
Above I ran left_join(gradebook, emails). How could I have used the pipe?
Add a column called name to the emails tibble, containing the following names in order: c('Joe', 'Alice', 'Ethelburga', 'Mark', 'Bob'). Then use a left join to merge gradebook with emails. What happens? Now try setting the parameter by = 'student_id'. What changes?
Solution
# Part 1
# The result contains students whose ids are in emails. Those with ids
# in gradebook who are *not* in emails are dropped.
right_join(gradebook, emails)
Joining with `by = join_by(student_id)`
# A tibble: 5 × 9
  student_id name       quiz1 quiz2 quiz3 midterm1 midterm2 final email
       <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl> <chr>
1     192297 Alice         64    96    68       81       90    99 alice.liddell…
2     291857 Bob           58    91    91       75       75    79 microsoftbob@…
3     372152 Ethelburga    74    91    70       63       73    96 ethelburga@ly…
4     101198 <NA>          NA    NA    NA       NA       NA    NA unclejoe@whit…
5     918276 <NA>          NA    NA    NA       NA       NA    NA mzuckerberg@g…
# Part 2
# The result contains everyone whose id appears in *either* dataset. This
# requires lots of padding out with missing values.
full_join(gradebook, emails)
Joining with `by = join_by(student_id)`
# A tibble: 8 × 9
  student_id name       quiz1 quiz2 quiz3 midterm1 midterm2 final email
       <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl> <chr>
1     192297 Alice         64    96    68       81       90    99 alice.liddell…
2     291857 Bob           58    91    91       75       75    79 microsoftbob@…
3     500286 Charlotte     70    94    71       81       70    74 <NA>
4     449192 Dante         57    85    84       83       94    83 <NA>
5     372152 Ethelburga    74    91    70       63       73    96 ethelburga@ly…
6     627561 Felix         77    86    68       78       83    75 <NA>
7     101198 <NA>          NA    NA    NA       NA       NA    NA unclejoe@whit…
8     918276 <NA>          NA    NA    NA       NA       NA    NA mzuckerberg@g…
# Part 3
# The result contains only those whose id appears in *both* datasets. Everyone
# else is dropped.
inner_join(gradebook, emails)
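The row counts in Parts 1 to 3 can be checked on a small stand-in dataset. This is a minimal sketch, not the tutorial's own data: `gradebook_toy` and `emails_toy` are hypothetical names, the ids and names follow the printed output above, and the email addresses are made-up placeholders because the real ones are truncated in the output. The last line also shows one way to write the left join with the pipe.

```r
library(dplyr)

# Stand-in data (assumption): ids and names copied from the printed output,
# email addresses are hypothetical placeholders.
gradebook_toy <- tibble::tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
  name = c("Alice", "Bob", "Charlotte", "Dante", "Ethelburga", "Felix")
)
emails_toy <- tibble::tibble(
  student_id = c(192297, 291857, 372152, 101198, 918276),
  email = paste0("person", 1:5, "@example.com")
)

# Setting `by` explicitly avoids the "Joining with ..." message.
right <- right_join(gradebook_toy, emails_toy, by = "student_id")
full  <- full_join(gradebook_toy, emails_toy, by = "student_id")
inner <- inner_join(gradebook_toy, emails_toy, by = "student_id")

# The same left join as left_join(gradebook_toy, emails_toy), written with the pipe.
left_piped <- gradebook_toy |> left_join(emails_toy, by = "student_id")

c(right = nrow(right), full = nrow(full), inner = nrow(inner), left = nrow(left_piped))
```

With these inputs the right join has 5 rows (one per id in `emails_toy`), the full join 8, the inner join 3, and the left join 6 (one per id in `gradebook_toy`).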
# Part 5
drop1_avg <- function(x) {
  # Calculate the mean of x dropping the lowest value
  x <- sort(x)
  mean(x[-1])
}

gradebook |>
  pivot_longer(starts_with('quiz'), names_to = 'quiz', values_to = 'score') |>
  group_by(name) |>
  mutate(quiz_avg = drop1_avg(score)) |>
  pivot_wider(names_from = 'quiz', values_from = 'score')
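To see what `drop1_avg()` does on its own, it can be applied to a single student's scores. The sketch below repeats the function so that it runs by itself and uses Alice's three quiz scores from the tibble at the top; the vector name `alice` is only for illustration.

```r
drop1_avg <- function(x) {
  # Calculate the mean of x dropping the lowest value
  x <- sort(x)
  mean(x[-1])
}

alice <- c(64, 96, 68)  # Alice's scores on quizzes 1 to 3
drop1_avg(alice)        # drops 64, then averages 68 and 96
```

This returns 82. Note that `sort()` removes missing values by default, so if a score is `NA` the function silently drops the lowest non-missing score instead of returning `NA`.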
Use rmvnorm() to write a function that generates n draws from a bivariate standard normal distribution with correlation coefficient r. Check your work by generating a large number of simulations and calculating the sample variance-covariance matrix.
The function cov() calculates the sample covariance between \(X\) and \(Y\) as \(S_{xy} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})\). In contrast, the MLE \(\widehat{\sigma}_{xy}\) for jointly normal \((X_i, Y_i)\) divides by \(n\) rather than \((n - 1)\). Write a function that takes a matrix with two columns and n rows as its input and calculates \(\widehat{\sigma}_{xy}\).
Use the functions you wrote in the preceding two parts to carry out a simulation study investigating the bias of \(\widehat{\sigma}_{xy}\). Use 5000 replications and a parameter grid of n \(\in \{5, 10, 15, 20, 25\}\) and r \(\in \{-0.5, -0.25, 0, 0.25, 0.5\}\). Try to run it in parallel. Summarize your findings.
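Because the two estimators in the second part differ only in their divisor, \(\widehat{\sigma}_{xy} = \frac{n-1}{n} S_{xy}\). The snippet below is a small sanity check of that identity in base R, not a solution to the exercise: it works on two arbitrary vectors rather than a two-column matrix, and the names `x`, `y`, `s_xy`, and `sigma_hat` are illustrative assumptions.

```r
set.seed(1)
n <- 10
x <- rnorm(n)  # arbitrary data, only used to check the identity
y <- rnorm(n)

s_xy <- cov(x, y)                                  # divides by n - 1
sigma_hat <- mean((x - mean(x)) * (y - mean(y)))   # divides by n

all.equal(sigma_hat, (n - 1) / n * s_xy)
```

The comparison returns `TRUE`, so a function written for the second part can be checked against `cov()` in the same way.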