The Whole Kit & Caboodle
University of Oxford
purrr
We met purrr::map() in an earlier lecture. More today!

purrr::map(.x, .f)
map(.x, .f) replaces a for() loop:

\[
\texttt{map}(\mathbf{x}, f) = \texttt{map}\left(\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},\, f\right) = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix}
\]

- .x is a list or vector; .f is a function
- map(x, f) gives the same output as this for() loop:
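A minimal sketch of the loop (purrr's internals differ, but the result is the same):

out <- vector("list", length(x))  # pre-allocate a list of the right length
for (i in seq_along(x)) {
  out[[i]] <- f(x[[i]])           # apply f to each element in turn
}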
purrr::map() Example

sum_prod <- \(v) {
  # Return the sum and product of a vector v as a named vector
  c('sum' = sum(v), 'product' = prod(v))
}
x <- list(c(1, 1, 1), c(1, 2, 3), c(3, 3, 3))
library(tidyverse) # includes purrr
map(x, sum_prod)
[[1]]
sum product
3 1
[[2]]
sum product
6 6
[[3]]
sum product
9 27
Nice! But we may want to output something besides a list…
map_dfc() and map_dfr()

map_dfc() binds columns (c), returns a data frame (df):
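For example, with x and sum_prod from the previous slide (an inferred call; it reproduces the tibble below):

map_dfc(x, sum_prod)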
# A tibble: 2 × 3
...1 ...2 ...3
<dbl> <dbl> <dbl>
1 3 6 9
2 1 6 27
map_dfr() binds rows (r), returns a data frame (df):
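For example (an inferred call, mirroring the map_dfc() example above):

map_dfr(x, sum_prod)
# A tibble: 3 × 2
#     sum product
#   <dbl>   <dbl>
# 1     3       1
# 2     6       6
# 3     9      27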
map()
Here 1:5 is interpreted as though it were a list:
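For instance (a hypothetical illustration):

map(1:5, \(i) i^2)  # a list containing 1, 4, 9, 16, 25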
The same works with map_dfc(), map_dfr(), etc.
map_dbl(.x, .f) and friends

Use these when .f returns a scalar for each element of .x:

- map_dbl() returns a double vector
- map_chr() returns a character vector
- map_int() returns an integer vector
- map_lgl() returns a logical vector

map_xxx() Examples
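A few illustrative calls (hypothetical examples, reusing x from the sum_prod example; results as comments):

map_dbl(x, sum)                            # 3 6 9
map_chr(x, \(v) paste(v, collapse = "-"))  # "1-1-1" "1-2-3" "3-3-3"
map_int(x, length)                         # 3 3 3
map_lgl(x, \(v) all(v > 0))                # TRUE TRUE TRUE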
purrr::map2(.x, .y, .f)
\[ \texttt{map2}\left(\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},\, \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \, f\right) = \begin{bmatrix} f(x_1, y_1) \\ f(x_2, y_2) \\ \vdots \\ f(x_n, y_n) \end{bmatrix} \]
- .x and .y are vectors or lists of the same length
- .f is a function that takes two arguments
- map2_dfr(), map2_dfc(), map2_dbl(), etc. are analogous to the corresponding map() variants
pmap(.l, .f)
\[ \texttt{pmap}\left(\begin{bmatrix} l_{11} & l_{12} & \cdots & l_{1r} \\ l_{21} & l_{22} & \cdots & l_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nr} \end{bmatrix}, \, f\right) = \begin{bmatrix} f(l_{11}, l_{12}, \dots, l_{1r}) \\ f(l_{21}, l_{22}, \dots , l_{2r}) \\ \vdots \\ f(l_{n1}, l_{n2}, \dots, l_{nr}) \end{bmatrix} \]
- .l is a list of lists, or a list of vectors (e.g. a data frame)
- .f is a function that takes \(r\) arguments
- pmap_dfr(), pmap_dfc(), pmap_dbl(), etc. are analogous to the corresponding map() variants
map2() and pmap() Examples

Argument names are arbitrary:
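The output below matches a call along these lines (a hypothetical reconstruction; any argument names would do):

map2_dbl(1:5, 6:10, \(x, y) y^x)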
[1] 6 49 512 6561 100000
# A tibble: 5 × 3
col1 col2 col3
<int> <int> <int>
1 1 5 -2
2 2 4 -1
3 3 3 0
4 4 2 1
5 5 1 2
Argument names need to match the columns of df:
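Assuming df is the tibble printed above (col1 = 1:5, col2 = 5:1, col3 = -2:2), a call might look like this (a hypothetical example):

pmap_dbl(df, \(col1, col2, col3) col1 + col2 * col3)
# [1] -9 -2  3  6  7
# pmap() matches each column of df to the argument of .f with the same name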
Recall this code snippet from our earlier lecture on simulation:

- Use purrr to explain line 7.
- Write a for() loop to replace line 7.
- Allow n and s_sq to vary.

possibly()
possibly() modifies a function to fail gracefully, suppressing errors and storing a designated value in case of failure:
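A minimal sketch (a hypothetical example; log() throws an error on character input):

safe_log <- possibly(log, otherwise = NA_real_)  # return NA instead of erroring
safe_log(10)     # [1] 2.302585
safe_log("ten")  # [1] NA

# Generate list of three simulation datasets as an example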
set.seed(1693)
sim_datasets <- map(1:3, \(i) draw_sim_data(n = 5, s_sq = 1))
sim_datasets
[[1]]
[1] -0.5070343 -0.8750463 -0.1883818 -1.1772701 1.8960400
[[2]]
[1] 0.3824529 1.0291447 -1.3998332 1.3983002 -0.3950949
[[3]]
[1] 0.1204153 -0.9611377 -0.3870477 0.7757843 0.6520975
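Passing each dataset to get_estimates() produces the tibble below; the call is inferred from the nreps = 5000 version that follows:

map_dfr(sim_datasets, get_estimates)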
# A tibble: 3 × 2
usual MLE
<dbl> <dbl>
1 1.47 1.18
2 1.27 1.01
3 0.527 0.421
Now that we have the idea, do more replications!
set.seed(1693)
nreps <- 5000
sim_datasets <- map(1:nreps, \(i) draw_sim_data(n = 5, s_sq = 1))
sim_estimates <- map_dfr(sim_datasets, get_estimates)
sim_estimates
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 1.47 1.18
2 1.27 1.01
3 0.527 0.421
4 0.858 0.686
5 0.863 0.690
6 0.942 0.754
7 0.521 0.417
8 2.84 2.27
9 0.941 0.753
10 0.775 0.620
# ℹ 4,990 more rows
Using sim_estimates we can approximate the bias of \(\widehat{\sigma}^2\):
# Sim parameters: n = 5, s_sq = 1
n <- 5
s_sq <- 1
sim_estimates |>
  summarize(bias_true = -s_sq / n, bias_sim = mean(MLE - s_sq))
# A tibble: 1 × 2
bias_true bias_sim
<dbl> <dbl>
1 -0.2 -0.197
Great: our simulation appears to work!
Wrap step 3 from above into a function:
run_sim <- \(n, s_sq, nreps = 5000) { # default: nreps = 5000
  map(1:nreps, \(i) draw_sim_data(n, s_sq)) |>
    map_dfr(get_estimates)
}
set.seed(1693)
run_sim(n = 5, s_sq = 1)
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 1.47 1.18
2 1.27 1.01
3 0.527 0.421
4 0.858 0.686
5 0.863 0.690
6 0.942 0.754
7 0.521 0.417
8 2.84 2.27
9 0.941 0.753
10 0.775 0.620
# ℹ 4,990 more rows
- Use expand_grid() from tidyr to set up a parameter grid.
- Use purrr::pmap() to call run_sim() on each row of sim_params (see the sketch below).
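A sketch of these two steps (the parameter values are an assumption, inferred from the summary_stats table further down):

sim_params <- expand_grid(n = 3:5, s_sq = 1:3)  # 9 combinations of n and s_sq
sim_results <- pmap(sim_params, run_sim)  # sim_params's column names match run_sim()'s arguments
sim_results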
[[1]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 0.370 0.247
2 0.373 0.249
3 0.135 0.0903
4 2.95 1.97
5 0.0589 0.0393
6 0.975 0.650
7 1.63 1.09
8 1.30 0.865
9 1.53 1.02
10 0.823 0.548
# ℹ 4,990 more rows
[[2]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 0.551 0.368
2 0.513 0.342
3 0.478 0.319
4 1.54 1.03
5 0.410 0.274
6 2.15 1.43
7 0.447 0.298
8 0.145 0.0969
9 0.455 0.303
10 0.466 0.311
# ℹ 4,990 more rows
[[3]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 2.31 1.54
2 0.460 0.306
3 0.598 0.399
4 2.32 1.55
5 0.0177 0.0118
6 1.45 0.965
7 6.55 4.37
8 2.81 1.87
9 0.261 0.174
10 2.58 1.72
# ℹ 4,990 more rows
[[4]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 1.37 1.03
2 0.790 0.592
3 0.507 0.380
4 0.859 0.644
5 0.584 0.438
6 1.42 1.06
7 6.23 4.67
8 1.24 0.932
9 0.630 0.472
10 0.123 0.0924
# ℹ 4,990 more rows
[[5]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 2.99 2.24
2 0.845 0.633
3 0.635 0.476
4 1.59 1.20
5 3.36 2.52
6 1.64 1.23
7 0.685 0.514
8 9.98 7.49
9 1.95 1.46
10 1.25 0.934
# ℹ 4,990 more rows
[[6]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 5.39 4.04
2 1.98 1.48
3 3.48 2.61
4 5.56 4.17
5 3.45 2.59
6 3.66 2.75
7 1.36 1.02
8 0.597 0.448
9 2.49 1.87
10 9.44 7.08
# ℹ 4,990 more rows
[[7]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 0.649 0.519
2 0.0978 0.0782
3 0.988 0.791
4 0.944 0.755
5 0.661 0.528
6 1.37 1.09
7 1.17 0.938
8 1.32 1.05
9 0.998 0.798
10 2.02 1.61
# ℹ 4,990 more rows
[[8]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 2.25 1.80
2 0.337 0.270
3 1.69 1.35
4 2.05 1.64
5 1.48 1.18
6 0.946 0.757
7 4.99 3.99
8 0.865 0.692
9 0.635 0.508
10 3.32 2.66
# ℹ 4,990 more rows
[[9]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 0.637 0.510
2 3.90 3.12
3 2.11 1.69
4 5.30 4.24
5 2.28 1.82
6 2.07 1.65
7 3.19 2.56
8 0.742 0.594
9 6.42 5.13
10 2.55 2.04
# ℹ 4,990 more rows
# Compute the mean and variance of each estimator across replications
get_summary_stats <- function(df) {
  c('usual_mean' = mean(df$usual),
    'MLE_mean' = mean(df$MLE),
    'usual_var' = var(df$usual),
    'MLE_var' = var(df$MLE))
}
summary_stats <- map_dfr(sim_results, get_summary_stats)
summary_stats <- bind_cols(sim_params, summary_stats)
summary_stats
# A tibble: 9 × 6
n s_sq usual_mean MLE_mean usual_var MLE_var
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 3 1 0.990 0.660 0.966 0.429
2 3 2 1.96 1.31 3.82 1.70
3 3 3 2.93 1.95 8.58 3.81
4 4 1 0.995 0.746 0.688 0.387
5 4 2 2.00 1.50 2.60 1.46
6 4 3 2.95 2.21 5.80 3.26
7 5 1 0.997 0.797 0.484 0.310
8 5 2 2.02 1.62 2.09 1.34
9 5 3 3.03 2.42 4.71 3.01
summary_stats |>
  mutate(MLE_bias = -s_sq / n,
         MLE_sim_bias = MLE_mean - s_sq,
         usual_sim_bias = usual_mean - s_sq,
         MLE_rmse = sqrt(MLE_sim_bias^2 + MLE_var),
         usual_rmse = sqrt(usual_sim_bias^2 + usual_var)) |>
  select(MLE_bias, MLE_sim_bias, MLE_rmse, usual_rmse)
# A tibble: 9 × 4
MLE_bias MLE_sim_bias MLE_rmse usual_rmse
<dbl> <dbl> <dbl> <dbl>
1 -0.333 -0.340 0.738 0.983
2 -0.667 -0.691 1.48 1.96
3 -1 -1.05 2.22 2.93
4 -0.25 -0.254 0.672 0.829
5 -0.5 -0.498 1.31 1.61
6 -0.75 -0.789 1.97 2.41
7 -0.2 -0.203 0.592 0.696
8 -0.4 -0.383 1.22 1.45
9 -0.6 -0.578 1.83 2.17
draw_sim_data() and get_estimates() … within draw_sim_data()
Amos [Tversky] and his students Tom Gilovich and Robert Vallone caused a stir with their study of misperceptions of randomness in basketball. The “fact” that players occasionally acquire a hot hand is generally accepted by players, coaches, and fans. The inference is irresistible: a player sinks three or four baskets in a row and you cannot help forming the causal judgment that this player is now hot, with a temporarily increased propensity to score. Players on both teams adapt to this judgment—teammates are more likely to pass to the hot scorer and the defense is more likely to double-team.
Analysis of thousands of sequences of shots led to a disappointing conclusion: there is no such thing as a hot hand in professional basketball, either in shooting from the field or scoring from the foul line. Of course, some players are more accurate than others, but the sequence of successes and missed shots satisfies all tests of randomness. The hot hand is entirely in the eye of the beholders, who are consistently too quick to perceive order and causality in randomness. The hot hand is a massive and widespread cognitive illusion.
shots is a vector of zeros and ones: shots made by a basketball player.

- Hot hand: a 1 becomes more likely after a “streak” of 1s.
- A streak in shots: a sequence of k ones in a row.
- after_streak: the elements of shots that immediately follow a streak.
- Hot hand: mean(after_streak) \(>\) mean(shots). No hot hand: mean(after_streak) \(\approx\) mean(shots).
- Plan: simulate shots from a known process (either with or without a hot hand) with known probability of success p, then compare mean(after_streak) to p.
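To make the definition concrete, here is a worked example with \(k = 3\) (it uses the test vector from exercise 1 below):

shots <- c(0, 1, 1, 1, 1, 0, 0, 0)
# Streaks of three 1s end at positions 4 and 5, so the shots that
# immediately follow a streak are shots[5] and shots[6]:
after_streak <- c(shots[5], shots[6])  # c(1, 0)
mean(after_streak)                     # [1] 0.5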
1. Write a function called get_avg_after_streak(). Given a vector shots, it should construct after_streak with \(k = 3\) and return mean(after_streak). Test your function using c(0, 1, 1, 1, 1, 0, 0, 0).
2. Write a function called draw_shots() that simulates a sequence of n_shots iid Bernoulli trials, each with probability of success prob_success.
3. Let T be the value returned by get_avg_after_streak(), where a streak is defined by \(k = 3\). Amos argues: “If Liz does not have a hot hand, then each shot is independent of the others with probability of success 1/2. This means the expected value of T will be 1/2.” Carry out a simulation with 10,000 replications to check Amos’ argument.
4. Repeat your simulation for each combination of n_shots \(\in \{50, 100, 200\}\) and prob_success \(\in \{0.4, 0.5, 0.6\}\). What do you conclude from your simulation?
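A possible skeleton for exercise 3, using this lecture's tools (a sketch: draw_shots() and get_avg_after_streak() are the functions from exercises 1 and 2, and n_shots = 100 is an arbitrary placeholder):

set.seed(1066)  # arbitrary seed
sim_T <- map_dbl(1:10000, \(i) {
  shots <- draw_shots(n_shots = 100, prob_success = 0.5)
  get_avg_after_streak(shots)
})
mean(sim_T, na.rm = TRUE)  # sequences with no streak give NaN; dropped here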