The Whole Kit & Caboodle
University of Oxford
purrr
We met purrr::map() in an earlier lecture. More today!

purrr::map(.x, .f)

map(.x, .f) replaces a for() loop:

\[ \texttt{map}(\mathbf{x}, f) = \texttt{map}\left(\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},\, f\right) = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix} \]

- .x is a list or vector; .f is a function
- map(.x, .f) gives the same output as this for() loop (sketched below):
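A minimal sketch of the equivalent loop; the slide's original code didn't survive extraction, so the names x, f, and out are illustrative:

# Pre-allocate a list, then fill it one element at a time
out <- vector("list", length(x))
for (i in seq_along(x)) {
  out[[i]] <- f(x[[i]])
}
out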
purrr::map() Example

sum_prod <- \(v) {
# Return the sum and product of a vector v as a named vector
c('sum' = sum(v), 'product' = prod(v))
}
x <- list(c(1, 1, 1), c(1, 2, 3), c(3, 3, 3))
library(tidyverse) # includes purrr
map(x, sum_prod)
[[1]]
sum product
3 1
[[2]]
sum product
6 6
[[3]]
sum product
9 27
Nice! But we may want to output something besides a list…
map_dfc() and map_dfr()

map_dfc() binds columns (c), returns a data frame (df):

map_dfc(x, sum_prod)
# A tibble: 2 × 3
...1 ...2 ...3
<dbl> <dbl> <dbl>
1 3 6 9
2 1 6 27
map_dfr() binds rows (r), returns a data frame (df):
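The companion call's output didn't survive extraction; presumably it looked like this, reusing x and sum_prod from above:

map_dfr(x, sum_prod)  # 3 rows, one per list element; columns sum and product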
map()

Here 1:5 is interpreted as though it were a list:
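The example itself was lost in extraction; a minimal hypothetical stand-in:

map(1:5, \(i) i^2)  # a list of length 5: 1, 4, 9, 16, 25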
Same works with map_dfc(), map_dfr(), etc.
map_dbl(.x, .f) and friends

- .f returns a scalar for each element of .x
- map_dbl() returns a double vector
- map_chr() returns a character vector
- map_int() returns an integer vector
- map_lgl() returns a logical vector

map_xxx() Examples

[1] 1 4 9 16 25 36 49 64 81 100
[1] 2 1 1 1 0
[1] "core" "erm" "rules"
[1] FALSE TRUE FALSE FALSE
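The calls behind these outputs were lost in extraction. For instance, the first line could have come from a call like this (hypothetical input):

map_dbl(1:10, \(x) x^2)  # squares of 1 through 10, as a double vector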
Note
map() isn’t really needed in these examples, since we could just use vectorized operations. They’re only for illustration!
purrr::map2(.x, .y, .f)
\[ \texttt{map2}\left(\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},\, \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \, f\right) = \begin{bmatrix} f(x_1, y_1) \\ f(x_2, y_2) \\ \vdots \\ f(x_n, y_n) \end{bmatrix} \]
- .x and .y are vectors or lists of the same length
- .f is a function that takes two arguments
- map2_dfr(), map2_dfc(), map2_dbl(), etc. are analogous to the map() versions
pmap(.l, .f)
\[ \texttt{pmap}\left(\begin{bmatrix} l_{11} & l_{12} & \cdots & l_{1r} \\ l_{21} & l_{22} & \cdots & l_{2r}\\ \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nr} \end{bmatrix}, \, f\right) = \begin{bmatrix} f(l_{11}, l_{12}, \dots, l_{1r}) \\ f(l_{21}, l_{22}, \dots , l_{2r}) \\ \vdots \\ f(l_{n1}, l_{n2}, \dots, l_{nr}) \end{bmatrix} \]
- .l is a list of lists or a list of vectors (e.g. a data frame)
- .f is a function that takes \(r\) arguments
- pmap_dfr(), pmap_dfc(), pmap_dbl(), etc. are analogous to the map() versions
map2() and pmap() Examples

Argument names are arbitrary:
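The original call was lost; one map2() call consistent with the output below (the inputs 6:10 and 1:5 are assumptions, not the original):

map2_dbl(6:10, 1:5, \(x, y) x^y)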
[1] 6 49 512 6561 100000
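The tibble printed next is presumably the data frame built for the pmap() example, along these lines (hypothetical reconstruction matching the printed values):

df <- tibble(col1 = 1:5, col2 = 5:1, col3 = -2:2)
df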
# A tibble: 5 × 3
col1 col2 col3
<int> <int> <int>
1 1 5 -2
2 2 4 -1
3 3 3 0
4 4 2 1
5 5 1 2
Argument names need to match the column names of df:
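A sketch of such a call, assuming df is the tibble above; the function body is illustrative:

# The argument names col1, col2, col3 must match df's column names
pmap_dbl(df, \(col1, col2, col3) col1 + col2 + col3)  # returns 4 5 6 7 8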
Recall this code snippet from our earlier lecture on simulation:
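The snippet itself didn't survive extraction. A plausible sketch of the two functions it defined, inferred from the outputs further below (where MLE always equals (n - 1)/n times usual):

# Hypothetical reconstructions, not the original code
draw_sim_data <- \(n, s_sq) {
  rnorm(n, mean = 0, sd = sqrt(s_sq))  # n draws from N(0, s_sq)
}

get_estimates <- \(x) {
  # Sample variance (denominator n - 1) and the MLE (denominator n)
  usual <- var(x)
  c('usual' = usual, 'MLE' = usual * (length(x) - 1) / length(x))
}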
- Use purrr to explain line 7.
- Write a for() loop to replace line 7.
- Allow n and s_sq to vary.

Note
You don’t necessarily have to return a vector: you could just as well return a list.
possibly()

possibly() allows a function to fail gracefully, suppressing errors and storing a designated value in case of failure:
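A minimal sketch, wrapping base R's log() (the wrapped function and inputs are illustrative, not the slide's original example):

safe_log <- possibly(log, otherwise = NA_real_)
safe_log(10)     # 2.302585, as usual
safe_log("ten")  # NA instead of an error

# Generate list of three simulation datasets as an example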
set.seed(1693)
sim_datasets <- map(1:3, \(i) draw_sim_data(n = 5, s_sq = 1))
sim_datasets
[[1]]
[1] -0.5070343 -0.8750463 -0.1883818 -1.1772701 1.8960400
[[2]]
[1] 0.3824529 1.0291447 -1.3998332 1.3983002 -0.3950949
[[3]]
[1] 0.1204153 -0.9611377 -0.3870477 0.7757843 0.6520975
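The call producing the next table was dropped in extraction; given the code that appears later, it was presumably:

map_dfr(sim_datasets, get_estimates)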
# A tibble: 3 × 2
usual MLE
<dbl> <dbl>
1 1.47 1.18
2 1.27 1.01
3 0.527 0.421
Now that we have the idea, do more replications!
set.seed(1693)
nreps <- 5000
sim_datasets <- map(1:nreps, \(i) draw_sim_data(n = 5, s_sq = 1))
sim_estimates <- map_dfr(sim_datasets, get_estimates)
sim_estimates
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 1.47 1.18
2 1.27 1.01
3 0.527 0.421
4 0.858 0.686
5 0.863 0.690
6 0.942 0.754
7 0.521 0.417
8 2.84 2.27
9 0.941 0.753
10 0.775 0.620
# ℹ 4,990 more rows
Using sim_estimates we can approximate the bias of \(\widehat{\sigma}^2\):

# Sim parameters: n = 5, s_sq = 1
n <- 5
s_sq <- 1
sim_estimates |>
summarize(bias_true = -s_sq / n, bias_sim = mean(MLE - s_sq))
# A tibble: 1 × 2
bias_true bias_sim
<dbl> <dbl>
1 -0.2 -0.197
Great: our simulation appears to work!
Wrap step 3 from above into a function:
run_sim <- \(n, s_sq, nreps = 5000) { # default: nreps = 5000
map(1:nreps, \(i) draw_sim_data(n, s_sq)) |>
map_dfr(get_estimates)
}
set.seed(1693)
run_sim(n = 5, s_sq = 1)
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 1.47 1.18
2 1.27 1.01
3 0.527 0.421
4 0.858 0.686
5 0.863 0.690
6 0.942 0.754
7 0.521 0.417
8 2.84 2.27
9 0.941 0.753
10 0.775 0.620
# ℹ 4,990 more rows
Use expand_grid() from tidyr to set up the parameter grid:
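The grid-construction code was lost; a reconstruction consistent with the nine parameter combinations in the results below (hypothetical but strongly implied):

sim_params <- expand_grid(n = 3:5, s_sq = 1:3)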
Use purrr::pmap() to call run_sim() on each row of sim_params:
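Again the call itself was lost; plausibly something like this, relying on the column names n and s_sq matching run_sim()'s arguments:

set.seed(1693)
sim_results <- pmap(sim_params, run_sim)
sim_results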
[[1]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 0.370 0.247
2 0.373 0.249
3 0.135 0.0903
4 2.95 1.97
5 0.0589 0.0393
6 0.975 0.650
7 1.63 1.09
8 1.30 0.865
9 1.53 1.02
10 0.823 0.548
# ℹ 4,990 more rows
[[2]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 0.551 0.368
2 0.513 0.342
3 0.478 0.319
4 1.54 1.03
5 0.410 0.274
6 2.15 1.43
7 0.447 0.298
8 0.145 0.0969
9 0.455 0.303
10 0.466 0.311
# ℹ 4,990 more rows
[[3]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 2.31 1.54
2 0.460 0.306
3 0.598 0.399
4 2.32 1.55
5 0.0177 0.0118
6 1.45 0.965
7 6.55 4.37
8 2.81 1.87
9 0.261 0.174
10 2.58 1.72
# ℹ 4,990 more rows
[[4]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 1.37 1.03
2 0.790 0.592
3 0.507 0.380
4 0.859 0.644
5 0.584 0.438
6 1.42 1.06
7 6.23 4.67
8 1.24 0.932
9 0.630 0.472
10 0.123 0.0924
# ℹ 4,990 more rows
[[5]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 2.99 2.24
2 0.845 0.633
3 0.635 0.476
4 1.59 1.20
5 3.36 2.52
6 1.64 1.23
7 0.685 0.514
8 9.98 7.49
9 1.95 1.46
10 1.25 0.934
# ℹ 4,990 more rows
[[6]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 5.39 4.04
2 1.98 1.48
3 3.48 2.61
4 5.56 4.17
5 3.45 2.59
6 3.66 2.75
7 1.36 1.02
8 0.597 0.448
9 2.49 1.87
10 9.44 7.08
# ℹ 4,990 more rows
[[7]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 0.649 0.519
2 0.0978 0.0782
3 0.988 0.791
4 0.944 0.755
5 0.661 0.528
6 1.37 1.09
7 1.17 0.938
8 1.32 1.05
9 0.998 0.798
10 2.02 1.61
# ℹ 4,990 more rows
[[8]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 2.25 1.80
2 0.337 0.270
3 1.69 1.35
4 2.05 1.64
5 1.48 1.18
6 0.946 0.757
7 4.99 3.99
8 0.865 0.692
9 0.635 0.508
10 3.32 2.66
# ℹ 4,990 more rows
[[9]]
# A tibble: 5,000 × 2
usual MLE
<dbl> <dbl>
1 0.637 0.510
2 3.90 3.12
3 2.11 1.69
4 5.30 4.24
5 2.28 1.82
6 2.07 1.65
7 3.19 2.56
8 0.742 0.594
9 6.42 5.13
10 2.55 2.04
# ℹ 4,990 more rows
get_summary_stats <- function(df) {
  # Mean and variance of each estimator across the simulation replications
  c('usual_mean' = mean(df$usual),
'MLE_mean' = mean(df$MLE),
'usual_var' = var(df$usual),
'MLE_var' = var(df$MLE))
}
summary_stats <- map_dfr(sim_results, get_summary_stats)
summary_stats <- bind_cols(sim_params, summary_stats)
summary_stats
# A tibble: 9 × 6
n s_sq usual_mean MLE_mean usual_var MLE_var
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 3 1 0.990 0.660 0.966 0.429
2 3 2 1.96 1.31 3.82 1.70
3 3 3 2.93 1.95 8.58 3.81
4 4 1 0.995 0.746 0.688 0.387
5 4 2 2.00 1.50 2.60 1.46
6 4 3 2.95 2.21 5.80 3.26
7 5 1 0.997 0.797 0.484 0.310
8 5 2 2.02 1.62 2.09 1.34
9 5 3 3.03 2.42 4.71 3.01
summary_stats |>
mutate(MLE_bias = -s_sq / n,
MLE_sim_bias = MLE_mean - s_sq,
usual_sim_bias = usual_mean - s_sq,
MLE_rmse = sqrt(MLE_sim_bias^2 + MLE_var),
usual_rmse = sqrt(usual_sim_bias^2 + usual_var)) |>
select(MLE_bias, MLE_sim_bias, MLE_rmse, usual_rmse)
# A tibble: 9 × 4
MLE_bias MLE_sim_bias MLE_rmse usual_rmse
<dbl> <dbl> <dbl> <dbl>
1 -0.333 -0.340 0.738 0.983
2 -0.667 -0.691 1.48 1.96
3 -1 -1.05 2.22 2.93
4 -0.25 -0.254 0.672 0.829
5 -0.5 -0.498 1.31 1.61
6 -0.75 -0.789 1.97 2.41
7 -0.2 -0.203 0.592 0.696
8 -0.4 -0.383 1.22 1.45
9 -0.6 -0.578 1.83 2.17
For the exercise below, write your own versions of draw_sim_data() and get_estimates(), simulating the shot sequences within draw_sim_data().
Amos [Tversky] and his students Tom Gilovich and Robert Vallone caused a stir with their study of misperceptions of randomness in basketball. The “fact” that players occasionally acquire a hot hand is generally accepted by players, coaches, and fans. The inference is irresistible: a player sinks three or four baskets in a row and you cannot help forming the causal judgment that this player is now hot, with a temporarily increased propensity to score. Players on both teams adapt to this judgment—teammates are more likely to pass to the hot scorer and the defense is more likely to doubleteam.
Analysis of thousands of sequences of shots led to a disappointing conclusion: there is no such thing as a hot hand in professional basketball, either in shooting from the field or scoring from the foul line. Of course, some players are more accurate than others, but the sequence of successes and missed shots satisfies all tests of randomness. The hot hand is entirely in the eye of the beholders, who are consistently too quick to perceive order and causality in randomness. The hot hand is a massive and widespread cognitive illusion.
Simulation is more powerful than derivation. Don’t be clever; be right.
- shots is a vector of zeros and ones: shots made by a basketball player
- Hot hand: 1 becomes more likely after a “streak” of 1s.
- Streak in shots: a sequence of k ones in a row
- Store the shots immediately following each streak in after_streak.
- Hot hand: mean(after_streak) \(>\) mean(shots)
- No hot hand: mean(after_streak) \(\approx\) mean(shots)
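A minimal sketch of the test, assuming a streak length of k = 3 and an i.i.d. shooter with no hot hand by construction (all names and parameters here are illustrative):

set.seed(1693)
k <- 3
shots <- rbinom(200, size = 1, prob = 0.5)  # independent shots: no hot hand

# For each later position, were the k preceding shots all ones?
follows_streak <- map_lgl((k + 1):length(shots),
                          \(i) all(shots[(i - k):(i - 1)] == 1))
after_streak <- shots[(k + 1):length(shots)][follows_streak]

mean(shots)         # overall success rate
mean(after_streak)  # success rate immediately after a streak of k ones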
Note
For a more readable explanation of the Hot Hand bias, see Miller & Sanjurjo (2019).