Running a Simulation Study

The Whole Kit & Caboodle

Francis J. DiTraglia

University of Oxford

What is this?

  • Already had a lecture on Monte Carlo Simulation Basics
  • Recall the “General Recipe” for Monte Carlo:
    1. Write a function to carry out the experiment once.
    2. Use iteration to call that function a large number of times.
    3. Store and summarize results; set seed for replicability.
  • Today: more elaborate examples that call for a more complicated recipe and more powerful tools.

🐱 More on purrr

What is a functional?

  • A functional is a function that:
    • takes another function as an input or
    • returns another function as its output (see the sketch below).
  • Example: purrr::map(.x, .f), covered on the next slide.
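A minimal sketch of the second case, a function that returns a function (power() and square are illustrative names, not part of purrr):

power <- \(p) {
  # Return a function that raises its argument to the power p
  \(x) x^p
}
square <- power(2)
square(4) # 16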

map(.x, .f) replaces for()

\[ \texttt{map}(\mathbf{x}, f) = \texttt{map}\left(\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},\, f\right) = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix} \]

  • .x is a list or vector; .f is a function
  • map(x, f) gives same output as this for() loop:
results <- vector('list', length(x)) # pre-allocate empty list
for (i in seq_along(x)) { # safer than 1:length(x)
  results[[i]] <- f(x[[i]]) # x is a list; items could be *anything* 
} 

purrr::map() Example

sum_prod <- \(v) {
  # Return the sum and product of a vector v as a named vector
  c('sum' = sum(v), 'product' = prod(v))
}

x <- list(c(1, 1, 1), c(1, 2, 3), c(3, 3, 3))

library(tidyverse) # includes purrr
map(x, sum_prod)
[[1]]
    sum product 
      3       1 

[[2]]
    sum product 
      6       6 

[[3]]
    sum product 
      9      27 

Nice! But we may want to output something besides a list…

map_dfc() and map_dfr()

  • map_dfc() binds columns (c), returns a data frame (df)
map_dfc(x, sum_prod) # x has no names, so columns get default names!
# A tibble: 2 × 3
   ...1  ...2  ...3
  <dbl> <dbl> <dbl>
1     3     6     9
2     1     6    27
  • map_dfr() binds rows (r), returns a data frame (df)
map_dfr(x, sum_prod) # preserves column names
# A tibble: 3 × 2
    sum product
  <dbl>   <dbl>
1     3       1
2     6       6
3     9      27

Supplying a vector to map()

Here 1:4 is interpreted as though it were a list:

map(1:4, \(x) x^2)
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

Same works with map_dfc(), map_dfr() etc.

map_dbl(.x, .f) and friends

  • Common special case:
    • .f returns a scalar
    • Output: an atomic vector with the same length as .x
  • Specify the type of the output:
    • map_dbl() returns a double vector
    • map_chr() returns a character vector
    • map_int() returns an integer vector
    • map_lgl() returns a logical vector

map_xxx() Examples

map_dbl(1:10, \(x) x^2)
 [1]   1   4   9  16  25  36  49  64  81 100
map_int(c(5L, 10L, 13L, 7L, 3L), \(x) x %% 3)
[1] 2 1 1 1 0
map_chr(c('CORE', 'ERM', 'RULES'), tolower)
[1] "core"  "erm"   "rules"
map_lgl(c(TRUE, FALSE, TRUE, TRUE), \(x) !x)
[1] FALSE  TRUE FALSE FALSE

Note

map() isn’t really needed in these examples, since we could just use vectorized operations. They’re only for illustration!

purrr::map2(.x, .y, .f)

\[ \texttt{map2}\left(\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},\, \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \, f\right) = \begin{bmatrix} f(x_1, y_1) \\ f(x_2, y_2) \\ \vdots \\ f(x_n, y_n) \end{bmatrix} \]

  • .x and .y are vectors or lists of the same length
  • .f is a function that takes two arguments
  • Returns a list by default; variants map2_dfr(), map2_dfc(), map2_dbl() are analogous to map(). A for() loop equivalent is sketched below.
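As with map(), map2(x, y, f) gives the same output as this for() loop — a minimal sketch:

results <- vector('list', length(x)) # pre-allocate empty list
for (i in seq_along(x)) {
  results[[i]] <- f(x[[i]], y[[i]]) # walk x and y in parallel
}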

Parallel map: pmap(.l, .f)

\[ \texttt{pmap}\left(\begin{bmatrix} l_{11} & l_{12} & \cdots & l_{1r} \\ l_{21} & l_{22} & \cdots & l_{2r}\\ \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nr} \end{bmatrix}, \, f\right) = \begin{bmatrix} f(l_{11}, l_{12}, \dots, l_{1r}) \\ f(l_{21}, l_{22}, \dots , l_{2r}) \\ \vdots \\ f(l_{n1}, l_{n2}, \dots, l_{nr}) \end{bmatrix} \]

  • .l is a list of lists or a list of vectors (e.g. a data frame)
  • .f is a function that takes \(r\) arguments
  • Returns a list by default; variants pmap_dfr(), pmap_dfc(), pmap_dbl() are analogous to map(). A for() loop sketch for the data frame case follows below.
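When .l is a data frame, pmap(.l, f) gives the same output as this loop — a rough sketch that uses do.call() to spread row i into the arguments of f:

results <- vector('list', nrow(.l)) # pre-allocate empty list
for (i in seq_len(nrow(.l))) {
  results[[i]] <- do.call(f, as.list(.l[i, ])) # row i supplies the r arguments
}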

map2() and pmap() Examples

Argument names are arbitrary:

map2_dbl(6:10, 1:5, \(foo, bar) foo^bar)
[1]      6     49    512   6561 100000
df <- tibble(col1 = 1:5, col2 = 5:1, col3 = -2:2)
df
# A tibble: 5 × 3
   col1  col2  col3
  <int> <int> <int>
1     1     5    -2
2     2     4    -1
3     3     3     0
4     4     2     1
5     5     1     2

Argument names need to match the column names of df:

pmap_dbl(df, \(col1, col2, col3) col3 / (col1 + col2)^2)
[1] -0.05555556 -0.02777778  0.00000000  0.02777778  0.05555556

💪 Exercise A - (3 min)

Recall this code snippet from our earlier lecture on simulation:

dice_sum <- \() {
  # Roll a pair of fair, six-sided dice and return their sum
  die1 <- sample(1:6, 1)
  die2 <- sample(1:6, 1)
  die1 + die2
}
sims <- map_dbl(1:10000, \(i) dice_sum())
  1. Use your new-found knowledge of purrr to explain the final line of this snippet (the call to map_dbl()).
  2. Write a for() loop to replace that line.

Sim Example - Estimators of \(\sigma^2\)

  • \(X_1, ..., X_n \sim \text{iid N}(\mu, \sigma^2)\)
  • \(S^2\) is the “usual” estimator; \(\hat{\sigma}^2\) is the MLE \[ S^2 \equiv \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2, \quad \widehat{\sigma}^2 \equiv \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2. \]
  • \(S^2\) is unbiased; \(\mathbb{E}[\widehat{\sigma}^2] = (n-1)\sigma^2/n\).
  • \(\text{Bias}(\widehat{\sigma}^2) = \mathbb{E}(\hat{\sigma}^2 - \sigma^2) = -\sigma^2/n\)
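  • Why? Since \(\widehat{\sigma}^2 = \frac{n-1}{n} S^2\) and \(\mathbb{E}[S^2] = \sigma^2\): \[ \mathbb{E}[\widehat{\sigma}^2 - \sigma^2] = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}. \]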

The Whole Kit & Caboodle

  1. Write a function to generate simulation data.
  2. Write a function to calculate your estimates.
  3. Run the simulation for fixed parameters. Repeat many times:
    1. Draw simulation data.
    2. Calculate estimates.
  4. Repeat step 3 over a grid of parameter values.
  5. Store and summarize the results.

1 - Function to Generate Sim Data

draw_sim_data <- \(n, s_sq) {
  rnorm(n, sd = sqrt(s_sq))
}
  • For simplicity, take \(\mu = 0\) in our sim; this doesn’t affect the results, since both estimators subtract \(\bar{X}_n\).
  • Allow n and s_sq to vary.

2 - Function to Calculate Estimates

get_estimates <- \(x) {
  c('usual' = var(x),
    'MLE' = mean((x - mean(x))^2)) # divide by n; not (n - 1)
}

Note

You don’t necessarily have to return a vector: you could just as well return a list.
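For instance, a list-returning variant might look like this (a sketch; get_estimates_list is not used below):

get_estimates_list <- \(x) {
  # Same estimates as get_estimates(), returned as a list instead of a vector
  list('usual' = var(x),
       'MLE' = mean((x - mean(x))^2))
}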

possibly()

  • Sometimes an estimator may fail to exist for some datasets.
  • If an error occurs, don’t want the simulation to crash.
  • possibly() modifies a function so that it fails gracefully, suppressing errors and returning a designated value in case of failure:
x <- list('Core ERM', 3, 4)
map_dbl(x, log)
Error in `map_dbl()`:
ℹ In index: 1.
Caused by error:
! non-numeric argument to mathematical function
map_dbl(x, possibly(log, NA_real_))
[1]       NA 1.098612 1.386294
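In our simulation we could guard the estimation step in the same way. A sketch, where safe_get_estimates is an illustrative name: it behaves exactly like get_estimates() but returns NAs rather than crashing if an error occurs.

safe_get_estimates <- possibly(get_estimates,
                               c('usual' = NA_real_, 'MLE' = NA_real_))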

3 - Run Sim for Fixed Parameters

# Generate list of three simulation datasets as an example
set.seed(1693)
sim_datasets <- map(1:3, \(i) draw_sim_data(n = 5, s_sq = 1))
sim_datasets
[[1]]
[1] -0.5070343 -0.8750463 -0.1883818 -1.1772701  1.8960400

[[2]]
[1]  0.3824529  1.0291447 -1.3998332  1.3983002 -0.3950949

[[3]]
[1]  0.1204153 -0.9611377 -0.3870477  0.7757843  0.6520975
# Compute the estimates for all three datasets; bind rows
map_dfr(sim_datasets, get_estimates)
# A tibble: 3 × 2
  usual   MLE
  <dbl> <dbl>
1 1.47  1.18 
2 1.27  1.01 
3 0.527 0.421

3 - Run Sim for Fixed Parameters

Now that we have the idea, do more replications!

set.seed(1693)
nreps <- 5000
sim_datasets <- map(1:nreps, \(i) draw_sim_data(n = 5, s_sq = 1))
sim_estimates <- map_dfr(sim_datasets, get_estimates)
sim_estimates
# A tibble: 5,000 × 2
   usual   MLE
   <dbl> <dbl>
 1 1.47  1.18 
 2 1.27  1.01 
 3 0.527 0.421
 4 0.858 0.686
 5 0.863 0.690
 6 0.942 0.754
 7 0.521 0.417
 8 2.84  2.27 
 9 0.941 0.753
10 0.775 0.620
# ℹ 4,990 more rows

3 - Run Sim for Fixed Parameters

Using sim_estimates we can approximate the bias of \(\widehat{\sigma}^2\)

# Sim parameters: n = 5, s_sq = 1
n <- 5
s_sq <- 1
sim_estimates |> 
  summarize(bias_true = -s_sq / n, bias_sim = mean(MLE - s_sq))
# A tibble: 1 × 2
  bias_true bias_sim
      <dbl>    <dbl>
1      -0.2   -0.197

Great: our simulation appears to work!
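We could check the sampling variability in the same way: for iid normal data \(\text{Var}(S^2) = 2\sigma^4/(n-1)\), so a further comparison (a sketch; output not shown) would be:

sim_estimates |> 
  summarize(var_true = 2 * s_sq^2 / (n - 1), var_sim = var(usual))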

4 - Run Sim over Parameter Grid

Wrap step 3 from above into a function:

run_sim <- \(n, s_sq, nreps = 5000) { # default: nreps = 5000
  map(1:nreps, \(i) draw_sim_data(n, s_sq)) |> 
    map_dfr(get_estimates)
}
set.seed(1693)
run_sim(n = 5, s_sq = 1)
# A tibble: 5,000 × 2
   usual   MLE
   <dbl> <dbl>
 1 1.47  1.18 
 2 1.27  1.01 
 3 0.527 0.421
 4 0.858 0.686
 5 0.863 0.690
 6 0.942 0.754
 7 0.521 0.417
 8 2.84  2.27 
 9 0.941 0.753
10 0.775 0.620
# ℹ 4,990 more rows

4 - Run Sim over Parameter Grid

Use expand_grid() from tidyr to set up parameter grid

sim_params <- expand_grid(n = 3:5, s_sq = 1:3)
sim_params
# A tibble: 9 × 2
      n  s_sq
  <int> <int>
1     3     1
2     3     2
3     3     3
4     4     1
5     4     2
6     4     3
7     5     1
8     5     2
9     5     3

4 - Run Sim over Parameter Grid

Use purrr::pmap() to call run_sim() on each row of sim_params

sim_results <- pmap(sim_params, run_sim)
sim_results
[[1]]
# A tibble: 5,000 × 2
    usual    MLE
    <dbl>  <dbl>
 1 0.370  0.247 
 2 0.373  0.249 
 3 0.135  0.0903
 4 2.95   1.97  
 5 0.0589 0.0393
 6 0.975  0.650 
 7 1.63   1.09  
 8 1.30   0.865 
 9 1.53   1.02  
10 0.823  0.548 
# ℹ 4,990 more rows

[[2]]
# A tibble: 5,000 × 2
   usual    MLE
   <dbl>  <dbl>
 1 0.551 0.368 
 2 0.513 0.342 
 3 0.478 0.319 
 4 1.54  1.03  
 5 0.410 0.274 
 6 2.15  1.43  
 7 0.447 0.298 
 8 0.145 0.0969
 9 0.455 0.303 
10 0.466 0.311 
# ℹ 4,990 more rows

[[3]]
# A tibble: 5,000 × 2
    usual    MLE
    <dbl>  <dbl>
 1 2.31   1.54  
 2 0.460  0.306 
 3 0.598  0.399 
 4 2.32   1.55  
 5 0.0177 0.0118
 6 1.45   0.965 
 7 6.55   4.37  
 8 2.81   1.87  
 9 0.261  0.174 
10 2.58   1.72  
# ℹ 4,990 more rows

[[4]]
# A tibble: 5,000 × 2
   usual    MLE
   <dbl>  <dbl>
 1 1.37  1.03  
 2 0.790 0.592 
 3 0.507 0.380 
 4 0.859 0.644 
 5 0.584 0.438 
 6 1.42  1.06  
 7 6.23  4.67  
 8 1.24  0.932 
 9 0.630 0.472 
10 0.123 0.0924
# ℹ 4,990 more rows

[[5]]
# A tibble: 5,000 × 2
   usual   MLE
   <dbl> <dbl>
 1 2.99  2.24 
 2 0.845 0.633
 3 0.635 0.476
 4 1.59  1.20 
 5 3.36  2.52 
 6 1.64  1.23 
 7 0.685 0.514
 8 9.98  7.49 
 9 1.95  1.46 
10 1.25  0.934
# ℹ 4,990 more rows

[[6]]
# A tibble: 5,000 × 2
   usual   MLE
   <dbl> <dbl>
 1 5.39  4.04 
 2 1.98  1.48 
 3 3.48  2.61 
 4 5.56  4.17 
 5 3.45  2.59 
 6 3.66  2.75 
 7 1.36  1.02 
 8 0.597 0.448
 9 2.49  1.87 
10 9.44  7.08 
# ℹ 4,990 more rows

[[7]]
# A tibble: 5,000 × 2
    usual    MLE
    <dbl>  <dbl>
 1 0.649  0.519 
 2 0.0978 0.0782
 3 0.988  0.791 
 4 0.944  0.755 
 5 0.661  0.528 
 6 1.37   1.09  
 7 1.17   0.938 
 8 1.32   1.05  
 9 0.998  0.798 
10 2.02   1.61  
# ℹ 4,990 more rows

[[8]]
# A tibble: 5,000 × 2
   usual   MLE
   <dbl> <dbl>
 1 2.25  1.80 
 2 0.337 0.270
 3 1.69  1.35 
 4 2.05  1.64 
 5 1.48  1.18 
 6 0.946 0.757
 7 4.99  3.99 
 8 0.865 0.692
 9 0.635 0.508
10 3.32  2.66 
# ℹ 4,990 more rows

[[9]]
# A tibble: 5,000 × 2
   usual   MLE
   <dbl> <dbl>
 1 0.637 0.510
 2 3.90  3.12 
 3 2.11  1.69 
 4 5.30  4.24 
 5 2.28  1.82 
 6 2.07  1.65 
 7 3.19  2.56 
 8 0.742 0.594
 9 6.42  5.13 
10 2.55  2.04 
# ℹ 4,990 more rows

5 - Summarize Results

get_summary_stats <- function(df) {
  c('usual_mean' = mean(df$usual), 
    'MLE_mean' = mean(df$MLE), 
    'usual_var' = var(df$usual), 
    'MLE_var' = var(df$MLE))
}
summary_stats <- map_dfr(sim_results, get_summary_stats)
summary_stats <- bind_cols(sim_params, summary_stats)
summary_stats
# A tibble: 9 × 6
      n  s_sq usual_mean MLE_mean usual_var MLE_var
  <int> <int>      <dbl>    <dbl>     <dbl>   <dbl>
1     3     1      0.990    0.660     0.966   0.429
2     3     2      1.96     1.31      3.82    1.70 
3     3     3      2.93     1.95      8.58    3.81 
4     4     1      0.995    0.746     0.688   0.387
5     4     2      2.00     1.50      2.60    1.46 
6     4     3      2.95     2.21      5.80    3.26 
7     5     1      0.997    0.797     0.484   0.310
8     5     2      2.02     1.62      2.09    1.34 
9     5     3      3.03     2.42      4.71    3.01 

5 - Summarize Results

summary_stats |> 
  mutate(MLE_bias = -s_sq / n, 
         MLE_sim_bias = MLE_mean - s_sq,
         usual_sim_bias = usual_mean - s_sq,
         MLE_rmse = sqrt(MLE_sim_bias^2 + MLE_var),
         usual_rmse = sqrt(usual_sim_bias^2 + usual_var)) |> 
  select(MLE_bias, MLE_sim_bias, MLE_rmse, usual_rmse)
# A tibble: 9 × 4
  MLE_bias MLE_sim_bias MLE_rmse usual_rmse
     <dbl>        <dbl>    <dbl>      <dbl>
1   -0.333       -0.340    0.738      0.983
2   -0.667       -0.691    1.48       1.96 
3   -1           -1.05     2.22       2.93 
4   -0.25        -0.254    0.672      0.829
5   -0.5         -0.498    1.31       1.61 
6   -0.75        -0.789    1.97       2.41 
7   -0.2         -0.203    0.592      0.696
8   -0.4         -0.383    1.22       1.45 
9   -0.6         -0.578    1.83       2.17 
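A natural way to digest these summaries, not shown above, is a quick plot of simulated RMSE against n with one panel per value of \(\sigma^2\) — a sketch:

summary_stats |> 
  mutate(MLE_rmse = sqrt((MLE_mean - s_sq)^2 + MLE_var),
         usual_rmse = sqrt((usual_mean - s_sq)^2 + usual_var)) |> 
  pivot_longer(ends_with('rmse'), names_to = 'estimator', values_to = 'rmse') |> 
  ggplot(aes(x = n, y = rmse, color = estimator)) +
  geom_line() +
  facet_wrap(~ s_sq)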

All Roads Lead to Rome

  • Many different ways to set up the same simulation study.
  • Trade-offs: speed / memory usage / ease of testing
  • I could have written a function to run the simulation once:
draw_sim_rep <- \(...) {
  draw_sim_data(...) |> # pass n and s_sq through to draw_sim_data()
    get_estimates()
}
get_estimates2 <- \(nreps = 5000, ...) {
  map_dfr(1:nreps, \(i) draw_sim_rep(...)) # each rep returns a named vector
}
  • Could also have written the code for draw_sim_data() and get_estimates() directly inside draw_sim_rep().
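For instance, a fully inlined sketch (draw_sim_rep_inline is a hypothetical name):

draw_sim_rep_inline <- \(n, s_sq) {
  # Draw the data and compute both estimates in a single function
  x <- rnorm(n, sd = sqrt(s_sq))
  c('usual' = var(x), 'MLE' = mean((x - mean(x))^2))
}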

Kahneman - Thinking Fast & Slow

Amos [Tversky] and his students Tom Gilovich and Robert Vallone caused a stir with their study of misperceptions of randomness in basketball. The “fact” that players occasionally acquire a hot hand is generally accepted by players, coaches, and fans. The inference is irresistible: a player sinks three or four baskets in a row and you cannot help forming the causal judgment that this player is now hot, with a temporarily increased propensity to score. Players on both teams adapt to this judgment—teammates are more likely to pass to the hot scorer and the defense is more likely to double-team.

Kahneman - Thinking Fast & Slow

Analysis of thousands of sequences of shots led to a disappointing conclusion: there is no such thing as a hot hand in professional basketball, either in shooting from the field or scoring from the foul line. Of course, some players are more accurate than others, but the sequence of successes and missed shots satisfies all tests of randomness. The hot hand is entirely in the eye of the beholders, who are consistently too quick to perceive order and causality in randomness. The hot hand is a massive and widespread cognitive illusion.

Oops!

Simulation is more powerful than derivation. Don’t be clever; be right.

In Search of the Hot Hand 🔥

  • shots is a vector of zeros and ones recording a basketball player’s shots (1 = made, 0 = missed).
  • Hot hand: a 1 becomes more likely after a “streak” of 1s.
  • Before Miller & Sanjurjo (2018), here’s how researchers looked for a hot hand:
    • Identify each streak in shots: a sequence of k ones in a row.
    • For each streak find the shot that immediately follows it.
    • Collect these shots into a vector after_streak (see the worked example after this list).
    • Hot hand \(\implies\) mean(after_streak) \(>\) mean(shots)
    • iid \(\implies\) mean(after_streak) \(\approx\) mean(shots)
  • Checking this procedure using Monte Carlo:
    • Simulate shots from a known process (either with or without a hot hand)
    • We know the overall probability of success: call it p
    • Compare mean(after_streak) to p
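To make the streak definition concrete, here is a hand-worked illustration (not a general implementation) using the test vector from Exercise B below:

shots <- c(0, 1, 1, 1, 1, 0, 0, 0)
# With k = 3, the runs of three consecutive ones end at positions 4 and 5,
# so the shots immediately following a streak are shots[5] and shots[6]
after_streak <- c(shots[5], shots[6]) # c(1, 0)
mean(after_streak)                    # 0.5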

💪 Exercise B - (35 min)

  1. Write a function called get_avg_after_streak(). Given a vector shots, it should construct after_streak with \(k = 3\) and return mean(after_streak). Test your function using c(0, 1, 1, 1, 1, 0, 0, 0).
  2. Write a function called draw_shots() that simulates a sequence of n_shots iid Bernoulli trials, each with probability of success prob_success.
  3. Amos knows that Liz makes 1/2 of her shots on average. He watches her take 100 shots and then computes \(T =\) get_avg_after_streak(shots), where a streak is defined by \(k = 3\). Amos argues: “If Liz does not have a hot hand, then each shot is independent of the others with probability of success 1/2. This means the expected value of T will be 1/2.” Carry out a simulation with 10,000 replications to check Amos’ argument.
  4. Repeat the preceding over a parameter grid with n_shots \(\in \{50, 100, 200\}\) and prob_success \(\in \{0.4, 0.5, 0.6\}\). What do you conclude from your simulation?