Monte Carlo Simulation Basics in R

Francis J. DiTraglia

University of Oxford

Monte Carlo Simulation: What & Why?

  1. Use a computer to repeatedly carry out a random experiment.
  2. Keep track of the outcomes; compute relative frequencies.
  3. Make plots / tables of results.

Casino de Monte-Carlo
  • In a simulation, we know the truth: check how estimators, confidence intervals, tests perform in finite samples.

  • Check probability calculations by simulating; approximate probabilities that can’t be computed analytically.

Some References on Simulation

To learn more about what’s going on “under the hood”

But what is randomness, Socrates?

Pseudorandom Numbers

  • R cannot generate “truly” random numbers
  • Computers are deterministic: same input, same output
  • “Random” draws on a computer are really pseudorandom
    • Intuitively: “as if random”
    • Generated by a deterministic algorithm
    • Statistically random: indistinguishable from real random numbers using statistical tests
  • Pseudorandom is good enough for our purposes
  • I’ll write RNG for “pseudo-random number generator” below

Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.

John von Neumann (1903-1957)

Rev. Marin Mersenne (1588-1648)

Twister (1966-present)

A Rough Idea of How it Works

  1. Find a way to make independent Uniform(0,1) draws.
  2. Find a way to transform uniforms into draws from the desired probability distribution.

Lehmer Random Number Generator

  1. Choose integers \(m, a\) where \(m > 0\) and \(0 < a < m\).
  2. Choose integer seed \(x_0\) where \(0 < x_0 < m\). (starting value)
  3. Construct \(\{x_n\}\) according to: \(\boxed{x_{n+1} = a x_n \mod m}\).
    • i.e. \(x_{n+1}=\) remainder when \(ax_n\) is divided by \(m\).
  4. \(\{x_n/m\} \approx\) sequence of iid Uniform(0,1) draws.

Note

Completely deterministic: same seed and \(m, a\) parameters \(\implies\) same draws!

Lehmer RNG: Example

You will expand / improve this example on the problem set.

# Parameters - not great choices: this is only an example
m <- 2^16 + 1 # positive integer, here: 65537
a <- 75 # positive integer between 0 and m

# Seed
x0 <- 1983 # between 0 and m

# x %% y is the remainder when x is divided by y
x1 <- (a * x0) %% m
x1 
[1] 17651
x2 <- (a * x1) %% m
x3 <- (a * x2) %% m
x4 <- (a * x3) %% m
x5 <- (a * x4) %% m

c(x1, x2, x3, x4, x5) / m
[1] 0.26932878 0.19965821 0.97436563 0.07742191 0.80664358

sample() - Draw from a Vector in R

sample(x, size, replace = FALSE, prob = NULL)
  • Generate a random sample of length size from the vector x
  • If x is a single positive integer, samples are drawn from 1:x
  • By default, draws are made without replacement
  • To draw with replacement set replace = TRUE
  • By default, elements of x are sampled with equal probability
  • To sample with different probabilities, supply them as prob

sample() - Examples

marbles <- c('red', 'blue', 'green') # a bowl with three marbles
sample(marbles, size = 2) # draw two marbles without replacement
[1] "blue" "red" 
sample(10, size = 5) # five draws without replacement from 1:10
[1] 7 2 5 6 4
# Sample with replacement
sample(c('Core', 'ERM', 'Rules'), 10, replace = TRUE) 
 [1] "ERM"   "Rules" "Rules" "Rules" "ERM"   "Rules" "Rules" "ERM"   "Core" 
[10] "Rules"
# Unequal probabilities: P(FALSE) = 0.7
sample(c(TRUE, FALSE), 20, replace = TRUE, prob = c(0.3, 0.7)) 
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
[13] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

🌱 Setting the Seed in R

set.seed(YOUR_SEED_GOES_HERE)
  • Remember: seed of an RNG is its initial condition and
  • RNG is deterministic, so same seed gives same draws.
  • Supply a single numeric value (interpreted as an integer)
set.seed(1983)
sample(100, 2)
[1] 67 18
sample(100, 2)
[1] 77 94
set.seed(1983)
sample(100, 2)
[1] 67 18

🌱 Why and when to set the seed?

  • Set the seed once at the beginning of your simulation study.
  • Ensures replicability:
    • You will get the same results if you re-run your code
    • I will get the same results as you if I test your code
  • Pick a number that you like or obtain a real random seed from random.org!

💪 Exercise A - (4 min)

  1. What happens if I don’t supply size to sample()?
  2. Does it matter if I set replace = TRUE or FALSE here?
sample(1:6, size = 1)
  3. Write a function called rbern() that uses sample() to make a sequence of \(n\) iid Bernoulli\((p)\) draws. Its arguments are n, the number of draws, and p, the probability of success. It should return a vector of n zeros and ones. Check your results by making a large number of draws and verifying that the fraction of ones is approximately \(p\).

Mind your Ps and Qs (and Ds and Rs).

  • Base R has functions for many common RVs
  • Each RV has an abbreviation: e.g. norm for normal
  • Each RV has four related functions:
    • d = density if continuous, mass function if discrete
    • p = CDF
    • q = quantile function (inverse CDF)
    • r makes random draws
  • E.g. dnorm(), pnorm(), qnorm(), rnorm(); a quick check follows below
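As a quick illustration (an addition to these slides, using the standard normal defaults), here is how the four functions fit together; the values in the comments are approximate:

# d, p, q, r for the standard normal (mean = 0, sd = 1 are the defaults)
dnorm(0)       # density at zero: 1 / sqrt(2 * pi), about 0.399
pnorm(1.96)    # CDF: P(Z <= 1.96), about 0.975
qnorm(0.975)   # quantile function, the inverse of pnorm: about 1.96
rnorm(3)       # three random draws from N(0, 1)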

Built-in Random Variables in R

R commands       Distribution
d/p/q/rbeta      Beta
d/p/q/rbinom     Binomial
d/p/q/rcauchy    Cauchy
d/p/q/rchisq     Chi-Squared
d/p/q/rexp       Exponential
d/p/q/rf         F
d/p/q/rgamma     Gamma
d/p/q/rgeom      Geometric
d/p/q/rhyper     Hypergeometric
d/p/q/rlogis     Logistic
d/p/q/rlnorm     Log Normal
d/p/q/rnbinom    Negative Binomial
d/p/q/rnorm      Normal
d/p/q/rpois      Poisson
d/p/q/rt         Student’s t
d/p/q/runif      Uniform
d/p/q/rweibull   Weibull

Example: Standard Normal Density

Warning

R parameterizes the normal distribution in terms of its mean and standard deviation.
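To see what this means in practice, here is a small check (an addition, not part of the original slides): a normal with variance 4 has standard deviation 2, so sd = 2 is what you pass to R.

# sd, not variance: a N(0, variance = 4) distribution has sd = 2 in R's notation
draws <- rnorm(1e5, mean = 0, sd = 2)
c(variance = var(draws), std_dev = sd(draws))  # roughly 4 and 2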

library(tidyverse)
mu <- 0
sigma <- 1

normal_density <- tibble(x = seq(from = -4, to = 4, by = 0.01)) |> 
  mutate(density = dnorm(x, mu, sigma))

normal_density |> 
  ggplot(aes(x, density)) +
  geom_line()

Example: Standard Normal Density

Example: Standard Normal CDF

Warning

R parameterizes the normal distribution in terms of its mean and standard deviation.

mu <- 0
sigma <- 1

normal_cdf <- tibble(x = seq(from = -4, to = 4, by = 0.01)) |> 
  mutate(cdf = pnorm(x, mu, sigma))

normal_cdf |> 
  ggplot(aes(x, cdf)) +
  geom_line()

Example: Standard Normal CDF

Example: Standard Normal Draws

set.seed(1234)
normal_sims <- rnorm(500, mu, sigma)
tibble(normal_sims) |> 
  ggplot(aes(x = normal_sims)) +
  geom_histogram(bins = 20)

Example: Binomial(10, 0.2) pmf

n <- 10
p <- 0.2

binom_pmf <- tibble(x = 0:n) |>  
  mutate(p = dbinom(x, n, p))

binom_pmf |> 
  ggplot(aes(x = x, y = p)) +
  geom_point() + 
  geom_segment(aes(xend = x, yend = 0)) + 
  labs(title = 'Binomial(10, 0.2) pmf', y = 'p(x)')

Example: Binomial(10, 0.2) pmf

Example: Binomial(10, 0.2) CDF

n <- 10
p <- 0.2

binom_cdf <- tibble(x = seq(0, n + 1, 0.01)) |>  
  mutate(cdf = pbinom(x, n, p))

binom_cdf |> 
  ggplot(aes(x = x, y = cdf)) +
  geom_line() +
  labs(title = 'Binomial(10, 0.2) CDF', y = 'F(x)')

Example: Binomial(10, 0.2) CDF

Example: Binomial(10, 0.2) Draws

set.seed(1693)
binom_sims <- rbinom(500, n, p)

tibble(binom_sims) |> 
  ggplot(aes(x = binom_sims)) +
  geom_bar()

💪 Exercise B - (10 min)

  1. Consult ?dexp() and then carry out the following:
    1. Plot the density and CDF of an Exponential(1) RV.
    2. Plot a histogram of 500 draws from an Exponential(1) distribution.
  2. Consult ?dpois() and then carry out the following:
    1. Plot the mass function and CDF of a Poisson(5) RV.
    2. Plot a barchart of 500 draws from a Poisson(5) distribution.
  3. Suppose that we observe a sample of \(n=16\) iid normal RVs with unknown mean \(\mu\) and known variance \(\sigma^2 = 1\).
    1. A confidence interval for \(\mu\) takes the form \(\bar{X}\pm \text{ME}\) where \(\text{ME}\) is the margin of error. Use qnorm() to compute \(\text{ME}\) for an 89% interval.
    2. Simulate 16 draws from a \(N(0,\sigma=4)\) distribution and construct the interval from part (a). Does it cover the true value of \(\mu\)?

The Inverse Transformation Method

  • Already know how to make uniform draws.
  • Simulate arbitrary RV with CDF \(F\) and quantile function \(Q\).
  • Inverse Transformation Method
    1. Generate \(U \sim \text{Uniform}(0,1)\).
    2. Set \(Y \equiv Q(U) \implies \mathbb{P}(Y\leq y) = F(y)\)
  • \(F\) continuous & strictly increasing \(\implies Q = F^{-1}\).
  • Otherwise, \(Q\) is the generalized inverse of \(F\).
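As a sketch of the recipe before the worked example (an addition; it leans on R’s built-in qnorm() quantile function): uniform draws pushed through \(Q\) should come out with the target distribution.

# Inverse transformation sketch: Uniform(0,1) draws through the N(0,1) quantile function
set.seed(1)
u <- runif(1e5)      # step 1: iid Uniform(0, 1) draws
y <- qnorm(u)        # step 2: Y = Q(U) should be (approximately) N(0, 1)
c(mean(y), sd(y))    # close to 0 and 1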

Example of Inverse Transformation

\(X \sim \text{Exponential}(1)\)

The CDF of this random variable is \(F(x) = 1 - e^{-x}\). Re-arranging and solving for \(x\) \[ \begin{align*} p &= 1 - e^{-x}\\ x &= -\log(1 - p) \end{align*} \] so \(F^{-1}(p) = -\log(1 - p)\) is the quantile function.

u <- runif(1e5)
exp1_sims <- -log(1 - u)
tibble(x = exp1_sims) |> 
  ggplot(aes(x)) +
  geom_histogram()

Example of Inverse Transformation

Proof: Inverse Transformation

\(U\sim \text{Uniform}(0, 1) \implies Y \equiv F^{-1}(U)\) has CDF \(F\)

\[ \begin{align*} \mathbb{P}(Y \leq y) &= \mathbb{P}\left[ F^{-1}(U) \leq y \right]\\ &= \mathbb{P}\left[ F\left( F^{-1}\left(U \right) \right) \leq F(y) \right] \\ &= \mathbb{P}\left[ U \leq F(y) \right]\\ &= F(y) \end{align*} \]

  1. \(F\) continuous & strictly increasing \(\Rightarrow\) has inverse \(F^{-1}\).
  2. \(F\) is strictly increasing \(\Rightarrow x_1 \leq x_2 \iff F(x_1) \leq F(x_2)\).
  3. \(F\left( F^{-1}\left(x \right)\right) = x\)
  4. \(U \sim \text{Uniform}(0,1) \implies \mathbb{P}(U \leq x) = x\)
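As a quick numerical check of this result (an addition, reusing the exp1_sims draws from the earlier Exponential(1) example): the simulated probability \(\mathbb{P}(Y \leq 1)\) should be close to \(F(1) = 1 - e^{-1}\).

# Empirical vs theoretical CDF at y = 1 for the Exponential(1) simulation
mean(exp1_sims <= 1)    # fraction of simulated draws at or below 1
pexp(1, rate = 1)       # F(1) = 1 - exp(-1), about 0.632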

Quantile-Quantile (QQ) Plots

  • Compare theoretical vs sample quantiles: do these draws really come from that distribution?
    • Set \(x = Q_\text{Theoretical}(p)\) and \(y = Q_\text{Sample}(p)\)
    • Plot \(x\) against \(y\) for a range of probabilities \(p\).
  • Interpretation
    • \(y = x \Rightarrow\) theoretical & sample quantiles agree.
    • \(y = mx + b \Rightarrow\) theoretical & sample quantiles agree after centering and scaling.
  • More details on QQ plots in this blog post.

Example - t Distribution (df = 3)

This distribution has heavy tails relative to a Normal

t3_sims <- rt(500, df = 3)
tibble(x = t3_sims) |> 
  ggplot(aes(sample = x)) + # source of data for *sample* quantiles 
  geom_qq()  # make the QQ plot: theoretical defaults to Normal(0,1)

Example - t Distribution (df = 3)

Sample and theoretical quantiles agree when we use the right distribution!

tibble(x = t3_sims) |> 
  ggplot(aes(sample = x)) +
  geom_qq(distribution = qt, dparams = list(df = 3))

Plot a line through 1st & 3rd Quartiles

tibble(x = t3_sims) |> 
  ggplot(aes(sample = x)) +
  geom_qq(distribution = qt, dparams = list(df = 3)) + 
  geom_qq_line(col = 'red', distribution = qt, dparams = list(df = 3))

Example: Exponential(1) Simulation

tibble(x = exp1_sims) |> 
  ggplot(aes(sample = x)) +
  geom_qq(distribution = qexp, dparams = list(rate = 1)) + 
  geom_qq_line(col = 'red', distribution = qexp, 
               dparams = list(rate = 1))

💪 Exercise C - (10 min)

Let \(X\) be a random variable with the following density function \[ f(x) = \left\{ \begin{array}{ll} x/2, & x \in [0, 2]\\ 0,& \text{otherwise.} \end{array} \right. \] Write code that uses the inverse transformation method to simulate iid draws from this distribution. Test to make sure that your code works correctly.

A Very Simple Monte Carlo Simulation

  • Compute probabilities for the sum of two fair, six-sided dice.
  • Here we know the answer, so we can easily check our code.
  • General Recipe:
    1. Write a function to carry out the experiment once.
    2. Use iteration to call that function a large number of times.
    3. Store and summarize the results; set seed for replicability.

Note

The general recipe can always be applied, but sometimes there’s a faster approach.

1. Write a function to carry out the experiment once. 🎲 🎲

dice_sum <- \() {
  # Roll a pair of fair, six-sided dice and return their sum
  die1 <- sample(1:6, 1)
  die2 <- sample(1:6, 1)
  die1 + die2
}

dice_sum()
[1] 9
dice_sum()
[1] 4
dice_sum()
[1] 5

2. Use iteration to call that function a large number of times.

  • We could use a for() loop, but there’s a better way.
  • Functional: a function that takes another function as an input and returns a vector.
  • Functionals can be used to eliminate loops, making code more transparent; see the sketch below.
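Here is a minimal side-by-side sketch (an addition, using the dice_sum() function defined above) of the loop version and the functional version of the same iteration:

# for() loop version: pre-allocate, then fill element by element
sims_loop <- numeric(100)
for (i in 1:100) {
  sims_loop[i] <- dice_sum()
}

# functional version: map_dbl() calls dice_sum() once per element of 1:100
# and returns a numeric vector (purrr is loaded as part of the tidyverse)
sims_map <- map_dbl(1:100, \(i) dice_sum())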

Functional Programming with purrr

Finishing the dice simulation

  2. Use iteration to call that function a large number of times.
library(tidyverse)
nreps <- 1e4
sims <- map_dbl(1:nreps, \(i) dice_sum())
  3. Summarize the results.
## What is the probability of rolling a 7?
mean(sims == 7)
[1] 0.1675
tibble(x = sims) |> 
  ggplot(aes(x)) + 
  geom_bar()

💪 Exercise D - (15 min)

This problem was initially posed by the famous 17th century gambler Antoine Gombaud, more commonly known as the Chevalier de Méré. Fermat and Pascal discussed its solution in their legendary correspondence that began the study of probability as we know it. Here’s the Chevalier’s question:

Which is more likely: (A) getting at least one six when rolling a single die four times or (B) getting at least one pair of sixes when rolling a pair of dice twenty-four times?

Answer the Chevalier’s question using Monte Carlo Simulation. Assume that all dice are fair and six-sided.

💪 Exercise E - (10 min)

Find an alternative, faster way to carry out the “sum of two fair dice” simulation from above. Hint: think about how you can avoid writing a function to carry out the simulation once and instead do the whole thing in a single step. Time your alternative approach to see how much faster it is.