Monte Carlo Simulation Basics in R - Solutions
Exercise A - (4 min)
sample(1:6, size = 1)
- What happens if I don’t supply size to sample()?
- Does it matter if I set replace = TRUE or FALSE here?
- Write a function called rbern() that uses sample() to make a sequence of \(n\) iid Bernoulli\((p)\) draws. Its arguments are n, the number of draws, and p, the probability of success. It should return a vector of n zeros and ones. Check your results by making a large number of draws and verifying that the fraction of ones is approximately \(p\).
Solutions
- It sets size to length(x).
- No: since we’re only making a single draw from 1:6, it doesn’t matter if we draw with or without replacement.
- E.g.
rbern <- \(n, p) {
  sample(0:1, size = n, replace = TRUE, prob = c(1 - p, p))
}

mean(rbern(1e6, 0.2)) - 0.2
[1] -0.000149
mean(rbern(1e6, 0.4)) - 0.4
[1] 0.00114
mean(rbern(1e6, 0.6)) - 0.6
[1] -9.1e-05
mean(rbern(1e6, 0.8)) - 0.8
[1] 0.000133
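To see the first point concretely: if size is omitted, sample() uses length(x), so sample(1:6) returns a random permutation of all six values.

sample(1:6)          # a permutation of 1, 2, ..., 6
length(sample(1:6))  # always 6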
Exercise B - (10 min)
- Consult ?dexp() and then carry out the following:
  - Plot the density and CDF of an Exponential(1) RV.
  - Plot a histogram of 500 draws from an Exponential(1) distribution.
- Consult ?dpois() and then carry out the following:
  - Plot the mass function and CDF of a Poisson(5) RV.
  - Plot a barchart of 500 draws from a Poisson(5) distribution.
- Suppose that we observe a sample of \(n=16\) iid normal RVs with unknown mean \(\mu\) and known variance \(\sigma^2 = 1\).
  - A confidence interval for \(\mu\) takes the form \(\bar{X}\pm \text{ME}\), where \(\text{ME}\) is the margin of error. Use qnorm() to compute \(\text{ME}\) for an 89% interval.
  - Simulate 16 draws from a \(N(0, 1)\) distribution and construct the interval from part (a). Does it cover the true value of \(\mu\)?
Solution
library(tidyverse)
# Part 1a
exponential1 <- tibble(x = seq(0, 5, 0.01)) |>
  mutate(density = dexp(x, 1), cdf = pexp(x, 1))

exponential1 |>
  ggplot(aes(x, density)) +
  geom_line() +
  labs(title = 'Exponential(1) Density')

exponential1 |>
  ggplot(aes(x, cdf)) +
  geom_line() +
  labs(title = 'Exponential(1) CDF')
# Part 1b
set.seed(690872)
tibble(x = rexp(500, 1)) |>
  ggplot(aes(x)) +
  geom_histogram(bins = 25)
# Part 2a
# pmf
tibble(x = 0:20) |>
  mutate(p = dpois(x, 5)) |>
  ggplot(aes(x, p)) +
  geom_point() +
  geom_segment(aes(xend = x, yend = 0)) +
  labs(title = 'Poisson(5) pmf', y = 'p(x)')
# CDF
tibble(x = seq(0, 20, 0.01)) |>
  mutate(cdf = ppois(x, 5)) |>
  ggplot(aes(x, cdf)) +
  geom_line() +
  labs(title = 'Poisson(5) CDF', y = 'F(x)')
# Part 2b
poisson5_sims <- rpois(500, 5)

tibble(poisson5_sims) |>
  ggplot(aes(x = poisson5_sims)) +
  geom_bar()
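A quick sanity check on these draws: a Poisson(5) random variable has mean and variance both equal to 5, so the sample mean and variance should both be close to 5.

mean(poisson5_sims)  # should be close to 5
var(poisson5_sims)   # should also be close to 5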
# Part 3a
n <- 16
sigma <- 1
SE <- sigma / sqrt(n)
alpha <- (1 - 0.89)
ME <- qnorm(1 - alpha / 2) * SE
# Just under 0.4
ME
[1] 0.3995483
# Part 3b
xbar <- rnorm(n, mean = 0, sd = sigma) |>
  mean()
CI <- xbar + c(-1, 1) * ME
CI
[1] -0.03107317 0.76802340
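Whether this particular interval covers \(\mu = 0\) is partly a matter of luck. A further check is to estimate the coverage probability of the 89% interval by simulating the whole experiment many times: the fraction of intervals that contain the true mean should be close to 0.89. A quick sketch, re-using n, sigma, and ME from above:

covers <- map_lgl(1:1e4, \(i) {
  xbar <- mean(rnorm(n, mean = 0, sd = sigma))
  abs(xbar - 0) < ME  # TRUE if the interval xbar +/- ME contains the true mean 0
})
mean(covers)  # should be approximately 0.89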
Exercise C - (10 min)
Let \(X\) be a random variable with the following density function \[ f(x) = \left\{ \begin{array}{ll} x/2, & x \in [0, 2]\\ 0,& \text{otherwise.} \end{array} \right. \] Write code that uses the inverse transformation method to simulate iid draws from this distribution. Test to make sure that your code works correctly.
Solution
Integrating the density, the CDF of this random variable is \(F(x) = \int_0^x (t/2)\, dt = x^2/4\) for \(x \in [0, 2]\). Setting \(p = x^2/4\) and solving for \(x\) gives the quantile function \(F^{-1}(p) = 2 \sqrt{p}\).
set.seed(987654321)
triangular_sims <- 2 * sqrt(runif(1e5))

tibble(x = triangular_sims) |>
  ggplot(aes(x)) +
  geom_histogram(bins = 50)

tibble(x = triangular_sims) |>
  ggplot(aes(sample = x)) +
  geom_qq(distribution = \(p) 2 * sqrt(p))
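Another simple check uses moments: for this density, \(E(X) = \int_0^2 x \cdot (x/2)\, dx = 4/3\) and \(E(X^2) = \int_0^2 x^2 \cdot (x/2)\, dx = 2\), so \(\text{Var}(X) = 2 - 16/9 = 2/9\). The simulated draws should roughly match these values:

mean(triangular_sims) - 4/3  # should be close to zero
var(triangular_sims) - 2/9   # should be close to zero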
Exercise D - (15 min)
This problem was initially posed by the famous 17th-century gambler Antoine Gombaud, more commonly known as the Chevalier de Méré. Fermat and Pascal discussed its solution in their legendary correspondence that began the study of probability as we know it. Here’s the Chevalier’s question:
Which is more likely: (A) getting at least one six when rolling a single die four times or (B) getting at least one pair of sixes when rolling a pair of dice twenty-four times?
Answer the Chevalier’s question using Monte Carlo Simulation. Assume that all dice are fair and six-sided.
Solution
experimentA <- function() {
  rolls <- sample(1:6, size = 4, replace = TRUE)
  condition <- sum(rolls == 6) > 0
  return(condition)
}

experimentB <- function() {
  die1 <- sample(1:6, size = 24, replace = TRUE)
  die2 <- sample(1:6, size = 24, replace = TRUE)
  condition <- sum((die1 == die2) & (die1 == 6)) > 0
  return(condition)
}
nreps <- 1e5

simsA <- map_dbl(1:nreps, \(i) experimentA())
mean(simsA)
[1] 0.51659
simsB <- map_dbl(1:nreps, \(i) experimentB())
mean(simsB)
[1] 0.49192
It appears that A is slightly more likely than B. We might wonder, however, whether this is a real difference or merely a chance fluctuation that arose in our simulation. A simple way to assess this is to compute a standard error for the difference of the estimated probabilities. Let \(p_A\) be the probability of \(A\) and \(p_B\) the probability of \(B\). We have estimates \(\widehat{p}_A\) and \(\widehat{p}_B\), each obtained from an independent sample of nreps observations. Hence, we have \[
\text{SE}(\widehat{p}_A - \widehat{p}_B) = \sqrt{\frac{p_A (1 - p_A)}{\text{nreps}} + \frac{p_B(1 - p_B)}{\text{nreps}}}.
\] Unfortunately, this standard error depends on \(p_A\) and \(p_B\), which we don’t know! One option is to substitute our estimates of these values to approximate the standard error. Another is to compute the worst-case standard error by plugging in the values of \(p_A\) and \(p_B\) that make the expression above as large as possible. The actual standard error is necessarily no larger than this worst-case value. Since \(p(1 - p)\) attains its maximum at \(p = 1/2\), setting \(p_A = p_B = 1/2\) maximizes \(\text{SE}(\widehat{p}_A - \widehat{p}_B)\). Hence, the worst-case standard error is
SE_worst <- sqrt(0.5 * (1 - 0.5) / nreps + 0.5 * (1 - 0.5) / nreps)
SE_worst
[1] 0.002236068
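For comparison, the plug-in standard error mentioned above, which substitutes \(\widehat{p}_A\) and \(\widehat{p}_B\) for the unknown probabilities, is practically indistinguishable from the worst-case value here because both estimates are so close to 1/2:

SE_plugin <- sqrt(mean(simsA) * (1 - mean(simsA)) / nreps +
                    mean(simsB) * (1 - mean(simsB)) / nreps)
SE_plugin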
This is very small compared to our estimated difference. For example, we can construct a \(>99.7\%\) confidence interval for \((p_A - p_B)\) as follows:
(mean(simsA) - mean(simsB)) + c(-1, 1) * 3 * SE_worst
[1] 0.0179618 0.0313782
This means we can be extremely confident that \(p_A\) is indeed slightly larger than \(p_B\). If the confidence interval had come out too wide for our liking, we could simply go back and choose a larger value of nreps. Isn’t simulation great?!
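As a final cross-check, this particular problem is simple enough to solve exactly: \(p_A = 1 - (5/6)^4 \approx 0.518\) and \(p_B = 1 - (35/36)^{24} \approx 0.491\), in line with the simulation results above.

1 - (5/6)^4     # exact probability of at least one six in four rolls
1 - (35/36)^24  # exact probability of at least one double six in 24 rolls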
Exercise E - (10 min)
Find an alternative, faster way to carry out the “sum of two fair dice” simulation from above. Hint: think about how you can avoid writing a function that carries out the simulation only once, and instead do the whole thing in a single step. Time your alternative approach to see how much faster it is.
Solution
Use vectorized operations:
dice_sum <- \() {
  # Roll a pair of fair, six-sided dice and return their sum
  die1 <- sample(1:6, 1)
  die2 <- sample(1:6, 1)
  die1 + die2
}

version1 <- function(nreps) {
  map_dbl(1:nreps, \(i) dice_sum())
}

version2 <- function(nreps) {
  die1 <- sample(1:6, nreps, replace = TRUE)
  die2 <- sample(1:6, nreps, replace = TRUE)
  die1 + die2
}
set.seed(1234)
system.time(
  sims1 <- version1(1e5)
)
user system elapsed
0.650 0.001 0.651
system.time(
  sims2 <- version2(1e5)
)
user system elapsed
0.005 0.000 0.005
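Both versions simulate draws from the same distribution, so apart from speed they should agree. Since the expected value of the sum of two fair dice is 7, a quick check is:

c(mean(sims1), mean(sims2))  # both should be close to 7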