This tutorial has two parts. In the first part you’ll learn how to plot functions in R. In the second part you’ll learn how we can use R to study the Binomial random variable.
You’ve already seen a number of examples of the plot function in R. So far we’ve mainly plotted points, but we can actually use the same command to plot functions. The basic idea is to set up the x and y coordinates of some points that are close enough together that it looks like we’ve plotted a smooth curve. Everything on a computer is actually discrete: your eye is simply fooled into thinking that things look continuous. All curves are actually many tiny line segments!
Let’s start by plotting a few points on the curve \(f(x) = x^2\):
x <- seq(from = -1, to = 1, by = 0.5)
y <- x^2
plot(x, y)
Those points aren’t “dense” enough to approximate the whole curve. Let’s try making a finer plot:
x <- seq(from = -1, to = 1, by = 0.1)
y <- x^2
plot(x, y)
This looks better, but how about even finer?
x <- seq(from = -1, to = 1, by = 0.01)
y <- x^2
plot(x, y)
This is more like what we’re after, but we’d rather have a smooth curve than all those little “dots.” We can get one as follows:
plot(x, y, type = "l")
Plot \(f(x) = x^3\) on \([-2,2]\).
x <- seq(from = -2, to = 2, by = 0.01)
y <- x^3
plot(x, y, type = 'l')
Plot \(f(x) = \log(x)\) on \([0.5, 1.5]\).
x <- seq(from = 0.5, to = 1.5, by = 0.01)
y <- log(x)
plot(x, y, type = 'l')
Plot \(f(x) = \sin(x)\) on \([0, 6\pi]\).
x <- seq(from = 0, to = 6 * pi, by = 0.01)
y <- sin(x)
plot(x, y, type = 'l')
The command plot creates a new plot. To add to the existing plot, we could use the command lines:
x <- seq(from = 0, to = 1, by = 0.01)
y1 <- x^2
y2 <- x
plot(x, y1, col = 'blue', type = 'l')
lines(x, y2, col = 'red')
But using lines can get tricky, since the curve we add might not fit inside the plot we’ve already created:
x <- seq(from = 0, to = 1, by = 0.01)
y1 <- x^2
y2 <- x + 0.75
plot(x, y1, col = 'blue', type = 'l')
lines(x, y2, col = 'red')
We could use the xlim and ylim arguments in the first call to plot to take care of this, but it can be a pain.
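For example, here is a rough sketch of that approach for the two curves above (the y-axis limits are chosen by hand to cover both curves, so they would need to change for a different example):
x <- seq(from = 0, to = 1, by = 0.01)
y1 <- x^2
y2 <- x + 0.75
#Widen the y-axis up front so that the second curve fits
plot(x, y1, col = 'blue', type = 'l', ylim = c(0, 1.75))
lines(x, y2, col = 'red')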
The easiest way to avoid the problem, though, is to use matplot rather than plot followed by lines, since matplot will figure out the overall x- and y-axis limits automatically. To use this command we bind the y-coordinates of our two (or more) curves into a matrix. We can redo the preceding example using matplot as follows.
The function to bind vectors into a matrix is cbind – it “glues” its inputs together side by side as columns, creating a matrix whose columns are the vectors you pass in:
y <- cbind(y1, y2)
y
## y1 y2
## [1,] 0.0000 0.75
## [2,] 0.0001 0.76
## [3,] 0.0004 0.77
## [4,] 0.0009 0.78
## [5,] 0.0016 0.79
## [6,] 0.0025 0.80
## [7,] 0.0036 0.81
## [8,] 0.0049 0.82
## [9,] 0.0064 0.83
## [10,] 0.0081 0.84
## [11,] 0.0100 0.85
## [12,] 0.0121 0.86
## [13,] 0.0144 0.87
## [14,] 0.0169 0.88
## [15,] 0.0196 0.89
## [16,] 0.0225 0.90
## [17,] 0.0256 0.91
## [18,] 0.0289 0.92
## [19,] 0.0324 0.93
## [20,] 0.0361 0.94
## [21,] 0.0400 0.95
## [22,] 0.0441 0.96
## [23,] 0.0484 0.97
## [24,] 0.0529 0.98
## [25,] 0.0576 0.99
## [26,] 0.0625 1.00
## [27,] 0.0676 1.01
## [28,] 0.0729 1.02
## [29,] 0.0784 1.03
## [30,] 0.0841 1.04
## [31,] 0.0900 1.05
## [32,] 0.0961 1.06
## [33,] 0.1024 1.07
## [34,] 0.1089 1.08
## [35,] 0.1156 1.09
## [36,] 0.1225 1.10
## [37,] 0.1296 1.11
## [38,] 0.1369 1.12
## [39,] 0.1444 1.13
## [40,] 0.1521 1.14
## [41,] 0.1600 1.15
## [42,] 0.1681 1.16
## [43,] 0.1764 1.17
## [44,] 0.1849 1.18
## [45,] 0.1936 1.19
## [46,] 0.2025 1.20
## [47,] 0.2116 1.21
## [48,] 0.2209 1.22
## [49,] 0.2304 1.23
## [50,] 0.2401 1.24
## [51,] 0.2500 1.25
## [52,] 0.2601 1.26
## [53,] 0.2704 1.27
## [54,] 0.2809 1.28
## [55,] 0.2916 1.29
## [56,] 0.3025 1.30
## [57,] 0.3136 1.31
## [58,] 0.3249 1.32
## [59,] 0.3364 1.33
## [60,] 0.3481 1.34
## [61,] 0.3600 1.35
## [62,] 0.3721 1.36
## [63,] 0.3844 1.37
## [64,] 0.3969 1.38
## [65,] 0.4096 1.39
## [66,] 0.4225 1.40
## [67,] 0.4356 1.41
## [68,] 0.4489 1.42
## [69,] 0.4624 1.43
## [70,] 0.4761 1.44
## [71,] 0.4900 1.45
## [72,] 0.5041 1.46
## [73,] 0.5184 1.47
## [74,] 0.5329 1.48
## [75,] 0.5476 1.49
## [76,] 0.5625 1.50
## [77,] 0.5776 1.51
## [78,] 0.5929 1.52
## [79,] 0.6084 1.53
## [80,] 0.6241 1.54
## [81,] 0.6400 1.55
## [82,] 0.6561 1.56
## [83,] 0.6724 1.57
## [84,] 0.6889 1.58
## [85,] 0.7056 1.59
## [86,] 0.7225 1.60
## [87,] 0.7396 1.61
## [88,] 0.7569 1.62
## [89,] 0.7744 1.63
## [90,] 0.7921 1.64
## [91,] 0.8100 1.65
## [92,] 0.8281 1.66
## [93,] 0.8464 1.67
## [94,] 0.8649 1.68
## [95,] 0.8836 1.69
## [96,] 0.9025 1.70
## [97,] 0.9216 1.71
## [98,] 0.9409 1.72
## [99,] 0.9604 1.73
## [100,] 0.9801 1.74
## [101,] 1.0000 1.75
matplot(x, y, type = 'l')
Notice that if you don’t tell matplot which colors or line types to use, it will use some defaults that are designed to make it easy to tell the curves apart. You can override these as follows:
y <- cbind(y1, y2)
matplot(x, y, type = 'l', col = c("red", "blue"), lty = 1)
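If you also want labels, base R’s legend function can be called after matplot; here’s a small sketch (the position "topleft" and the label text are arbitrary choices on my part):
matplot(x, y, type = 'l', col = c("red", "blue"), lty = 1)
legend("topleft", legend = c("y1 = x^2", "y2 = x + 0.75"), col = c("red", "blue"), lty = 1)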
Plot \(\sin(x)\), \(\cos(x)\) and \(2\sin(x + \pi/4)\) on \([0,2\pi]\).
x <- seq(from = 0, to = 2 * pi, by = 0.01)
y1 <- sin(x)
y2 <- cos(x)
y3 <- 2 * sin(x + pi/4)
y <- cbind(y1, y2, y3)
matplot(x, y, type = 'l', col = c("black", "red", "blue"), lty = 1)
R has a collection of built-in functions for all of the random variables we’ll study in this course. All of these functions follow a consistent naming convention: the prefix r (“random”) is used for generating random draws, p (“probability”) is for a CDF, d (“density”) is for a pmf or pdf, and q is for quantiles (the inverse of a CDF). All you need to know to find the appropriate command for a given random variable is the necessary suffix. For the Binomial RV it’s binom, leading to the R commands rbinom (simulate draws from a Binomial RV), pbinom (CDF of a Binomial RV) and dbinom (pmf of a Binomial RV). We won’t worry about qbinom.
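As a quick illustration of the naming pattern (the particular arguments below are arbitrary; they’re just there to show all four prefixes attached to the binom suffix):
rbinom(5, size = 10, prob = 0.5)    #5 random draws from a Binomial(10, 0.5) RV
dbinom(3, size = 10, prob = 0.5)    #pmf evaluated at x = 3
pbinom(3, size = 10, prob = 0.5)    #CDF evaluated at 3, i.e. P(X <= 3)
qbinom(0.5, size = 10, prob = 0.5)  #0.5 quantile (the one we won't use)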
Here is a table of all of the many distributions available to you in R. This is just a reference – we won’t need all of these in this course.
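If you’d rather look the list up from inside R, the help page Distributions in the built-in stats package collects all of them (assuming a standard R installation):
?Distributions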
rbinom
This function takes 3 arguments: n is the number of random draws we want, size is what we call \(n\) in class, namely the underlying number of Bernoulli trials (coin flips) that we’re adding together to create the Binomial RV, and prob is \(p\) from class. So, to get 20 draws from a Binomial\((n=10, p = 1/2)\) random variable, I would use the following command:
rbinom(20, size = 10, prob = 1/2)
## [1] 5 5 3 5 2 6 9 8 3 6 5 3 4 5 6 4 6 6 3 7
Remember, the support of this random variable is \(\{0, 1, 2, \ldots, 10\}\), so each draw is a number between 0 and 10. In class we learned that the mean of a Binomial RV is \(np\) while the variance is \(np(1-p)\). We can check this for particular examples using Monte Carlo simulation. For this example, we don’t need replicate:
sims <- rbinom(100000, size = 10, prob = 1/2)
mean(sims) - (10 * 1/2)
## [1] -0.00357
var(sims) - (10 * 1/2 * 1/2)
## [1] -0.01457789
Generate 100000 draws from a Binomial\((n = 20, p = 0.9)\) random variable and use them to check the formulas for the mean and variance from class.
sims <- rbinom(100000, size = 20, prob = 0.9)
mean(sims) - (20 * 0.9)
## [1] -0.00201
var(sims) - (20 * 0.1 * 0.9)
## [1] 0.006784028
dbinom
Like rbinom, the function dbinom takes arguments size and prob and they have the same meaning. Rather than taking n as its first argument, however, it takes x, where x is the realization at which we want to evaluate the pmf. Specifically, dbinom(x, size, prob) calculates
\[ {n \choose x} p^x (1 - p)^{n-x} \]
Remember: x must be in the support. Here’s an example:
dbinom(7, size = 10, prob = 0.8)
## [1] 0.2013266
choose(10, 7) * (0.8)^7 * (0.2)^3
## [1] 0.2013266
So we get exactly the same answer with much less typing by using dbinom. Even better, dbinom can accept a vector for x. For example, we can calculate the pmf of a Binomial\((n = 10, p = 0.5)\) random variable instantly at every point in the support as follows:
support <- 0:10
p.x <- dbinom(support, size = 10, prob = 0.5)
p.x
## [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
## [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
## [11] 0.0009765625
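As a quick sanity check, these eleven probabilities should add up to one, since any pmf sums to one over its support:
sum(p.x)  #should be (essentially) 1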
This is how I made the plots of the Binomial pmf that I showed you in class:
plot(support, p.x)
To get bars rather than points, set type = 'h':
plot(0:10, p.x, type = 'h', xlab = 'x', ylab = 'p(x)',
     main = 'pmf for a Binomial(n = 10, p = 0.5) RV')
Plot the pmf of a Binomial\((n = 20, p = 0.65)\) Random Variable.
support <- 0:20
p.x <- dbinom(support, size = 20, prob = 0.65)
plot(support, p.x, type = 'h', xlab = 'x',
     ylab = 'p(x)', main = 'pmf for a Binomial(n = 20, p = 0.65) RV')
pbinom
Recall from class that the CDF of a Random Variable \(X\) is defined as follows:
\[F(x_0) = P(X \leq x_0)\]
where \(x_0\) is any real number. We showed that for a discrete RV we can get the CDF from the pmf as follows:
\[F(x_0) = \sum_{x \leq x_0} p(x)\]
where \(x \leq x_0\) is shorthand to denote all realizations \(x\) in the support of \(X\) that are less than the threshold \(x_0\). Plugging in the pmf for a Binomial RV, we have
\[F(x_0) = \sum_{x \leq x_0} {n\choose x} p^x (1 - p)^{n-x}\]
Fortunately we don’t have to calculate this by hand since there’s an R function called pbinom to do it for us. Like dbinom and rbinom, it takes arguments size and prob, and they mean the same thing. Its first argument, however, is the threshold, which R calls q rather than \(x_0\).
Let’s test it out for a Binomial\((n = 20, p = 0.3)\) random variable. We’ll do this two ways: first the “hard way” using sum and dbinom, and then the “easy way” using pbinom.
\[F(7.4) = P(X \leq 7.4) = \sum_{x \leq 7.4} p(x) = \sum_{x = 0}^7 {n\choose x} p^x (1 - p)^{n-x}\]
sum(dbinom(0:7, size = 20, prob = 0.3))
## [1] 0.7722718
pbinom(7.4, size = 20, prob = 0.3)
## [1] 0.7722718
Notice that we get exactly the same answer.
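Since the CDF of a discrete RV only jumps at points in the support, any threshold between 7 and 8 gives exactly the same answer; for instance:
pbinom(7, size = 20, prob = 0.3)    #same value as pbinom(7.4, ...) above
pbinom(7.9, size = 20, prob = 0.3)  #still the same: the next jump is at x = 8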
Evaluate the CDF of a Binomial\((n = 50, p = 0.5)\) at the threshold \(x_0 = 24.5\) two ways: first using sum and dbinom, and then using pbinom.
sum(dbinom(0:24, size = 50, prob = 0.5))
## [1] 0.4438624
pbinom(24.5, size = 50, prob = 0.5)
## [1] 0.4438624
It turns out that, like dbinom, pbinom can accept a vector as its first argument. This is how we can make plots of the CDF of a Binomial RV. For example:
x <- seq(from = -1, to = 10, by = 0.01)
y <- pbinom(x, size = 10, prob = 0.5)
plot(x, y, ylab = 'F(x)')
Note how the function “jumps” at each realization. To make this plot look nicer, we can plot “stair-steps” rather than points by setting type = 's':
plot(x, y, ylab = 'F(x)', type = 's')
Plot the CDFs of three Binomial RVs on the same graph, each with \(n = 20\). For the first set \(p = 0.2\), for the second set \(p = 0.5\), and for the third set \(p = 0.8\). Explain how and why the CDFs differ.
x <- seq(from = -1, to = 21, by = 0.01)
y1 <- pbinom(x, size = 20, prob = 0.2)
y2 <- pbinom(x, size = 20, prob = 0.5)
y3 <- pbinom(x, size = 20, prob = 0.8)
y <- cbind(y1, y2, y3)
matplot(x, y, col = c("black", "blue", "red"), type = 's', lty = 1, ylab = "F(x)")
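#All three CDFs have the same stair-step shape but shift to the right as p increases:
#each distribution is centered near np (4, 10 and 16 here), so the larger p is, the
#longer the CDF stays near 0 and the later it climbs to 1.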
Now that you know how to use R to make plots and calculate various probabilities for Binomial random variables, let’s test out your skills on some more interesting examples.
Dr. Horrible decided to give his Econ 1 students a pop quiz on Advanced Quantum Mechanics. Since he isn’t completely unreasonable he made the quiz True-or-False. Since they don’t know any Quantum Physics, Dr. Horrible’s students guess randomly on each of the 20 questions.
#Part 1: pmf of the number of questions a guessing student answers correctly out of 20
plot(0:20, dbinom(0:20, size = 20, prob = 0.5), type = 'h', ylab = 'p(x)')
#Part 2: probability of getting at least 12 of the 20 questions right by guessing
sum(dbinom(12:20, size = 20, prob = 0.5))
## [1] 0.2517223
#Part 3: probability of getting at least 18 of the 20 questions right by guessing
sum(dbinom(18:20, size = 20, prob = 0.5))
## [1] 0.0002012253
#Part 4: expected number of students who score at least 12 (and at least 18), assuming a class of 250 guessers
250 * sum(dbinom(12:20, size = 20, prob = 0.5))
## [1] 62.93058
250 * sum(dbinom(18:20, size = 20, prob = 0.5))
## [1] 0.05030632
Suppose you carry out a poll of 200 randomly selected Penn male undergraduates to find out the proportion who prefer boxers to briefs.
#Binomial(n = 200, p) random variable, where p is the unknown proportion who prefer boxers.
#If p = 0.5: probability that 90 or fewer, or 130 or more, of the 200 prefer boxers
#(i.e. the sample proportion is at most 45% or at least 65%)
1 - sum(dbinom((200 * 0.45 + 1):(200 * 0.65 - 1), size = 200, prob = 0.5))
## [1] 0.08949528
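We could get the same answer with pbinom; this is just a re-expression of the sum above and should agree up to rounding in the last digits:
#P(X <= 90) + P(X >= 130), i.e. the count lands outside 91 to 129
pbinom(90, size = 200, prob = 0.5) + (1 - pbinom(129, size = 200, prob = 0.5))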
There is a discrete random variable that we did not cover in class called the Poisson. It has one parameter \(\lambda\), which is sometimes called the “rate parameter.” Essentially, the Poisson is what you get in the limit if you take a Binomial\((n,p)\) random variable and let \(n \rightarrow \infty\) while holding \(np\) fixed at \(\lambda\).
The R functions for the Poisson RV that correspond to rbinom, dbinom and pbinom for the Binomial RV are rpois, dpois and ppois. In place of the parameters size and prob that we needed for the Binomial, these functions for the Poisson have the single parameter lambda. Otherwise, they work in the same way.
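As a quick numerical check of the limiting relationship described above (the particular values n = 1000 and p = 0.005 are arbitrary choices that make np = 5):
#Binomial pmf with a large n and a small p...
dbinom(0:5, size = 1000, prob = 0.005)
#...is very close to the Poisson pmf with lambda = np = 5
dpois(0:5, lambda = 5)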
Use rpois to simulate 10000 draws from each of the following Poisson RVs and store the results: \(\lambda = 1\), \(\lambda = 5\), \(\lambda = 10\), \(\lambda = 15\).
#Exercise 11-1
n <- 10000
sim1 <- rpois(n, lambda = 1)
sim5 <- rpois(n, lambda = 5)
sim10 <- rpois(n, lambda = 10)
sim15 <- rpois(n, lambda = 15)
#Exercise 11-2
mean(sim1)
## [1] 1.0146
mean(sim5)
## [1] 5.0275
mean(sim10)
## [1] 9.9718
mean(sim15)
## [1] 14.9692
#Exercise 11-3
var(sim1)
## [1] 1.016488
var(sim5)
## [1] 5.047849
var(sim10)
## [1] 10.04881
var(sim15)
## [1] 15.16577
#Exercise 11-4
range(sim1)
## [1] 0 7
table(sim1)
## sim1
## 0 1 2 3 4 5 6 7
## 3625 3705 1807 665 163 31 3 1
range(sim5)
## [1] 0 15
table(sim5)
## sim5
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 68 350 873 1309 1724 1735 1476 1119 661 370 166 96 35 15 2
## 15
## 1
range(sim10)
## [1] 1 23
table(sim10)
## sim10
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 3 31 67 183 380 699 910 1106 1229 1263 1124 911 734 557 319
## 16 17 18 19 20 21 22 23
## 209 132 71 30 28 9 4 1
range(sim15)
## [1] 3 33
table(sim15)
## sim15
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 4 6 29 54 102 202 330 479 704 848 911 986 1016 947 844
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 33
## 704 589 464 276 188 120 88 50 32 12 8 2 2 2 1
#Exercise 11-5
x <- -1:15
plot(x, dpois(x, lambda = 2), type = 'h', ylab = 'p(x)')
plot(x, ppois(x, lambda = 2), type = 's', ylab = 'F(x)')