This tutorial has two parts. In the first part you’ll learn how to plot functions in R. In the second part you’ll learn how to use R to study the Binomial random variable.

Plotting Curves in R

You’ve already seen a number of examples of the plot function in R. So far we’ve mainly plotted points, but we can use the same command to plot functions. The basic idea is to set up the x and y coordinates of points that are close enough together that the result looks like a smooth curve. Everything on a computer is discrete: your eye is simply fooled into thinking that things look continuous. Every curve you see on screen is really many tiny line segments!

Simple Example

Let’s start by plotting a few points on the curve \(f(x) = x^2\)

x <- seq(from = -1, to = 1, by = 0.5)
y <- x^2
plot(x, y)

Those points aren’t “dense” enough to approximate the whole curve. Let’s try making a finer plot:

x <- seq(from = -1, to = 1, by = 0.1)
y <- x^2
plot(x, y)

This looks better, but how about even finer?

x <- seq(from = -1, to = 1, by = 0.01)
y <- x^2
plot(x, y)

This is more like what we’re after, but we’d rather have a smooth curve than all those little “dots.” We can do this as follows:

plot(x, y, type = "l")

Exercise #1

Plot \(f(x) = x^3\) on \([-2,2]\).

x <- seq(from = -2, to = 2, by = 0.01)
y <- x^3
plot(x, y, type = 'l')

Exercise #2

Plot \(f(x) = \log(x)\) on \([0.5, 1.5]\).

x <- seq(from = 0.5, to = 1.5, by = 0.01)
y <- log(x)
plot(x, y, type = 'l')

Exercise #3

Plot \(f(x) = \sin(x)\) on \([0, 6\pi]\).

x <- seq(from = 0, to = 6 * pi, by = 0.01)
y <- sin(x)
plot(x, y, type = 'l')

Multiple Curves

The command plot creates a new plot. To add to the existing plot, we could use the command lines:

x <- seq(from = 0, to = 1, by = 0.01)
y1 <- x^2
y2 <- x
plot(x, y1, col = 'blue', type = 'l')
lines(x, y2, col = 'red')

But using lines can get tricky since the curve we add might not fit inside the plot we’ve already created:

x <- seq(from = 0, to = 1, by = 0.01)
y1 <- x^2
y2 <- x + 0.75
plot(x, y1, col = 'blue', type = 'l')
lines(x, y2, col = 'red')

We could use the xlim and ylim arguments to the first call to plot to take care of this, but this can be a pain. The easiest way to avoid this is to use matplot rather than plot followed by lines, since matplot will figure out the overall x- and y-axis limits automatically. To use this command we bind the y-coordinates of our two (or more) curves into a matrix. We could redo the preceding example using matplot as follows.
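
For instance, here is a sketch of the xlim/ylim approach applied to the example above; the limits in ylim are simply chosen by hand to be wide enough for both curves:

x <- seq(from = 0, to = 1, by = 0.01)
y1 <- x^2
y2 <- x + 0.75
# make the y-axis wide enough for both curves before adding the second one
plot(x, y1, col = 'blue', type = 'l', ylim = c(0, 1.75))
lines(x, y2, col = 'red')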

The function to bind vectors into a matrix is cbind – it will “glue” its inputs together side-by-side to create a matrix whose columns are the vectors you supplied:

y <- cbind(y1, y2)
y
##            y1   y2
##   [1,] 0.0000 0.75
##   [2,] 0.0001 0.76
##   [3,] 0.0004 0.77
##   [4,] 0.0009 0.78
##   [5,] 0.0016 0.79
##   [6,] 0.0025 0.80
##   [7,] 0.0036 0.81
##   [8,] 0.0049 0.82
##   [9,] 0.0064 0.83
##  [10,] 0.0081 0.84
##  [11,] 0.0100 0.85
##  [12,] 0.0121 0.86
##  [13,] 0.0144 0.87
##  [14,] 0.0169 0.88
##  [15,] 0.0196 0.89
##  [16,] 0.0225 0.90
##  [17,] 0.0256 0.91
##  [18,] 0.0289 0.92
##  [19,] 0.0324 0.93
##  [20,] 0.0361 0.94
##  [21,] 0.0400 0.95
##  [22,] 0.0441 0.96
##  [23,] 0.0484 0.97
##  [24,] 0.0529 0.98
##  [25,] 0.0576 0.99
##  [26,] 0.0625 1.00
##  [27,] 0.0676 1.01
##  [28,] 0.0729 1.02
##  [29,] 0.0784 1.03
##  [30,] 0.0841 1.04
##  [31,] 0.0900 1.05
##  [32,] 0.0961 1.06
##  [33,] 0.1024 1.07
##  [34,] 0.1089 1.08
##  [35,] 0.1156 1.09
##  [36,] 0.1225 1.10
##  [37,] 0.1296 1.11
##  [38,] 0.1369 1.12
##  [39,] 0.1444 1.13
##  [40,] 0.1521 1.14
##  [41,] 0.1600 1.15
##  [42,] 0.1681 1.16
##  [43,] 0.1764 1.17
##  [44,] 0.1849 1.18
##  [45,] 0.1936 1.19
##  [46,] 0.2025 1.20
##  [47,] 0.2116 1.21
##  [48,] 0.2209 1.22
##  [49,] 0.2304 1.23
##  [50,] 0.2401 1.24
##  [51,] 0.2500 1.25
##  [52,] 0.2601 1.26
##  [53,] 0.2704 1.27
##  [54,] 0.2809 1.28
##  [55,] 0.2916 1.29
##  [56,] 0.3025 1.30
##  [57,] 0.3136 1.31
##  [58,] 0.3249 1.32
##  [59,] 0.3364 1.33
##  [60,] 0.3481 1.34
##  [61,] 0.3600 1.35
##  [62,] 0.3721 1.36
##  [63,] 0.3844 1.37
##  [64,] 0.3969 1.38
##  [65,] 0.4096 1.39
##  [66,] 0.4225 1.40
##  [67,] 0.4356 1.41
##  [68,] 0.4489 1.42
##  [69,] 0.4624 1.43
##  [70,] 0.4761 1.44
##  [71,] 0.4900 1.45
##  [72,] 0.5041 1.46
##  [73,] 0.5184 1.47
##  [74,] 0.5329 1.48
##  [75,] 0.5476 1.49
##  [76,] 0.5625 1.50
##  [77,] 0.5776 1.51
##  [78,] 0.5929 1.52
##  [79,] 0.6084 1.53
##  [80,] 0.6241 1.54
##  [81,] 0.6400 1.55
##  [82,] 0.6561 1.56
##  [83,] 0.6724 1.57
##  [84,] 0.6889 1.58
##  [85,] 0.7056 1.59
##  [86,] 0.7225 1.60
##  [87,] 0.7396 1.61
##  [88,] 0.7569 1.62
##  [89,] 0.7744 1.63
##  [90,] 0.7921 1.64
##  [91,] 0.8100 1.65
##  [92,] 0.8281 1.66
##  [93,] 0.8464 1.67
##  [94,] 0.8649 1.68
##  [95,] 0.8836 1.69
##  [96,] 0.9025 1.70
##  [97,] 0.9216 1.71
##  [98,] 0.9409 1.72
##  [99,] 0.9604 1.73
## [100,] 0.9801 1.74
## [101,] 1.0000 1.75
matplot(x, y, type = 'l')

Notice that if you don’t tell matplot which colors or line types to use, it will use some defaults that are designed to make it easy to tell the curves apart. You can override these as follows:

y <- cbind(y1, y2)
matplot(x, y, type = 'l', col = c("red", "blue"), lty = 1)
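
If you want to label the curves, base R’s legend function adds a key to an existing plot. Here is a minimal sketch; the position "topleft" and the labels are just illustrative choices:

matplot(x, y, type = 'l', col = c("red", "blue"), lty = 1)
# add a key identifying each curve by color
legend("topleft", legend = c("x^2", "x + 0.75"), col = c("red", "blue"), lty = 1)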

Exercise #4

Plot \(\sin(x)\), \(\cos(x)\) and \(2\sin(x + \pi/4)\) on \([0,2\pi]\).

x <- seq(from = 0, to = 2 * pi, by = 0.01)
y1 <- sin(x)
y2 <- cos(x)
y3 <- 2 * sin(x + pi/4)
y <- cbind(y1, y2, y3)
matplot(x, y, type = 'l', col = c("black", "red", "blue"), lty = 1)

The Binomial RV

R has a collection of built-in functions for all of the random variables we’ll study in this course. All of these functions follow a consistent naming convention: the prefix r (“random”) is used for generating random draws, p (“probability”) is for a CDF, d (“density”) is for a pmf or pdf, and q is for quantiles (the inverse of a CDF). All you need to know to find the appropriate command for a given random variable is the corresponding suffix. For the Binomial RV it’s binom, leading to the R commands rbinom (simulate draws from a Binomial RV), pbinom (CDF of a Binomial RV) and dbinom (pmf of a Binomial RV). We won’t worry about qbinom.
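
For example, the Normal distribution uses the suffix norm, so the same four prefixes give rnorm, dnorm, pnorm and qnorm. The calls below are shown only to illustrate the naming pattern; we won’t need the Normal functions yet:

rnorm(3, mean = 0, sd = 1)     # three random draws from a standard Normal
dnorm(0, mean = 0, sd = 1)     # density at 0, about 0.399
pnorm(1.96, mean = 0, sd = 1)  # CDF at 1.96, about 0.975
qnorm(0.975, mean = 0, sd = 1) # quantile function (inverse CDF), about 1.96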

Here is a table of the many distributions available in R’s built-in stats package. This is just a reference – we won’t need all of these in this course.

List of native distributions from the built-in stats package

Distribution                   | RNG function | Other parameters     | Wikipedia
Beta                           | rbeta        | shape1, shape2, ncp  | https://en.wikipedia.org/wiki/Beta_distribution
Binomial                       | rbinom       | size, prob           | https://en.wikipedia.org/wiki/Binomial_distribution
Cauchy                         | rcauchy      | location, scale      | https://en.wikipedia.org/wiki/Cauchy_distribution
Chi-Squared                    | rchisq       | df, ncp              | https://en.wikipedia.org/wiki/Chi-squared_distribution
Exponential                    | rexp         | rate                 | https://en.wikipedia.org/wiki/Exponential_distribution
F                              | rf           | df1, df2, ncp        | https://en.wikipedia.org/wiki/F-distribution
Gamma                          | rgamma       | shape, rate, scale   | https://en.wikipedia.org/wiki/Gamma_distribution
Geometric                      | rgeom        | prob                 | https://en.wikipedia.org/wiki/Geometric_distribution
Hypergeometric                 | rhyper       | m, n, k              | https://en.wikipedia.org/wiki/Hypergeometric_distribution
Log-Normal                     | rlnorm       | meanlog, sdlog       | https://en.wikipedia.org/wiki/Log-normal_distribution
Logistic                       | rlogis       | location, scale      | https://en.wikipedia.org/wiki/Logistic_distribution
Multinomial                    | rmultinom    | size, prob           | https://en.wikipedia.org/wiki/Multinomial_distribution
Negative Binomial              | rnbinom      | size, prob, mu       | https://en.wikipedia.org/wiki/Negative_binomial_distribution
Poisson                        | rpois        | lambda               | https://en.wikipedia.org/wiki/Poisson_distribution
Wilcoxon Signed Rank Statistic | rsignrank    | n                    | https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
Student t                      | rt           | df, ncp              | https://en.wikipedia.org/wiki/Student%27s_t-distribution
Weibull                        | rweibull     | shape, scale         | https://en.wikipedia.org/wiki/Weibull_distribution
Wilcoxon Rank Sum Statistic    | rwilcox      | m, n                 | https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
Wishart                        | rWishart     | df, Sigma            | https://en.wikipedia.org/wiki/Wishart_distribution
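
If you ever need more detail about one of these distributions and its parameters, R’s built-in help has a page for each family; for example:

?rbinom               # documentation for dbinom, pbinom, qbinom and rbinom
help("Distributions") # an index of the distributions in the stats package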

Simulate Binomial Draws: rbinom

This function takes 3 arguments: n is the number of random draws we want, size is what we call \(n\) in class, namely the underlying number of Bernoulli trials (coin flips) that we’re adding together to create the Binomial RV, and prob is \(p\) from class. So, to get 20 draws from a Binomial\((n=10, p = 1/2)\) random variable, I would use the following command:

rbinom(20, size = 10, prob = 1/2)
##  [1] 5 5 3 5 2 6 9 8 3 6 5 3 4 5 6 4 6 6 3 7

Remember, the support of this random variable is \(\{0, 1, 2, ..., 10\}\) so each draw is a number between 0 and 10. In class we learned that the mean of a Binomial RV is \(np\) while the variance is \(np(1-p)\). We can check this for particular examples using Monte Carlo simulation. For this example, we don’t need replicate:

sims <- rbinom(100000, size = 10, prob = 1/2)
mean(sims) - (10 * 1/2)
## [1] -0.00357
var(sims) - (10 * 1/2 * 1/2)
## [1] -0.01457789
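
If you did want to use replicate, the following sketch would give equivalent results; it is just slower, since it calls rbinom 100000 times to produce one draw each rather than once to produce all of them:

# equivalent Monte Carlo check using replicate (one draw per call)
sims2 <- replicate(100000, rbinom(1, size = 10, prob = 1/2))
mean(sims2) - (10 * 1/2)
var(sims2) - (10 * 1/2 * 1/2)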

Exercise #5

Generate 100000 draws from a Binomial\((n = 20, p = 0.9)\) random variable and use them to check the formula for the mean and variance from class.

sims <- rbinom(100000, size = 20, prob = 0.9)
mean(sims) - (20 * 0.9)
## [1] -0.00201
var(sims) - (20 * 0.1 * 0.9)
## [1] 0.006784028

Binomial pmf: dbinom

Like rbinom, the function dbinom takes arguments size and prob and they have the same meaning. Rather than taking n as its first argument, however, it takes x, where x is the realization at which we want to evaluate the pmf. Specifically, dbinom(x, size, prob) calculates \[ {n \choose x} p^x (1 - p)^{n-x} \]

Remember: x must be in the support. Here’s an example:

dbinom(7, size = 10, prob = 0.8)
## [1] 0.2013266
choose(10, 7) * (0.8)^7 * (0.2)^3
## [1] 0.2013266

so we get exactly the same answer with much less typing by using dbinom. Even better, dbinom can accept a vector for x. For example, we can calculate the pmf of a Binomial\((n = 10, p = 0.5)\) random variable instantly at every point in the support as follows:

support <- 0:10
p.x <- dbinom(support, size = 10, prob = 0.5)
p.x
##  [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
##  [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
## [11] 0.0009765625

This is how I made the plots of the Binomial pmf that I showed you in class:

plot(support, p.x)

To get bars rather than points, set type = 'h':

plot(0:10, p.x, type = 'h', xlab = 'x', ylab = 'p(x)',
     main = 'pmf for a Binomial(n = 10, p = 0.5) RV')

Exercise #6

Plot the pmf of a Binomial\((n = 20, p = 0.65)\) Random Variable.

support <- 0:20
p.x <- dbinom(support, size = 20, prob = 0.65)
plot(support, p.x, type = 'h', xlab = 'x',
     ylab = 'p(x)', main = 'pmf for a Binomial(n = 20, p = 0.65) RV')

Binomial CDF: pbinom

Recall from class that the CDF of a Random Variable \(X\) is defined as follows:

\[F(x_0) = P(X \leq x_0)\]

where \(x_0\) is any real number. We showed that for a discrete RV we can get the CDF from the pmf as follows:

\[F(x_0) = \sum_{x \leq x_0} p(x)\]

where \(x \leq x_0\) is shorthand to denote all realizations \(x\) in the support of \(X\) that are less than the threshold \(x_0\). Plugging in the pmf for a Binomial RV, we have

\[F(x_0) = \sum_{x \leq x_0} {n\choose x} p^x (1 - p)^{n-x}\]

Fortunately we don’t have to calculate this by hand since there’s an R function called pbinom to do it for us. Like dbinom and rbinom it takes arguments size and prob, and they mean the same thing. Its first argument, however, is the threshold, which R calls q rather than \(x_0\).

Let’s test it out for a Binomial\((n = 20, p = 0.3)\) random variable. We’ll do this two ways: first the “hard way” using sum and dbinom, and then the “easy way” using pbinom:

\[F(7.4) = P(X \leq 7.4) = \sum_{x \leq 7.4} p(x) = \sum_{x = 0}^7 {n\choose x} p^x (1 - p)^{n-x}\]

sum(dbinom(0:7, size = 20, prob = 0.3))
## [1] 0.7722718
pbinom(7.4, size = 20, prob = 0.3)
## [1] 0.7722718

Notice that we get exactly the same answer.

Exercise #7

Evaluate the CDF of a Binomial\((n = 50, p = 0.5)\) at the threshold \(x_0 = 24.5\) two ways: first using sum and dbinom and then using pbinom.

sum(dbinom(0:24, size = 50, prob = 0.5))
## [1] 0.4438624
pbinom(24.5, size = 50, prob = 0.5)
## [1] 0.4438624

It turns out that, like dbinom, pbinom can accept a vector as its first argument. This is how we can make plots of the CDF of a Binomial RV. For example:

x <- seq(from = -1, to = 10, by = 0.01)
y <- pbinom(x, size = 10, prob = 0.5)
plot(x, y, ylab = 'F(x)')

Note how the function “jumps” at each realization. To make this plot look nicer, we can plot “stair-steps” rather than points by setting type = 's':

plot(x, y, ylab = 'F(x)', type = 's')

Exercise #8

Plot the CDFs of three Binomial RVs on the same graph, each with \(n = 20\). For the first set \(p = 0.2\), for the second set \(p = 0.5\) and for the third set \(p=0.8\). Explain how and why the CDFs differ.

x <- seq(from = -1, to = 21, by = 0.01)
y1 <- pbinom(x, size = 20, prob = 0.2)
y2 <- pbinom(x, size = 20, prob = 0.5)
y3 <- pbinom(x, size = 20, prob = 0.8)
y <- cbind(y1, y2, y3)
matplot(x, y, col = c("black", "blue", "red"), type = 's', lty = 1, ylab = "F(x)")
# The three CDFs have the same basic shape but are shifted: a larger p puts more
# probability on higher realizations, so the p = 0.8 CDF stays near 0 longer and only
# climbs toward 1 around x = 16, while the p = 0.2 CDF rises quickly around x = 4.

Now that you know how to use R to make plots and calculate various probabilities for Binomial random variables, let’s test your skills on some more interesting examples.

Exercise #9

Dr. Horrible decided to give his Econ 1 students a pop quiz on Advanced Quantum Mechanics. Since he isn’t completely unreasonable, he made the quiz True-or-False. Because they don’t know any Quantum Physics, Dr. Horrible’s students guess randomly on each of the 20 questions.

  1. An individual student’s score on this quiz can be modeled as the realization of a random variable. What random variable? Plot its pmf.
  2. Suppose that a passing grade on the quiz is a 60%. What is the probability that a given student passes?
  3. Suppose that anything over 90% is an A. What is the probability that an individual student gets an A?
  4. If the class has 250 students, approximately how many will pass the quiz? Approximately how many will get an A?
#Part 1: the score is a Binomial(n = 20, p = 0.5) random variable, since it counts
#the number of correct answers out of 20 independent guesses, each right with probability 1/2.
plot(0:20, dbinom(0:20, size = 20, prob = 0.5), type = 'h', ylab = 'p(x)')

#Part 2
sum(dbinom(12:20, size = 20, prob = 0.5))
## [1] 0.2517223
#Part 3
sum(dbinom(18:20, size = 20, prob = 0.5))
## [1] 0.0002012253
#Part 4
250 * sum(dbinom(12:20, size = 20, prob = 0.5))
## [1] 62.93058
250 * sum(dbinom(18:20, size = 20, prob = 0.5))
## [1] 0.05030632

Exercise #10

Suppose you carry out a poll of 200 randomly selected Penn male undergraduates to find out the proportion who prefer boxers to briefs.

  1. The number of people in your sample who prefer boxers can be viewed as the realization of a random variable. What random variable?
  2. Suppose that the true proportion who prefer boxers to briefs among all male Penn undergraduates is 0.5. What is the probability that the proportion in your sample who prefer boxers will fall outside the range \((0.45, 0.65)\)?
#Binomial(n = 200, p) random variable where p is the unknown proportion who prefer boxers.
1 - sum(dbinom((200 * 0.45 + 1):(200 * 0.65 - 1), size = 200, prob = 0.5))
## [1] 0.08949528
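
Equivalently, since the sample proportion falls outside \((0.45, 0.65)\) exactly when \(X \leq 90\) or \(X \geq 130\), we could compute the same probability with pbinom, which gives the same answer as above:

pbinom(90, size = 200, prob = 0.5) + (1 - pbinom(129, size = 200, prob = 0.5))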

Exercise #11

There is a discrete random variable that we did not cover in class called the Poisson. It has one parameter \(\lambda\) which is sometimes called the “rate parameter”. Essentially, the Poisson is what you get in the limit if you take a Binomial\((n,p)\) random variable and let \(n \rightarrow \infty\) while holding \(np\) fixed.

The R functions for the Poisson RV that correspond to rbinom, dbinom and pbinom for the Binomial RV are rpois, dpois and ppois. In place of the parameters size and prob that we needed for the Binomial, these functions for the Poisson have the single parameter lambda. Otherwise, they work in the same way.
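
To see this limiting relationship concretely, here is a small sketch comparing the pmf of a Binomial with large \(n\) and small \(p\) to the Poisson pmf with \(\lambda = np\); the choices \(n = 1000\) and \(\lambda = 2\) are just for illustration, and the two columns should be nearly identical:

# Binomial(1000, 2/1000) pmf next to Poisson(2) pmf at x = 0, 1, ..., 10
k <- 0:10
cbind(binomial = dbinom(k, size = 1000, prob = 2/1000),
      poisson = dpois(k, lambda = 2))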

  1. Use rpois to simulate 10000 draws from each of the following Poisson RVs and store the results: \(\lambda = 1, \lambda = 5, \lambda = 10, \lambda = 15\).
  2. Using your simulations from 1, what do you think the mean of a Poisson RV is in terms of \(\lambda\)?
  3. Using your simulations from 1, what do you think the variance of a Poisson RV is in terms of \(\lambda\)?
  4. Based on my description above and your simulation draws, what do you think the support of a Poisson RV is?
  5. Plot the pmf and CDF of a Poisson\((\lambda = 2)\) random variable on the range \([-1, 15]\).
#Exercise 11-1
n <- 10000
sim1 <-  rpois(n, lambda = 1)
sim5 <- rpois(n, lambda = 5)
sim10 <- rpois(n, lambda = 10)
sim15 <- rpois(n, lambda = 15)
#Exercise 11-2: the sample means below are all close to lambda, suggesting
#that the mean of a Poisson RV equals lambda.
mean(sim1)
## [1] 1.0146
mean(sim5)
## [1] 5.0275
mean(sim10)
## [1] 9.9718
mean(sim15)
## [1] 14.9692
#Exercise 11-3: the sample variances are also close to lambda, suggesting
#that the variance of a Poisson RV also equals lambda.
var(sim1)
## [1] 1.016488
var(sim5)
## [1] 5.047849
var(sim10)
## [1] 10.04881
var(sim15)
## [1] 15.16577
#Exercise 11-4: the simulated draws are all non-negative integers with no fixed
#upper bound, suggesting that the support is {0, 1, 2, ...}.
range(sim1)
## [1] 0 7
table(sim1)
## sim1
##    0    1    2    3    4    5    6    7 
## 3625 3705 1807  665  163   31    3    1
range(sim5)
## [1]  0 15
table(sim5)
## sim5
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##   68  350  873 1309 1724 1735 1476 1119  661  370  166   96   35   15    2 
##   15 
##    1
range(sim10)
## [1]  1 23
table(sim10)
## sim10
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##    3   31   67  183  380  699  910 1106 1229 1263 1124  911  734  557  319 
##   16   17   18   19   20   21   22   23 
##  209  132   71   30   28    9    4    1
range(sim15)
## [1]  3 33
table(sim15)
## sim15
##    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
##    4    6   29   54  102  202  330  479  704  848  911  986 1016  947  844 
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   33 
##  704  589  464  276  188  120   88   50   32   12    8    2    2    2    1
#Exercise 11-5
x <- -1:15
plot(x, dpois(x, lambda = 2), type = 'h', ylab = 'p(x)')

plot(x, ppois(x, lambda = 2), type = 's', ylab = 'F(x)')