This tutorial has two parts. The first shows you how to calculate the remaining summary statistics from the lectures and introduces a few new plotting commands. The second describes one of the most fundamental parts of R: creating your own functions. Rather than appearing at the end, the exercises are scattered throughout both parts of this tutorial.
First we’ll load the survey data from R Tutorial #2.
Like before, we’ll only need a subset of the available columns. Rather than go to the hassle of deleting the unwanted columns afterwards, we can specify which columns to keep at the same time as we read in the data, using the select argument to fread. (There is also a drop argument for the reverse: specifying columns to exclude.)
library(data.table)
survey = fread("http://www.ditraglia.com/econ103/old_survey.csv",
select = c("handedness", "height", "handspan"))
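If it’s easier to list the columns you want to exclude, drop works the same way. Here’s a minimal sketch (the object name survey_rest is just for illustration, and drop = 1 refers to whichever column happens to come first in the file; drop also accepts a vector of column names):
# Keep everything EXCEPT the first column (drop also accepts column names)
survey_rest = fread("http://www.ditraglia.com/econ103/old_survey.csv",
                    drop = 1)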
cor
This command calculates sample correlation coefficients. If you pass it two vectors of the same length it returns the correlation between them. If you feed it a data.table or matrix it returns a matrix containing all pairwise correlations between the columns. (We haven’t covered matrices in R since you’ll only need to interpret them as output for this class; for a primer on handling matrix objects, see here.) Unlike the univariate summary statistics, in which we used na.rm = TRUE to ignore missing observations, for cor we use the argument use = "complete.obs". The reason for this difference is that cor can handle a matrix or data.table as its input, and in that case there are many different ways of handling missing observations. For more details, see ?cor. Setting use = "complete.obs" removes any rows with missing observations before proceeding.
survey[ , cor(handspan, height, use = "complete.obs")]
## [1] 0.6042423
Of course, we could also use the approach we saw in R Tutorial #2; use whichever feels more natural:
survey[!is.na(handspan) & !is.na(height), cor(handspan, height)]
## [1] 0.6042423
We can also feed the full data.table to cor like so:
cor(survey, use = "complete.obs")
## handedness height handspan
## handedness 1.00000000 0.1125605 -0.01719909
## height 0.11256055 1.0000000 0.60942759
## handspan -0.01719909 0.6094276 1.00000000
# Alternatively, there's the na.omit function, which does just what it says:
cor(na.omit(survey))
## handedness height handspan
## handedness 1.00000000 0.1125605 -0.01719909
## height 0.11256055 1.0000000 0.60942759
## handspan -0.01719909 0.6094276 1.00000000
We see that there is a large positive correlation between height and handspan and a slight positive correlation between height and handedness. There’s basically no correlation between handedness and handspan. The preceding matrix has ones on its main diagonal since the correlation between any variable and itself is identically one. (A good exercise for extra practice would be to prove this assertion using the formula for correlation from class. It’s not very difficult.)
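You can also check this numerically, although that isn’t a substitute for the proof. Dropping the missing observations first, the correlation of height with itself should come out as exactly one:
# Correlation of a variable with itself equals 1
survey[!is.na(height), cor(height, height)]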
If you look carefully, you’ll notice that the correlation between height and handspan is slightly different when calculated at the same time as all the other correlations. This is because of the way that use = "complete.obs" drops rows. In this case, it indicates that there is at least one observation in our dataset for which we have height and handspan but not handedness.
For practice, we can find this with judicious use of is.na and logical operators:
survey[is.na(handedness) & !is.na(height) & !is.na(handspan)]
## handedness height handspan
## 1: NA 70 19.5
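Another option described in ?cor is use = "pairwise.complete.obs", which drops missing observations separately for each pair of columns rather than for the data.table as a whole. With this setting, the height/handspan entry of the matrix should match the value we calculated from the two vectors on their own. A sketch:
# Handle missing values pair-by-pair rather than dropping whole rows up front
cor(survey, use = "pairwise.complete.obs")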
cov
This command works just like cor but returns covariances rather than correlations:
survey[ , cov(handspan, height, use = "complete.obs")]
## [1] 5.910786
cov(survey, use = "complete.obs")
## handedness height handspan
## handedness 0.1685042 0.206959 -0.0155499
## height 0.2069590 20.062517 6.0121751
## handspan -0.0155499 6.012175 4.8510260
cov(na.omit(survey))
## handedness height handspan
## handedness 0.1685042 0.206959 -0.0155499
## height 0.2069590 20.062517 6.0121751
## handspan -0.0155499 6.012175 4.8510260
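As a quick sanity check, a correlation is just a covariance divided by the product of the two standard deviations. Here’s a sketch of verifying this for handspan and height, using na.omit first so both calculations use exactly the same rows (the object name complete is just for illustration):
# cor(x, y) = cov(x, y) / (sd(x) * sd(y)), computed on the complete cases
complete = na.omit(survey)
complete[ , cov(handspan, height) / (sd(handspan) * sd(height))]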
lm and abline
This command stands for “linear model” and is R’s general-purpose regression command. Its syntax is similar to that of boxplot from R Tutorial #2 in that we use a tilde (~) to indicate a functional relationship. Here the variable to the left of the tilde is the “y” variable, while the variable to the right is the “x” variable.
survey[ , lm(height ~ handspan)]
##
## Call:
## lm(formula = height ~ handspan)
##
## Coefficients:
## (Intercept) handspan
## 42.251 1.229
Remember: unlike correlation and covariance, regression is not symmetric:
survey[ , lm(handspan ~ height)]
##
## Call:
## lm(formula = handspan ~ height)
##
## Coefficients:
## (Intercept) height
## 0.5305 0.2970
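If you want to do more with the slope and intercept than just print them, you can store the result of lm and extract them with coef, which returns a named vector. For example (the object name reg is just for illustration):
# Store the regression of handspan on height and pull out its coefficients
reg = survey[ , lm(handspan ~ height)]
coef(reg)            # named vector with elements "(Intercept)" and "height"
coef(reg)["height"]  # just the slope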
It turns out that we can use the same syntax with the command plot:
survey[ , plot(handspan ~ height)]
To add the regression line, we follow the plot command with the function abline like so:
survey[ , plot(handspan ~ height)]
survey[ , abline(lm(handspan ~ height))]
Note that abline can only be used after you’ve already made a plot. It adds a line to the existing plot rather than making a plot of its own.
You can also use abline to plot different kinds of lines. For example, we can show that the regression line goes through the means of the data as follows:
survey[ , {
plot(handspan ~ height)
abline(lm(handspan ~ height))
abline(v = mean(height, na.rm = TRUE),
h = mean(handspan, na.rm = TRUE),
col = 'red', lty = 2)
}]
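Finally, abline can also draw a line directly from an intercept a and a slope b. This gives another way to add the regression line, using the coefficients extracted with coef as above (a sketch; the object name reg_coefs is just for illustration):
# Re-draw the scatterplot, then add the regression line from its intercept and slope
reg_coefs = coef(survey[ , lm(handspan ~ height)])
survey[ , plot(handspan ~ height)]
abline(a = reg_coefs[1], b = reg_coefs[2])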