This tutorial has two parts. The first part shows you how to calculate the remaining summary statistics from the lectures and introduces a few new plotting commands. The second part describes one of the most fundamental parts of R: creating your own functions. Rather than appearing at the end, several exercises are scattered throughout the parts of this tutorial.
First we’ll load the survey data from R Tutorial #2.
Like before we’ll only need a subset of the columns available; rather than go to the hassle of deleting these columns afterwards, we can specify which columns to keep simultaneously with reading in the data using the select
argument to fread
. (There is also a drop
argument for the reverse – specifying columns to exclude)
library(data.table)
survey = fread("http://www.ditraglia.com/econ103/old_survey.csv",
select = c("handedness", "height", "handspan"))
cor
This command calculates sample correlation coefficients. If you pass it two vectors of the same length it returns the correlation between them. If you feed it a data.table
or matrix
it returns a matrix containing all pairwise correlations between the columns (we haven’t covered matrices in R since you’ll only need to interpret them as output for this class – for a primer on handling matrix
objects, see here). Unlike the univariate summary statistics, in which we used na.rm = TRUE
to ignore missing observations, for cor
we use the argument use = "complete.obs"
. The reason for this difference is that cor
can handle a matrix
or data.table
as its input. In this case there are actually many different ways of handling missing observations. For more details, see ?cor
. Setting use = "complete.obs"
removes any rows with missing observations before proceeding.
survey[ , cor(handspan, height, use = "complete.obs")]
## [1] 0.6042423
Of course, we could use the approach we saw in R Tutorial #2 as well; whichever feels more natural:
survey[!is.na(handspan) & !is.na(height), cor(handspan, height)]
## [1] 0.6042423
We can also feed the full data.table
to cor
like so:
cor(survey, use = "complete.obs")
## handedness height handspan
## handedness 1.00000000 0.1125605 -0.01719909
## height 0.11256055 1.0000000 0.60942759
## handspan -0.01719909 0.6094276 1.00000000
#alternatively, there's the na.omit
# function, which does just what it says:
cor(na.omit(survey))
## handedness height handspan
## handedness 1.00000000 0.1125605 -0.01719909
## height 0.11256055 1.0000000 0.60942759
## handspan -0.01719909 0.6094276 1.00000000
We see that there is a large positive correlation between height
and handspan
and a slight positive correlation between height
and handedness.
There’s basically no correlation between handedness
and handspan
. The preceding matrix has ones on its main diagonal since the correlation between any variable and itself is identically one. (A good exercise for extra practice would be to prove this assertion using the formula for correlation from class. It’s not very difficult.)
If you look carefully, you’ll notice that the correlation between height
and handspan
is slightly different when calculated at the same time as all the other correlations. This is because of the way that use = "complete.obs"
drops rows. In this case, it indicates that there is at least one observation in our dataset for which we have height
and handspan
but not handedness
.
For practice, we can find this with judicious use of is.na
and logical operators:
survey[is.na(handedness) & !is.na(height) & !is.na(handspan)]
## handedness height handspan
## 1: NA 70 19.5
cov
This command works just like cor
but returns covariances rather than correlations:
survey[ , cov(handspan, height, use = "complete.obs")]
## [1] 5.910786
cov(survey, use = "complete.obs")
## handedness height handspan
## handedness 0.1685042 0.206959 -0.0155499
## height 0.2069590 20.062517 6.0121751
## handspan -0.0155499 6.012175 4.8510260
cov(na.omit(survey))
## handedness height handspan
## handedness 0.1685042 0.206959 -0.0155499
## height 0.2069590 20.062517 6.0121751
## handspan -0.0155499 6.012175 4.8510260
lm
and abline
This command stands for “_l_inear _m_odel" and is R’s general-purpose regression command. Its syntax is similar to that of boxplot
from R Tutorial #2 in that we use a tilde (~
) to indicate a functional relationship. Here the variable to the left of the tilde is the “y” variable, while the variable to the right is the “x” variable.
survey[ , lm(height ~ handspan)]
##
## Call:
## lm(formula = height ~ handspan)
##
## Coefficients:
## (Intercept) handspan
## 42.251 1.229
Remember: unlike correlation and covariance, regression is not symmetric:
survey[ , lm(handspan ~ height)]
##
## Call:
## lm(formula = handspan ~ height)
##
## Coefficients:
## (Intercept) height
## 0.5305 0.2970
It turns out that we can use the same syntax with the command plot
:
survey[ , plot(handspan ~ height)]
To add the regression line, we follow the plot
command with the function abline
like so:
survey[ , plot(handspan ~ height)]
survey[ , abline(lm(handspan ~ height))]
Note that abline
can only be used after you’ve already made a plot. It adds a line to the existing plot rather than making a plot of its own.
You can also use abline
to plot different kinds of lines. For example, we can show that the regression line goes through the means of the data as follows:
survey[ , {
plot(handspan ~ height)
abline(lm(handspan ~ height))
abline(v = mean(height, na.rm = TRUE),
h = mean(handspan, na.rm = TRUE),
col = 'red', lty = 2)
}]
We’ve introduced a new piece of syntax in the j
argument – {}
. This syntax allows us to run multiple commands within one pair of square brackets – here, we’re running plot
and then lm
twice. We’ll see {}
again below when we learn about writing functions; for now, observe that it’s convenient not to have to write survey[]
repeatedly.
As to the call to abline
, we’ve supplied two new arguments. v
indicates that we want a vertical line at the specified number(s), while h
indicates that we want a horizontal line at the specified number(s).
We’ve also introduced some new plotting commands – lty
stands for _l_ine _ty_pe. lty = 2
makes a dashed line. A standard line (the default) has lty = 1
. Try some other line types and see what happens; the full description is in ?par
. lty
can be used with any plot that features lines.
If you want to use abline
to draw a line that is neither horizontal nor vertical, use the argument a
for the intercept and b
for the slope. For example:
x = seq(from = 0, to = 1, by = 0.1)
y = x^2
plot(y ~ x)
abline(a = 0, b = 1)
Carry out a linear regression in which height
is the y-variable and handedness
is the x-variable. Plot the data and add the regression line. Add red dashed lines to show that the regression line goes through the means of the data. Interpret your results. Is there a relationship between handedness and height?
Functions take one or more inputs, called “arguments”, and combine them in some way to produce an output. You’ve already met a great many of the built-in R functions, as well as several of the functions in the data.table
package. Now it’s time to make some of our own. Making functions is a great way to save time by automating repetitive tasks and extend the functionality of R to make it even more useful.
Suppose R didn’t have a built-in function for computing z-scores (it does – see ?scale
); given how much they come up in class, it would be great to use in our code something like z.score(some_variable)
instead of having to compute it more manually each time. Luckily it’s straightforward to do just that and build our own function! Among other things, this example will illustrate a key point about functions: you can build them out of other functions, which means you don’t have to re-invent the wheel each time you want to do something. To make our z-score function we’ll use the built-in R functions mean
and sd
rather than writing our own mean and standard deviation functions from scratch:
z.score = function(x){
z = (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)
return(z)
}
Let’s walk through the above code one step at a time:
=
, pronounced “gets” is the assignment operator, so z.score =
means that we’re assigning something to z.score
. This is just like assigning any other variable in R, except now this “variable” is a function named z.score
.function
. These are used to enclose the function’s arguments. Argument is just another word for input. Functions can take one argument, more than one argument or no arguments depending on the situation, but you always need the parentheses. Here, z.score = function(x)
means “R, I’m going to create a function called z.score
that takes a single argument that I’m going to call x
.”{
. Essentially, a function is a set of instructions that tells R how to turn some arguments (inputs) into a desired output. These instructions are called the function body and they are always enclosed in curly brackets. If you look at the bottom of the function, you’ll see the second curly bracket. A common mistake that people make is to forget the second bracket. Always make sure to close any bracket that you open.x
. Ignore missing observations.x
. Ignore missing observations.x
from the vector x
and then divide by the standard deviationz
return
. This indicates that our function should output or “return” the vector z
. In other words, z
is the “answer”.Actually, step 5 is not strictly necessary – R knows to return the output of whatever the last line of a function is. But it’s safe to be explicit and let readers of your code know what your intended output of your function is.
It was easy to create the z.score
function because, as we explored in R Tutorial #1, mathematical operations in R are vectorized. This means, for example, that if we want to add 3 to each element of a vector x
, we can just write x + 3
. In other words we don’t have to separately add three to each element of x
. Similarly, if we want to add corresponding elements of x
and y
, two vectors of the same length, we use x+y
. Try some simple examples of your own to make sure you understand how this works. See what happens if you try to add two vectors of different lengths.
So what exactly is the status of z
in the function z.score
? It looks like an ordinary R variable and in a certain sense it is. But there is something slightly subtle here: z
is a variable “inside a function”, namely a variable inside the function z.score
. This makes it different from the other variables that we’ve seen so far. To explain how, we’ll need to introduce a programming concept called “scope.”
Scope refers to where the variable “lives,” meaning how it can be accessed. We will focus on two types of scope, global and local.
If you type the command ls()
into the R console, you will see a bunch of variables. These variables have “global scope”. This means that they can be accessed from anywhere by any function. If you’ve run the code for the function z.score
above, you’ll see z.score
when you type ls()
into the console. This means that z.score
has global scope. Similarly if you type foo = 3
into the R console, foo
will have global scope. You can see this by re-running ls()
after you create foo
.
Notice, however, that you did not see z
when you typed ls()
. This is because z
is a “variable inside a function.” In other words, it has local scope. This means that z
is only accessible inside the function z.score
.
Another example of a variable with local is the column of a data.table
, for example handspan
in the dataframe survey
. We can only access handspan
by extracting it from survey
, which has global scope, using $
or []
.
Don’t worry if you don’t immediately understand this. It’s a weird concept to grasp. You can think of variables with global scope like books on a shelf. You can see each one and pick it out. In contrast, variables with local scope are like chapters in a book: you don’t know what the chapter is called until you open a specific book and read it.
It’s not at all required, but the chapter on functions from Hadley Wickham’s free online advanced R textbook is the place to start looking for all the nitty-gritty details on how this works. Caveat lector – don’t expect to understand much on your first pass through reading.
Let’s look at another example: making a function to compute the sample mean. R has a perfectly good function for calculating means, as we’ve seen, so this is mainly for illustrative purposes*. I’ll call this function mymean
to distinguish it from the built-in R function:
*: Unnecessary aside: actually, the mean
function in R will be slower than this function for subtle reasons explained here.
mymean = function(x){
x = x[!is.na(x)]
x.bar = sum(x)/length(x)
return(x.bar)
}
Again, I’ll walk you through this one step-by-step:
mymean = function(x)
tells R to create a function that takes a single argument x
and store it under the name mymean
{}
enclose the function body.x = x[!is.na(x)]
overwrites x
with a new vector containing all the elements of x
that are not NA
s. In other words, this removes missing observations. (A subtle note about global/local scope here is that doing this does not overwrite the original vector that’s passed to mymean
as an argument – it overwrites a local copy called x
)sum(x)/length(x)
calculates the sample mean: summing up the observations and dividing by the number of observations. Notice that I did not use na.rm = TRUE
in the sum
command. If I had used it, everything still would have worked but since I already removed all the missing observations in the line above, this wasn’t necessary.Let’s test out my.mean
on some real data and make sure it works. First we’ll use R’s built-in mean
function:
mean(survey$height, na.rm = TRUE)
## [1] 67.54545
and then we’ll try mymean
mymean(survey$height)
## [1] 67.54545
It works! Now I’m going to show you a slightly different version of mymean
which I’ll call mymean2
. This one will not give the correct answer:
mymean2 = function(x){
x.bar = sum(x, na.rm = TRUE)/length(x)
return(x.bar)
}
mymean2(survey$height)
## [1] 66.78652
The answer is too small! What’s going on here? The problem has to do with how I handled missing observations. Rather than throwing them away, I told sum
to ignore them. The problem is that length
does not ignore them – it simply counts up the number of elements in a vector, regardless of whether they’re NA
s. In effect we have divided by the wrong value of \(n\): the sum is being taken over all non-missing observations while the length is being computed over all observations.
It’s worth comparing mymean2
to z.score
, the first function I introduced above. In that example it was ok to use na.rm = TRUE
to handle the missing observations since this argument was available for both the mean
and sd
functions.
One more example before I set you loose to make your own function. Here’s R code to create a function called myvar
that calculates the sample variance. This function piggybacks on mymean
from above. This is a good thing. We’ve already tested mymean
and know that it works, so incorporating it into myvar
is much better than starting from scratch. Again, note how we handle missing observations.
myvar = function(x){
x = x[!is.na(x)]
s.squared = sum((x-mymean(x))^2)/(length(x) - 1)
return(s.squared)
}
We can test this against var
to make sure it works:
var(survey$handspan, na.rm = TRUE)
## [1] 4.753788
myvar(survey$handspan)
## [1] 4.753788
Looks good!
It turns out that R doesn’t have a built-in function to calculate skewness (there is one in the library moments
). Your job is to write one. Remember the definition of skewness from lecture: \[\mbox{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^3}{s^3}\] where \(s\) is the sample standard deviation. Call your function skew
. It should take one argument, a vector x
, and return the sample skewness of x
. Make sure to handle missing values appropriately. Use the built-in R functions mean
, sum
, length
and sd
to construct your skewness function. Test your function out on the handedness
data. You should get an answer of around -2.2.
## [1] -2.219056
data.table
The examples we’ve seen so far have been very simple: functions that take one argument and return a single value. In fact, functions can take multiple values and return pretty much anything. First we’ll look at an example of a function that returns a data.table
. This is a very flexible way of returning multiple values in a way that makes them easy to access.
summary.stats = function(x){
x = x[!is.na(x)]
sample.mean = mean(x)
std.dev = sd(x)
out = data.table(sample.mean, std.dev)
return(out)
}
results = summary.stats(survey$handedness)
results
## sample.mean std.dev
## 1: 0.6578936 0.4077245
results$sample.mean
## [1] 0.6578936
results$std.dev
## [1] 0.4077245
Now we’ll look at a function that takes multiple arguments: mycov
. This is essentially a stripped-down version of cov
, the built-in R equivalent. Unlike cov
, which can accept two vectors as arguments or a whole data.table
, our version will only accept two vectors as arguments: x
and y
. Handling the missing observations is a bit more complicated in this situation: we need to drop any x
for which there isn’t a corresponding y
and any y
for which there isn’t a corresponding x
– We’ve seen some examples of how to do this in R Tutorial #2 using !is.na
and &
. This is because covariance is a measure of the relationship between x
and y
. Accordingly, it is calculated using pairs of observations rather than individual observations. First I’ll give you the code, then I’ll explain the part that’s different from what we did above:
mycov = function(x, y){
keep = !is.na(x) & !is.na(y)
x = x[keep]
y = y[keep]
n = length(x)
s.xy = sum( (x - mean(x)) * (y - mean(y)) ) / (n-1)
return(s.xy)
}
Testing this,
survey[ , cov(handspan, handedness, use = "complete.obs")]
## [1] -0.01556655
survey[ , mycov(handspan, handedness)]
## [1] -0.01556655
we see that it works.
Write a function to carry out linear regression called myreg
. It should take two arguments: vectors x
and y
which you may assume have the same length. It should return a data.table
with columns a
for the intercept and b
for the slope. Make sure to handle missing values appropriately. Feel free to use any built-in R functions you like except lm
. Check your answer against the regression results for height and handspan that we calculated above. When creating this function, remember that regression is not symmetric.
##
## Call:
## lm(formula = height ~ handspan)
##
## Coefficients:
## (Intercept) handspan
## 42.251 1.229
## a b
## 1: 42.25052 1.229124