This tutorial has two parts. The first part shows you how to calculate the remaining summary statistics from the lectures and introduces a few new plotting commands. The second part describes one of the most fundamental parts of R: creating your own functions. Rather than appearing at the end, several exercises are scattered throughout the parts of this tutorial.

More Summary Statistics

First we’ll load the survey data from R Tutorial #2.

Like before we’ll only need a subset of the columns available; rather than go to the hassle of deleting these columns afterwards, we can specify which columns to keep simultaneously with reading in the data using the select argument to fread. (There is also a drop argument for the reverse – specifying columns to exclude)

library(data.table)
survey = fread("http://www.ditraglia.com/econ103/old_survey.csv",
               select = c("handedness", "height", "handspan"))

Correlation: cor

This command calculates sample correlation coefficients. If you pass it two vectors of the same length it returns the correlation between them. If you feed it a data.table or matrix it returns a matrix containing all pairwise correlations between the columns (we haven’t covered matrices in R since you’ll only need to interpret them as output for this class – for a primer on handling matrix objects, see here). Unlike the univariate summary statistics, in which we used na.rm = TRUE to ignore missing observations, for cor we use the argument use = "complete.obs". The reason for this difference is that cor can handle a matrix or data.table as its input. In this case there are actually many different ways of handling missing observations. For more details, see ?cor. Setting use = "complete.obs" removes any rows with missing observations before proceeding.

survey[ , cor(handspan, height, use = "complete.obs")]
## [1] 0.6042423

Of course, we could use the approach we saw in R Tutorial #2 as well; whichever feels more natural:

survey[!is.na(handspan) & !is.na(height), cor(handspan, height)]
## [1] 0.6042423

We can also feed the full data.table to cor like so:

cor(survey, use = "complete.obs")
##             handedness    height    handspan
## handedness  1.00000000 0.1125605 -0.01719909
## height      0.11256055 1.0000000  0.60942759
## handspan   -0.01719909 0.6094276  1.00000000
#alternatively, there's the na.omit
#  function, which does just what it says:
cor(na.omit(survey))
##             handedness    height    handspan
## handedness  1.00000000 0.1125605 -0.01719909
## height      0.11256055 1.0000000  0.60942759
## handspan   -0.01719909 0.6094276  1.00000000

We see that there is a large positive correlation between height and handspan and a slight positive correlation between height and handedness. There’s basically no correlation between handedness and handspan. The preceding matrix has ones on its main diagonal since the correlation between any variable and itself is identically one. (A good exercise for extra practice would be to prove this assertion using the formula for correlation from class. It’s not very difficult.)

If you look carefully, you’ll notice that the correlation between height and handspan is slightly different when calculated at the same time as all the other correlations. This is because of the way that use = "complete.obs" drops rows. In this case, it indicates that there is at least one observation in our dataset for which we have height and handspan but not handedness.

For practice, we can find this with judicious use of is.na and logical operators:

survey[is.na(handedness) & !is.na(height) & !is.na(handspan)]
##    handedness height handspan
## 1:         NA     70     19.5

Covariance: cov

This command works just like cor but returns covariances rather than correlations:

survey[ , cov(handspan, height, use = "complete.obs")]
## [1] 5.910786
cov(survey, use = "complete.obs")
##            handedness    height   handspan
## handedness  0.1685042  0.206959 -0.0155499
## height      0.2069590 20.062517  6.0121751
## handspan   -0.0155499  6.012175  4.8510260
cov(na.omit(survey))
##            handedness    height   handspan
## handedness  0.1685042  0.206959 -0.0155499
## height      0.2069590 20.062517  6.0121751
## handspan   -0.0155499  6.012175  4.8510260

Regression: lm and abline

This command stands for “_l_inear _m_odel" and is R’s general-purpose regression command. Its syntax is similar to that of boxplot from R Tutorial #2 in that we use a tilde (~) to indicate a functional relationship. Here the variable to the left of the tilde is the “y” variable, while the variable to the right is the “x” variable.

survey[ , lm(height ~ handspan)]
## 
## Call:
## lm(formula = height ~ handspan)
## 
## Coefficients:
## (Intercept)     handspan  
##      42.251        1.229

Remember: unlike correlation and covariance, regression is not symmetric:

survey[ , lm(handspan ~ height)]
## 
## Call:
## lm(formula = handspan ~ height)
## 
## Coefficients:
## (Intercept)       height  
##      0.5305       0.2970

It turns out that we can use the same syntax with the command plot:

survey[ , plot(handspan ~ height)]

To add the regression line, we follow the plot command with the function abline like so:

survey[ , plot(handspan ~ height)]
survey[ , abline(lm(handspan ~ height))]

Note that abline can only be used after you’ve already made a plot. It adds a line to the existing plot rather than making a plot of its own.

You can also use abline to plot different kinds of lines. For example, we can show that the regression line goes through the means of the data as follows:

survey[ , {
  plot(handspan ~ height)
  abline(lm(handspan ~ height))
  abline(v = mean(height, na.rm = TRUE),
         h = mean(handspan, na.rm = TRUE),
         col = 'red', lty = 2)
}]

We’ve introduced a new piece of syntax in the j argument – {}. This syntax allows us to run multiple commands within one pair of square brackets – here, we’re running plot and then lm twice. We’ll see {} again below when we learn about writing functions; for now, observe that it’s convenient not to have to write survey[] repeatedly.

As to the call to abline, we’ve supplied two new arguments. v indicates that we want a vertical line at the specified number(s), while h indicates that we want a horizontal line at the specified number(s).

We’ve also introduced some new plotting commands – lty stands for _l_ine _ty_pe. lty = 2 makes a dashed line. A standard line (the default) has lty = 1. Try some other line types and see what happens; the full description is in ?par. lty can be used with any plot that features lines.

If you want to use abline to draw a line that is neither horizontal nor vertical, use the argument a for the intercept and b for the slope. For example:

x = seq(from = 0, to = 1, by = 0.1)
y = x^2
plot(y ~ x)
abline(a = 0, b = 1)

Exercise #1 - Regression

Carry out a linear regression in which height is the y-variable and handedness is the x-variable. Plot the data and add the regression line. Add red dashed lines to show that the regression line goes through the means of the data. Interpret your results. Is there a relationship between handedness and height?

Creating Your Own Functions

Functions take one or more inputs, called “arguments”, and combine them in some way to produce an output. You’ve already met a great many of the built-in R functions, as well as several of the functions in the data.table package. Now it’s time to make some of our own. Making functions is a great way to save time by automating repetitive tasks and extend the functionality of R to make it even more useful.

Simple Example

Suppose R didn’t have a built-in function for computing z-scores (it does – see ?scale); given how much they come up in class, it would be great to use in our code something like z.score(some_variable) instead of having to compute it more manually each time. Luckily it’s straightforward to do just that and build our own function! Among other things, this example will illustrate a key point about functions: you can build them out of other functions, which means you don’t have to re-invent the wheel each time you want to do something. To make our z-score function we’ll use the built-in R functions mean and sd rather than writing our own mean and standard deviation functions from scratch:

z.score = function(x){
  z = (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)
  return(z)
}

Let’s walk through the above code one step at a time:

  1. In R =, pronounced “gets” is the assignment operator, so z.score = means that we’re assigning something to z.score. This is just like assigning any other variable in R, except now this “variable” is a function named z.score.
  2. There are some parentheses next to function. These are used to enclose the function’s arguments. Argument is just another word for input. Functions can take one argument, more than one argument or no arguments depending on the situation, but you always need the parentheses. Here, z.score = function(x) means “R, I’m going to create a function called z.score that takes a single argument that I’m going to call x.”
  3. The next thing we see is a left curly bracket: {. Essentially, a function is a set of instructions that tells R how to turn some arguments (inputs) into a desired output. These instructions are called the function body and they are always enclosed in curly brackets. If you look at the bottom of the function, you’ll see the second curly bracket. A common mistake that people make is to forget the second bracket. Always make sure to close any bracket that you open.
  4. Now we’ll look at the function body, the list of commands contained inside of the curly brackets. The first part should be familiar: it’s made up entirely of R commands that you already know! It does the following:
  • Calculate the mean of the argument (input) x. Ignore missing observations.
  • Calculate the standard deviation the argument (input) x. Ignore missing observations.
  • Subtract the mean of x from the vector x and then divide by the standard deviation
  • Store the answer in a vector called z
  1. The last part of the function is the command return. This indicates that our function should output or “return” the vector z. In other words, z is the “answer”.

Actually, step 5 is not strictly necessary – R knows to return the output of whatever the last line of a function is. But it’s safe to be explicit and let readers of your code know what your intended output of your function is.

It was easy to create the z.score function because, as we explored in R Tutorial #1, mathematical operations in R are vectorized. This means, for example, that if we want to add 3 to each element of a vector x, we can just write x + 3. In other words we don’t have to separately add three to each element of x. Similarly, if we want to add corresponding elements of x and y, two vectors of the same length, we use x+y. Try some simple examples of your own to make sure you understand how this works. See what happens if you try to add two vectors of different lengths.

Scope

So what exactly is the status of z in the function z.score? It looks like an ordinary R variable and in a certain sense it is. But there is something slightly subtle here: z is a variable “inside a function”, namely a variable inside the function z.score. This makes it different from the other variables that we’ve seen so far. To explain how, we’ll need to introduce a programming concept called “scope.”

Scope refers to where the variable “lives,” meaning how it can be accessed. We will focus on two types of scope, global and local.

If you type the command ls() into the R console, you will see a bunch of variables. These variables have “global scope”. This means that they can be accessed from anywhere by any function. If you’ve run the code for the function z.score above, you’ll see z.score when you type ls() into the console. This means that z.score has global scope. Similarly if you type foo = 3 into the R console, foo will have global scope. You can see this by re-running ls() after you create foo.

Notice, however, that you did not see z when you typed ls(). This is because z is a “variable inside a function.” In other words, it has local scope. This means that z is only accessible inside the function z.score.

Another example of a variable with local is the column of a data.table, for example handspan in the dataframe survey. We can only access handspan by extracting it from survey, which has global scope, using $ or [].

Don’t worry if you don’t immediately understand this. It’s a weird concept to grasp. You can think of variables with global scope like books on a shelf. You can see each one and pick it out. In contrast, variables with local scope are like chapters in a book: you don’t know what the chapter is called until you open a specific book and read it.

It’s not at all required, but the chapter on functions from Hadley Wickham’s free online advanced R textbook is the place to start looking for all the nitty-gritty details on how this works. Caveat lector – don’t expect to understand much on your first pass through reading.

Roll Your Own Mean

Let’s look at another example: making a function to compute the sample mean. R has a perfectly good function for calculating means, as we’ve seen, so this is mainly for illustrative purposes*. I’ll call this function mymean to distinguish it from the built-in R function:

*: Unnecessary aside: actually, the mean function in R will be slower than this function for subtle reasons explained here.

mymean = function(x){
  x = x[!is.na(x)]
  x.bar = sum(x)/length(x)
  return(x.bar)
}

Again, I’ll walk you through this one step-by-step:

  1. mymean = function(x) tells R to create a function that takes a single argument x and store it under the name mymean
  2. The curly braces {} enclose the function body.
  3. x = x[!is.na(x)] overwrites x with a new vector containing all the elements of x that are not NAs. In other words, this removes missing observations. (A subtle note about global/local scope here is that doing this does not overwrite the original vector that’s passed to mymean as an argument – it overwrites a local copy called x)
  4. sum(x)/length(x) calculates the sample mean: summing up the observations and dividing by the number of observations. Notice that I did not use na.rm = TRUE in the sum command. If I had used it, everything still would have worked but since I already removed all the missing observations in the line above, this wasn’t necessary.

Let’s test out my.mean on some real data and make sure it works. First we’ll use R’s built-in mean function:

mean(survey$height, na.rm = TRUE)
## [1] 67.54545

and then we’ll try mymean

mymean(survey$height)
## [1] 67.54545

It works! Now I’m going to show you a slightly different version of mymean which I’ll call mymean2. This one will not give the correct answer:

mymean2 = function(x){
  x.bar = sum(x, na.rm = TRUE)/length(x)
  return(x.bar)
}
mymean2(survey$height)
## [1] 66.78652

The answer is too small! What’s going on here? The problem has to do with how I handled missing observations. Rather than throwing them away, I told sum to ignore them. The problem is that length does not ignore them – it simply counts up the number of elements in a vector, regardless of whether they’re NAs. In effect we have divided by the wrong value of \(n\): the sum is being taken over all non-missing observations while the length is being computed over all observations.

It’s worth comparing mymean2 to z.score, the first function I introduced above. In that example it was ok to use na.rm = TRUE to handle the missing observations since this argument was available for both the mean and sd functions.

Roll Your Own Variance

One more example before I set you loose to make your own function. Here’s R code to create a function called myvar that calculates the sample variance. This function piggybacks on mymean from above. This is a good thing. We’ve already tested mymean and know that it works, so incorporating it into myvar is much better than starting from scratch. Again, note how we handle missing observations.

myvar = function(x){
  x = x[!is.na(x)]
  s.squared = sum((x-mymean(x))^2)/(length(x) - 1)
  return(s.squared)
}

We can test this against var to make sure it works:

var(survey$handspan, na.rm = TRUE)
## [1] 4.753788
myvar(survey$handspan)
## [1] 4.753788

Looks good!

Exercise #2

It turns out that R doesn’t have a built-in function to calculate skewness (there is one in the library moments). Your job is to write one. Remember the definition of skewness from lecture: \[\mbox{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^3}{s^3}\] where \(s\) is the sample standard deviation. Call your function skew. It should take one argument, a vector x, and return the sample skewness of x. Make sure to handle missing values appropriately. Use the built-in R functions mean, sum, length and sd to construct your skewness function. Test your function out on the handedness data. You should get an answer of around -2.2.

## [1] -2.219056

Returning a data.table

The examples we’ve seen so far have been very simple: functions that take one argument and return a single value. In fact, functions can take multiple values and return pretty much anything. First we’ll look at an example of a function that returns a data.table. This is a very flexible way of returning multiple values in a way that makes them easy to access.

summary.stats = function(x){
  x = x[!is.na(x)]
  sample.mean = mean(x)
  std.dev  = sd(x)
  out = data.table(sample.mean, std.dev)
  return(out)
}
results = summary.stats(survey$handedness)
results
##    sample.mean   std.dev
## 1:   0.6578936 0.4077245
results$sample.mean
## [1] 0.6578936
results$std.dev
## [1] 0.4077245

Multiple Arguments

Now we’ll look at a function that takes multiple arguments: mycov. This is essentially a stripped-down version of cov, the built-in R equivalent. Unlike cov, which can accept two vectors as arguments or a whole data.table, our version will only accept two vectors as arguments: x and y. Handling the missing observations is a bit more complicated in this situation: we need to drop any x for which there isn’t a corresponding y and any y for which there isn’t a corresponding x – We’ve seen some examples of how to do this in R Tutorial #2 using !is.na and &. This is because covariance is a measure of the relationship between x and y. Accordingly, it is calculated using pairs of observations rather than individual observations. First I’ll give you the code, then I’ll explain the part that’s different from what we did above:

mycov = function(x, y){
  
  keep = !is.na(x) & !is.na(y)
  x = x[keep]
  y = y[keep]
  
  n = length(x)
  
  s.xy = sum( (x - mean(x)) * (y - mean(y)) ) / (n-1)
  return(s.xy)
}

Testing this,

survey[ , cov(handspan, handedness, use = "complete.obs")]
## [1] -0.01556655
survey[ , mycov(handspan, handedness)]
## [1] -0.01556655

we see that it works.

Exercise #3

Write a function to carry out linear regression called myreg. It should take two arguments: vectors x and y which you may assume have the same length. It should return a data.table with columns a for the intercept and b for the slope. Make sure to handle missing values appropriately. Feel free to use any built-in R functions you like except lm. Check your answer against the regression results for height and handspan that we calculated above. When creating this function, remember that regression is not symmetric.

## 
## Call:
## lm(formula = height ~ handspan)
## 
## Coefficients:
## (Intercept)     handspan  
##      42.251        1.229
##           a        b
## 1: 42.25052 1.229124