This is a long tutorial, but the material is fairly straightforward. If you run into any trouble, feel free to post on Piazza.

The most crucial piece of advice for learning a programming language is to recognize it requires the same approach as learning a foreign language – you’ll benefit most from being actively engaged in learning. That means not just reading along with these tutorials, but actively processing what they say and running the code yourself.

Part 1: Installing R

Carry out the following two steps in order:

  1. Go to http://cran.r-project.org/ and install the version of R for your operating system.

  2. Go to http://rstudio.org/download/desktop and click the link listed under “Recommended for Your System”. Follow the instructions to install RStudio.

To make sure this worked, open the program RStudio and go to File > New > R Script. This will open a blank text document. In the document, type the text given in the box below and then click and drag to highlight both lines of code and click the button marked “Run.” If everything is working correctly, the console should display TRUE.

x = 5
x == 5

Congratulations: you’ve just written your first R script! To save it, go to File > Save As, and choose a name. NOTE: Always save your scripts as .R files so they’ll open in RStudio by default.

Note that you can run one line of your script at a time by moving your cursor to that line and pressing CONTROL-ENTER (on Windows or Linux) or COMMAND-RETURN (on Mac OS X). Another helpful shortcut is CONTROL-A (COMMAND-A on Mac), which highlights all of the lines of code in the text editor.

Part 2: The Absolute Basics

Here are some of the most fundamental things you can do with R.

Arithmetic

#add numbers
1 + 1
## [1] 2
#subtract them
8 - 4
## [1] 4
#divide
13/2
## [1] 6.5
#multiply
4*pi
## [1] 12.56637
#exponentiate
2^10
## [1] 1024

Logical Comparison

3 < 4
## [1] TRUE
3 > 4
## [1] FALSE
#contrast with 3 = 4; see section about variables below
3 == 4
## [1] FALSE
#!= means "not equal to"
3 != 4
## [1] TRUE
4 >= 5
## [1] FALSE
4 <= 5
## [1] TRUE
2 + 2 == 5
## [1] FALSE
10 - 6 == 4
## [1] TRUE

Strings (text)

Numbers are bread and butter for computers, but text is what will facilitate understanding for us mere mortals.

'Econometrics is awesome'
## [1] "Econometrics is awesome"
#R delimits strings with EITHER double or single quotes.
#  There is only a very minimal difference
"Econometrics is still awesome"
## [1] "Econometrics is still awesome"

Variables

Just like in algebra, variables are a great form of shorthand. Instead of writing 3.1415926… all the time, we can just write pi.

Assignment to a variable happens from right to left – the value on the right side gets assigned to the name on the left side. You can use nearly anything as a variable name in R. The only rules are:

  1. . and _ are OK, but no other symbols.
  2. Your variable name must not start with a number or _ (2squared and _one are illegal).

[A note for those of you who have programming experience: while R supports object-oriented programming, periods . do not have a special meaning in the language. For historical reasons, R programmers often use periods in place of underscores in variable names, but either works. Just be consistent to keep your code readable.]

x = 42
x / 2
## [1] 21
#if we assign something else to x,
#  the old value is deleted
x = "Melody to Funkytown!"
x
## [1] "Melody to Funkytown!"
x = 5
x == 5
## [1] TRUE
foo = 3
bar = 5
foo.bar = foo + bar
foo.bar
## [1] 8
foo.bar2 = 2 * foo.bar
foo.bar2
## [1] 16
foo_bar = foo - bar
foo_bar
## [1] -2

Note: In programmer speak, = here is an “assignment operator” – it’s the thing used to assign values to a variable name. R also has a second assignment operator that you’re bound to see sooner or later, <-. So x <- 42 and x = 42 are identical, and both accomplish the task of assigning the value of 42 to the name x. We’ll try to stick with using = since it’s easier to type and in some ways more intuitive. See this wonderful post for some more history and a very subtle difference between the two operators that you needn’t concern yourself with for now.
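
As a quick check of the claim above, here is a minimal sketch showing that the two operators store the same value in ordinary top-level code. The names a and b are made up for this illustration.

```r
# two ways to assign the value 42 to a name
a = 42
b <- 42

# identical() checks whether two objects are exactly the same
identical(a, b)
## [1] TRUE
```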

Vectors & Types

In R, a vector is just an (ordered) set of related things. You should basically think of it like a column in Excel.

x = c(4, 7, 9)
x
## [1] 4 7 9
y = c('a', 'b', 'c')
y
## [1] "a" "b" "c"

4, 7, and 9 are “related” because they’re all numbers; a, b and c are all letters. Having variables is becoming more convenient – instead of having to write c(4, 7, 9) all the time, we can just write x.

What happens when we try and combine things that aren’t so obviously related?

x = c(1, TRUE, "three")
x
## [1] "1"     "TRUE"  "three"

Note the quotation marks. R has converted 1 and TRUE into text representations. That’s because 1 and TRUE are different types than "three". There are four basic types of variables you’re likely to encounter in this class, listed here in hierarchical order:

  1. logical: TRUE or FALSE
  2. integer: 0L, -1L, 1L, etc. A (real) number without a decimal part. Technical note: they take up less space in the computer than numbers with decimals.
  3. numeric: pi, 0.34, 1.4043, etc. A real number.
  4. character: "some words", "more words", etc.

Vectors are converted to the highest-numbered type on this list that is present – x above has "three", so the whole vector becomes a character vector.
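
To see the hierarchy at work, the sketch below adds one "higher" type at a time and asks R what type the whole vector ended up with. The particular values are arbitrary and chosen only for illustration.

```r
# class() reports the type R settled on for the whole vector
class(c(TRUE, FALSE))
## [1] "logical"
class(c(TRUE, 2L))
## [1] "integer"
class(c(TRUE, 2L, 3.5))
## [1] "numeric"
class(c(TRUE, 2L, 3.5, "four"))
## [1] "character"

# a logical promoted to a number becomes 1 (TRUE) or 0 (FALSE)
c(TRUE, FALSE, 3.5)
## [1] 1.0 0.0 3.5
```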

Vector Arithmetic and Functions

Vectors make it easy to do many computations all at once – adding one to a list of numbers, dividing all of them by 3, etc. And as long as two vectors are the same length, we can combine them in natural ways:

x = c(1, 2, 3)
x + 4
## [1] 5 6 7
x/3
## [1] 0.3333333 0.6666667 1.0000000
-x
## [1] -1 -2 -3
x^3
## [1]  1  8 27
y = c(3, 2, 1)
x - y
## [1] -2  0  2
x * y
## [1] 3 4 3
x/y
## [1] 0.3333333 1.0000000 3.0000000
x > 2
## [1] FALSE FALSE  TRUE
x >= 2
## [1] FALSE  TRUE  TRUE
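
The "same length" condition is worth keeping in mind because R does not stop you from combining vectors of different lengths; it reuses ("recycles") the shorter one instead. A minimal sketch, using made-up vectors a and b:

```r
a = c(1, 2, 3, 4)
b = c(10, 20)

# b is reused as 10, 20, 10, 20 to match the length of a
a + b
## [1] 11 22 13 24

# adding a single number is the same idea: 4 is reused for every element
a + 4
## [1] 5 6 7 8
```

This is convenient for single numbers, but it can hide mistakes when you meant for the two vectors to line up element by element.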

Just like in math, a function is a way of mapping input to output, and just like in most math classes, you can spot functions since they use parentheses: (). We’ve already seen the concatenate function c used (for example) to create vectors.

We can also apply any number of ubiquitous functions to our vector input. Just a small taste:

x = c(1, 2, 3)
#sum: add up the elements of a vector
sum(x)
## [1] 6
#Just like you can use the command sum to add up the
#  elements of a numeric vector, you can use
#  prod to take their product:
prod(x)
## [1] 6
sqrt(x)
## [1] 1.000000 1.414214 1.732051
y = c(-1, 2, 4)
#abs: absolute value
abs(y)
## [1] 1 2 4
#exp: exponential. exp(x) is e^x
exp(y)
## [1]  0.3678794  7.3890561 54.5981500
#log: _natural_ logarithm (base e)
log(x)
## [1] 0.0000000 0.6931472 1.0986123
#sin and cos interpret their input
#  as *radians* rather than degrees.
sin(x) + cos(y)
## [1]  1.3817733  0.4931506 -0.5125236
max(y)
## [1] 4
min(y)
## [1] -1
range(y)
## [1] -1  4
mean(x)
## [1] 2
median(x)
## [1] 2

Another thing that we will do all the time is use regularly-spaced sequences of numbers. These are created in R with : or seq:

x = 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
y = 10:1
y
##  [1] 10  9  8  7  6  5  4  3  2  1
#sometimes the gap is not 1
z = seq(0, 1, by = .02)
z
##  [1] 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26
## [15] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54
## [29] 0.56 0.58 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78 0.80 0.82
## [43] 0.84 0.86 0.88 0.90 0.92 0.94 0.96 0.98 1.00
#other times we care less about the gap and
#  more about how many points we get out
w = seq(0, 1, length.out = 20)
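
Since w is not printed above, here is a short sketch confirming what length.out does: R picks the gap for you so that you get exactly the requested number of evenly spaced points.

```r
w = seq(0, 1, length.out = 20)

# exactly 20 points, starting at 0 and ending at 1
length(w)
## [1] 20
range(w)
## [1] 0 1

# a smaller case that is easy to check by eye:
#   5 points between 0 and 1 means a gap of 0.25
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
```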

In addition to math/arithmetic functions, there is a litany of basic programming functions that you’re likely to use all of the time:

x = 99:32
#length: how many elements (items) are there in x?
length(x)
## [1] 68
y = c("hey you!", "out there in the cold")
#what TYPE of variable does R think this is?
class(y)
## [1] "character"
#rep: repeat/reproduce
rep(y, 4)
## [1] "hey you!"              "out there in the cold" "hey you!"             
## [4] "out there in the cold" "hey you!"              "out there in the cold"
## [7] "hey you!"              "out there in the cold"
#head/tail: display only the beginning/end
#  of an object -- very useful for very
#  large objects
x = 1:100000
head(x)
## [1] 1 2 3 4 5 6
tail(x)
## [1]  99995  99996  99997  99998  99999 100000

Subsetting Vectors: [

Often we want to examine only part of a vector, most commonly the part of a vector that satisfies some condition, but also looking at the first or last few elements. To do this we extract or subset those elements by using [:

x = c(5, 4, 1)
x[1]
## [1] 5
x[3]
## [1] 1
x[1:2]
## [1] 5 4
x[2:3]
## [1] 4 1

In the syntax x[something], note that something is itself a vector! So the above is all short-hand for the more complicated types of subsets:

x = 20:30
x
##  [1] 20 21 22 23 24 25 26 27 28 29 30
x[c(1, 3, 5)]
## [1] 20 22 24
x[c(5, 9)]
## [1] 24 28
x[seq(1, 10, by = 2)]
## [1] 20 22 24 26 28

Besides being an integer, something can be a logical vector of the same length as the vector itself:

x = c(5, 6, 7)
x[c(TRUE, TRUE, FALSE)]
## [1] 5 6
x[c(FALSE, TRUE, FALSE)]
## [1] 6
x[c(FALSE, FALSE, TRUE)]
## [1] 7

Most commonly we’ll do something that’s identical to the above but reads more naturally:

x = c(-1, 0, 1)
x > 0
## [1] FALSE FALSE  TRUE
x[x > 0]
## [1] 1
x[x <= 0]
## [1] -1  0

We can also replace parts of a vector by subsetting:

x = c(-1, 5, 10)
x[3] = 4
x
## [1] -1  5  4
x[x < 0] = 0
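
The last command above prints nothing, because assignment is silent. Repeating the same steps and then printing x shows the effect: every element that satisfied the condition was replaced.

```r
x = c(-1, 5, 10)
x[3] = 4

# every element for which x < 0 is TRUE gets replaced by 0
x[x < 0] = 0
x
## [1] 0 5 4
```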

Named Vectors

It’s also often useful to name our vectors to help organize the information. Suppose we were keeping track of the ages of the Trumps:

trump_ages = c(70, 46, 38, 34, 32, 22, 9)

This is nice, but much more useful if we keep track of who each element represents:

trump_ages = c(Donald = 70, Melania = 46, Donald_Jr = 38, Ivanka = 34,
               Eric = 32, Tiffany = 22, Barron = 9)
trump_ages
##    Donald   Melania Donald_Jr    Ivanka      Eric   Tiffany    Barron 
##        70        46        38        34        32        22         9

We can also use the names function to assign names; this is sometimes easier, e.g., if the names have spaces:

names(trump_ages) = c("Donald", "Melania", "Donald, Jr.", "Ivanka", "Eric", "Tiffany", "Barron")
trump_ages
##      Donald     Melania Donald, Jr.      Ivanka        Eric     Tiffany 
##          70          46          38          34          32          22 
##      Barron 
##           9

This also makes code for subsetting much easier to read, since we can subset by the names:

trump_ages["Donald"]
## Donald 
##     70
trump_ages[c("Donald", "Barron")]
## Donald Barron 
##     70      9
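
Names also survive the logical subsetting introduced earlier, which makes the results easier to read. A small sketch using the same vector as above:

```r
trump_ages = c(Donald = 70, Melania = 46, Donald_Jr = 38, Ivanka = 34,
               Eric = 32, Tiffany = 22, Barron = 9)

# logical subsetting keeps the names attached
trump_ages[trump_ages < 30]
## Tiffany  Barron 
##      22       9

# names() without an assignment returns the names as a character vector,
#   which can itself be subset
names(trump_ages)[trump_ages > 40]
## [1] "Donald"  "Melania"
```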

Getting Help: Documentation

If you’re unsure of how something works in R – what the arguments are to a function, how it works, etc. – your first step is to check the documentation:

?sum
?cos
?"="

Lists

We saw above that R doesn’t like vectors to have different types: c(TRUE, 1, "Frank") becomes c("TRUE", "1", "Frank"). But storing objects with different types is absolutely fundamental to data analysis.

R has a different type of object besides a vector used to store data of different types side-by-side: a list:

x = list(TRUE, 1, "Frank")
x
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] "Frank"

Note how different the output looks compared to using c! The quotation marks are gone except for the last component. You can ignore the mess of [[ and [ for now, but as a preview, consider some more complicated lists:

x = list(c(1, 2), c("a", "b"), c(TRUE, FALSE), c(5L, 6L))
x
## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] "a" "b"
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 5 6
y = list(list(1, 2, 3), list(4:5), 6)
y
## [[1]]
## [[1]][[1]]
## [1] 1
## 
## [[1]][[2]]
## [1] 2
## 
## [[1]][[3]]
## [1] 3
## 
## 
## [[2]]
## [[2]][[1]]
## [1] 4 5
## 
## 
## [[3]]
## [1] 6

x is a list which has 4 components, each of which is a vector with 2 components. This gives the first hint at how R treats a dataset with many variables of different types – at core, R stores a data set in a list!

y is a nested list – it’s a list that has lists for some of its components. This is very useful for more advanced operations, but probably won’t come up for quite some time, so don’t worry if you haven’t wrapped your head around this yet.
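
To make the claims about x a little more concrete, the sketch below checks them directly. It uses [[ only to pull out a single component, and it builds a tiny base-R data.frame (df, with made-up columns id and label) purely to illustrate that a data set is stored as a list.

```r
x = list(c(1, 2), c("a", "b"), c(TRUE, FALSE), c(5L, 6L))

# length() counts the components of the list, not the values inside them
length(x)
## [1] 4

# each component keeps its own type
class(x[[1]])
## [1] "numeric"
class(x[[2]])
## [1] "character"

# a data set is a list whose components (columns) have equal length
df = data.frame(id = 1:2, label = c("a", "b"))
is.list(df)
## [1] TRUE
```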

Packages

One of the things that makes R truly exceptional is its vast library of user-contributed packages.

R comes pre-loaded with a boat-load of the most common functions / methods of analysis. But in no way is this built-in library complete.

Complementing this core of the most common operations are external packages, which are basically sets of functions designed to accomplish specific tasks.

Best of all, unlike some super-expensive programming languages, all of the thousands of packages available to R users (most importantly through CRAN, the Comprehensive R Archive Network) are completely free of charge.

The three most important things to know about packages for now are where to find them, how to install them, and how to load them.

We’ll work extensively with the data.table package, which was built for working with huge data sets.

Where to find packages

Long story short: Google. Got a particular statistical technique in mind? The best R package for this is almost always the top Google result if asked correctly.

How to install packages

Just use install.packages!

install.packages("data.table")

This will download the code for the package to your computer, to a place where R can find it.

We do not yet have access to the functions in the package. We have to load it first.

How to load packages

Just add it to your library!

library(data.table)

Et voila! You’ll now have access to all of the awesome functions in the data.table package. You can also Google “tutorial data.table” (or in general “tutorial [package name]”) and you’re very likely to find a trove of sites trying to help you learn the package.
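
Installing only needs to happen once per computer, while loading has to happen in every new R session. One way to avoid re-downloading a package you already have is sketched below. Note that ensure_package is a hypothetical helper written for this example, not a function that comes with R or data.table; it relies on the base-R function requireNamespace to test whether a package is installed.

```r
# ensure_package is a hypothetical helper written for this example;
#   it is not part of R or of data.table
ensure_package = function(pkg) {
  # requireNamespace returns TRUE if the package is already installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # check again so the caller learns whether the package is now available
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing needs to be downloaded here
ensure_package("stats")
## [1] TRUE
```

For the package used in this tutorial, you would call ensure_package("data.table") and then library(data.table) as usual.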

data.tables

Data sets are the lifeblood of a data lover!

As mentioned above, data sets in R are basically lists where every element has the same length. In basic R, this is done with a data.frame, but it’ll be easier for a beginner to understand the syntax of a data.table, so you can forget about data.frames for now.

We can build a data.table from scratch with the data.table command. This command lets you build up a data.table from several vectors of the same length:

foo = 1:5
bar = 2 * foo
foo.bar = data.table(foo, bar)
foo.bar
##    foo bar
## 1:   1   2
## 2:   2   4
## 3:   3   6
## 4:   4   8
## 5:   5  10

In the preceding example I built a data.table with only two columns, but you can add as many as you like. Just separate them by commas:

y = -4:0
data.table(foo, bar, y)
##    foo bar  y
## 1:   1   2 -4
## 2:   2   4 -3
## 3:   3   6 -2
## 4:   4   8 -1
## 5:   5  10  0

Subsetting data

When you’re working with data, you’ll often want to look at subsets that satisfy a particular condition. First we’ll set up a simple data.table:

location = c("New York", "Chicago", "Boston", "Boston", "New York")
salary = c(70000, 80000, 60000, 50000, 45000)
title = c("Office Manager", "Research Assistant", "Analyst", "Office Manager", "Analyst")
hours = c(50, 56, 65, 40, 50)
jobsearch = data.table(location, salary, title, hours)
jobsearch
##    location salary              title hours
## 1: New York  70000     Office Manager    50
## 2:  Chicago  80000 Research Assistant    56
## 3:   Boston  60000            Analyst    65
## 4:   Boston  50000     Office Manager    40
## 5: New York  45000            Analyst    50

Now, suppose you wanted to see only the jobs in New York. You could select them as follows:

jobsearch[location == 'New York']
##    location salary          title hours
## 1: New York  70000 Office Manager    50
## 2: New York  45000        Analyst    50

Notice the use of the double equal sign. This command is testing a logical condition. If you use a single equals sign, this won’t work since = is what is used to name the arguments to a function in R. The preceding command looks at the data.table jobsearch and then the column location and checks which entries satisfy the condition that the location is "New York". Finally, the function returns only these rows of the data.table.

Now suppose you wanted to extract only those jobs that pay more than $50,000. The command for this is as follows:

jobsearch[salary > 50000]
##    location salary              title hours
## 1: New York  70000     Office Manager    50
## 2:  Chicago  80000 Research Assistant    56
## 3:   Boston  60000            Analyst    65

Finally, suppose the most you’re willing to work per week is 50 hours. Here are the jobs you should consider:

jobsearch[hours <= 50]
##    location salary          title hours
## 1: New York  70000 Office Manager    50
## 2:   Boston  50000 Office Manager    40
## 3: New York  45000        Analyst    50
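
Conditions can also be combined. The sketch below rebuilds the same jobsearch table and uses & ("and") and | ("or"), which are the standard R operators for combining logical vectors. The names ny_well_paid and either are made up for this illustration, and the results are checked through single columns rather than printed tables.

```r
library(data.table)

location = c("New York", "Chicago", "Boston", "Boston", "New York")
salary = c(70000, 80000, 60000, 50000, 45000)
title = c("Office Manager", "Research Assistant", "Analyst", "Office Manager", "Analyst")
hours = c(50, 56, 65, 40, 50)
jobsearch = data.table(location, salary, title, hours)

# & means "and": both conditions must hold
ny_well_paid = jobsearch[location == "New York" & salary > 50000]
ny_well_paid$title
## [1] "Office Manager"

# | means "or": at least one condition must hold
either = jobsearch[location == "Chicago" | hours < 50]
either$hours
## [1] 56 40
```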

Loading External Data & Data Summary

The vast majority of the time, you won’t be using data that you type in by hand – you’ll be importing data from external sources. One of the most common formats for such data is comma-separated values (.csv) – each line of the file represents a row of data, and columns are separated by a comma (actually, any separating character is possible), e.g., like this:

name,age,company
Mike,24,BCG
Rodrigo,25,Uber
Frank,28,FMC
Ethan,22,AirBnB

It’s very easy to read files like this into R very quickly using fread. The weather site Weather Underground offers lots of historical data in such tabular format. E.g., the data on this page about the weather recorded at Philadelphia International Airport is available as a .csv online here.

We can read this into R like so:

weather = fread('http://michaelchirico.github.io/philly_weather_data.csv')
weather
##           EST Max TemperatureF Mean TemperatureF Min TemperatureF
##  1:  2017-1-1               51                42               34
##  2:  2017-1-2               43                38               32
##  3:  2017-1-3               48                44               41
##  4:  2017-1-4               54                45               35
##  5:  2017-1-5               34                31               28
##  6:  2017-1-6               32                29               26
##  7:  2017-1-7               24                22               19
##  8:  2017-1-8               24                20               15
##  9:  2017-1-9               23                18               12
## 10: 2017-1-10               42                28               14
## 11: 2017-1-11               54                46               37
## 12: 2017-1-12               66                56               45
## 13: 2017-1-13               61                48               35
## 14: 2017-1-14               36                33               30
##     Max Dew PointF MeanDew PointF Min DewpointF Max Humidity Mean Humidity
##  1:             29             26            24           75            48
##  2:             41             37            25           96            88
##  3:             46             42            39           97            93
##  4:             46             36            10           96            74
##  5:             27             17             8           92            58
##  6:             28             21             9           93            70
##  7:             18             13             4           89            68
##  8:              7              3            -1           57            46
##  9:              9              6             4           73            56
## 10:             32             18             9           84            62
## 11:             46             40            33           93            80
## 12:             52             50            46           90            74
## 13:             50             30            14           89            63
## 14:             23             17            11           61            50
##     Min Humidity Max Sea Level PressureIn Mean Sea Level PressureIn
##  1:           30                    30.42                     30.19
##  2:           66                    30.45                     30.35
##  3:           89                    30.18                     29.79
##  4:           14                    29.83                     29.58
##  5:           23                    30.05                     29.98
##  6:           31                    30.31                     30.05
##  7:           33                    30.33                     30.24
##  8:           29                    30.69                     30.44
##  9:           36                    30.81                     30.72
## 10:           38                    30.66                     30.48
## 11:           59                    30.37                     30.26
## 12:           50                    30.19                     30.11
## 13:           36                    30.71                     30.45
## 14:           30                    30.71                     30.62
##     Min Sea Level PressureIn Max VisibilityMiles Mean VisibilityMiles
##  1:                    29.94                  10                   10
##  2:                    30.19                  10                    6
##  3:                    29.53                  10                    4
##  4:                    29.48                  10                    6
##  5:                    29.85                  10                    8
##  6:                    29.92                  10                    7
##  7:                    30.18                  10                    5
##  8:                    30.25                  10                   10
##  9:                    30.67                  10                   10
## 10:                    30.20                  10                   10
## 11:                    30.19                  10                    8
## 12:                    30.02                  10                   10
## 13:                    30.14                  10                   10
## 14:                    30.49                  10                   10
##     Min VisibilityMiles Max Wind SpeedMPH Mean Wind SpeedMPH
##  1:                  10                18                 12
##  2:                   2                15                  8
##  3:                   2                15                 11
##  4:                   2                24                 12
##  5:                   1                21                 11
##  6:                   0                13                  5
##  7:                   0                17                 13
##  8:                  10                24                 15
##  9:                  10                10                  7
## 10:                  10                17                  7
## 11:                   5                22                 12
## 12:                   6                21                 15
## 13:                  10                24                 13
## 14:                   6                10                  5
##     Max Gust SpeedMPH PrecipitationIn CloudCover   Events WindDirDegrees
##  1:                25            0.00          4                     274
##  2:                25            0.16          8     Rain             52
##  3:                28            0.20          8     Rain             31
##  4:                34            0.01          6     Rain            250
##  5:                28            0.03          7     Snow            253
##  6:                17            0.04          8     Snow            336
##  7:                25            0.08          7 Fog-Snow              1
##  8:                31            0.00          1                     294
##  9:                NA            0.00          4                     228
## 10:                29            0.01          7                     175
## 11:                30            0.24          7     Rain            212
## 12:                29            0.03          7     Rain            218
## 13:                30               T          6                     326
## 14:                NA            0.00          8     Snow             67
summary(weather)
##      EST            Max TemperatureF Mean TemperatureF Min TemperatureF
##  Length:14          Min.   :23.00    Min.   :18.00     Min.   :12.00   
##  Class :character   1st Qu.:32.50    1st Qu.:28.25     1st Qu.:20.75   
##  Mode  :character   Median :42.50    Median :35.50     Median :31.00   
##                     Mean   :42.29    Mean   :35.71     Mean   :28.79   
##                     3rd Qu.:53.25    3rd Qu.:44.75     3rd Qu.:35.00   
##                     Max.   :66.00    Max.   :56.00     Max.   :45.00   
##                                                                        
##  Max Dew PointF  MeanDew PointF  Min DewpointF    Max Humidity  
##  Min.   : 7.00   Min.   : 3.00   Min.   :-1.00   Min.   :57.00  
##  1st Qu.:24.00   1st Qu.:17.00   1st Qu.: 8.25   1st Qu.:77.25  
##  Median :30.50   Median :23.50   Median :10.50   Median :89.50  
##  Mean   :32.43   Mean   :25.43   Mean   :16.79   Mean   :84.64  
##  3rd Qu.:46.00   3rd Qu.:36.75   3rd Qu.:24.75   3rd Qu.:93.00  
##  Max.   :52.00   Max.   :50.00   Max.   :46.00   Max.   :97.00  
##                                                                 
##  Mean Humidity    Min Humidity   Max Sea Level PressureIn
##  Min.   :46.00   Min.   :14.00   Min.   :29.83           
##  1st Qu.:56.50   1st Qu.:30.00   1st Qu.:30.22           
##  Median :65.50   Median :34.50   Median :30.39           
##  Mean   :66.43   Mean   :40.29   Mean   :30.41           
##  3rd Qu.:74.00   3rd Qu.:47.00   3rd Qu.:30.68           
##  Max.   :93.00   Max.   :89.00   Max.   :30.81           
##                                                          
##  Mean Sea Level PressureIn Min Sea Level PressureIn Max VisibilityMiles
##  Min.   :29.58             Min.   :29.48            Min.   :10         
##  1st Qu.:30.07             1st Qu.:29.93            1st Qu.:10         
##  Median :30.25             Median :30.16            Median :10         
##  Mean   :30.23             Mean   :30.07            Mean   :10         
##  3rd Qu.:30.45             3rd Qu.:30.20            3rd Qu.:10         
##  Max.   :30.72             Max.   :30.67            Max.   :10         
##                                                                        
##  Mean VisibilityMiles Min VisibilityMiles Max Wind SpeedMPH
##  Min.   : 4.000       Min.   : 0.000      Min.   :10.00    
##  1st Qu.: 6.250       1st Qu.: 2.000      1st Qu.:15.00    
##  Median : 9.000       Median : 5.500      Median :17.50    
##  Mean   : 8.143       Mean   : 5.286      Mean   :17.93    
##  3rd Qu.:10.000       3rd Qu.:10.000      3rd Qu.:21.75    
##  Max.   :10.000       Max.   :10.000      Max.   :24.00    
##                                                            
##  Mean Wind SpeedMPH Max Gust SpeedMPH PrecipitationIn      CloudCover   
##  Min.   : 5.00      Min.   :17.00     Length:14          Min.   :1.000  
##  1st Qu.: 7.25      1st Qu.:25.00     Class :character   1st Qu.:6.000  
##  Median :11.50      Median :28.50     Mode  :character   Median :7.000  
##  Mean   :10.43      Mean   :27.58                        Mean   :6.286  
##  3rd Qu.:12.75      3rd Qu.:30.00                        3rd Qu.:7.750  
##  Max.   :15.00      Max.   :34.00                        Max.   :8.000  
##                     NA's   :2                                           
##     Events          WindDirDegrees 
##  Length:14          Min.   :  1.0  
##  Class :character   1st Qu.: 94.0  
##  Mode  :character   Median :223.0  
##                     Mean   :194.1  
##                     3rd Qu.:268.8  
##                     Max.   :336.0  
## 
names(weather)
##  [1] "EST"                       "Max TemperatureF"         
##  [3] "Mean TemperatureF"         "Min TemperatureF"         
##  [5] "Max Dew PointF"            "MeanDew PointF"           
##  [7] "Min DewpointF"             "Max Humidity"             
##  [9] "Mean Humidity"             "Min Humidity"             
## [11] "Max Sea Level PressureIn"  "Mean Sea Level PressureIn"
## [13] "Min Sea Level PressureIn"  "Max VisibilityMiles"      
## [15] "Mean VisibilityMiles"      "Min VisibilityMiles"      
## [17] "Max Wind SpeedMPH"         "Mean Wind SpeedMPH"       
## [19] "Max Gust SpeedMPH"         "PrecipitationIn"          
## [21] "CloudCover"                "Events"                   
## [23] "WindDirDegrees"

NB: More typically, instead of a URL, you’ll give fread the path to where a .csv file is stored on your computer.
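
Here is a minimal sketch of the local-file case. Since no file on your computer can be assumed, it first writes the small name/age/company example from earlier in this section to a temporary file (csv_path and people are names made up for this illustration) and then reads it back with fread.

```r
library(data.table)

# write the small example file to a temporary location;
#   tempfile() just generates a throwaway path for this illustration
csv_path = tempfile(fileext = ".csv")
writeLines(c("name,age,company",
             "Mike,24,BCG",
             "Rodrigo,25,Uber",
             "Frank,28,FMC",
             "Ethan,22,AirBnB"), csv_path)

# fread works the same way on a local path as on a URL
people = fread(csv_path)
names(people)
## [1] "name"    "age"     "company"
nrow(people)
## [1] 4
mean(people$age)
## [1] 24.75
```

In your own work you would replace csv_path with the location of your file, e.g. a path ending in .csv inside your project folder.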

More on data.table Access

One of the most fundamental R skills you’ll need to learn is how to access parts of a data.table or vector. This can be a little confusing at first since there are usually several different ways to accomplish the same thing. This section is intended to add more clarity to some of the material on data.tables from above. The best way to learn this is to play around with different commands and see what happens. There are some exercises at the end of the tutorial to do just that. If you don’t get the result you were expecting, try to think about why.

First I’ll build a simple data.table from the following vectors:

person = c("Linus", "Snoopy", "Lucy", "Woodstock")
age = c(5, 8, 6, 2)
weight = c(40, 25, 50, 1)
my.data.table = data.table(person, age, weight)
my.data.table
##       person age weight
## 1:     Linus   5     40
## 2:    Snoopy   8     25
## 3:      Lucy   6     50
## 4: Woodstock   2      1

Fact #1: You can use the same principles to select subsets of vectors and data.tables by position

The only real difference here is that vectors are one-dimensional

age[1:2]
## [1] 5 8
age[c(1,3)]
## [1] 5 6

whereas data.tables are two-dimensional; the first dimension is rows:

my.data.table[1:2]
##    person age weight
## 1:  Linus   5     40
## 2: Snoopy   8     25
my.data.table[c(1, 3)]
##    person age weight
## 1:  Linus   5     40
## 2:   Lucy   6     50

The second dimension is columns; we can specify rows and columns by giving two numbers inside []:

#what is the first row of the third column?
my.data.table[1, 3]
##    weight
## 1:     40
#what are the first three rows of the third column?
my.data.table[1:3, 3]
##    weight
## 1:     40
## 2:     25
## 3:     50

If you leave the part before the comma blank, you get all the rows:

my.data.table[ , 2:3]
##    age weight
## 1:   5     40
## 2:   8     25
## 3:   6     50
## 4:   2      1

If you leave the part after the comma blank (or don’t include it at all), you get all the columns:

my.data.table[c(1,3), ]
##    person age weight
## 1:  Linus   5     40
## 2:   Lucy   6     50
my.data.table[c(2,4)]
##       person age weight
## 1:    Snoopy   8     25
## 2: Woodstock   2      1

Fact #2: There are three ways to access the columns of a data.table by name.

The first way is to use [["COLUMN NAME GOES HERE"]]

my.data.table[["weight"]]
## [1] 40 25 50  1

The second is to use $, which is often faster to type since it doesn’t require the use of quotation marks:

my.data.table$weight
## [1] 40 25 50  1

Both of the preceding methods are limited in that they only allow us to reference a single column. We can reference multiple columns as follows:

my.data.table[ , c("person", "weight")]
##       person weight
## 1:     Linus     40
## 2:    Snoopy     25
## 3:      Lucy     50
## 4: Woodstock      1

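Since we left the part before the comma blank, this gave us all the rows. We can also access columns by position (though this is generally not recommended, since positions change whenever columns are added or reordered); for instance, c(1, 3) would select the same person and weight columns as above. A few more examples: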

my.data.table[ , 2]
##    age
## 1:   5
## 2:   8
## 3:   6
## 4:   2
my.data.table[ , c(1,2)]
##       person age
## 1:     Linus   5
## 2:    Snoopy   8
## 3:      Lucy   6
## 4: Woodstock   2
my.data.table[ , 1:2]
##       person age
## 1:     Linus   5
## 2:    Snoopy   8
## 3:      Lucy   6
## 4: Woodstock   2
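
One reason to be careful with positions is that they depend on the order of the columns. The sketch below illustrates this under the assumption that my.data.table holds the values printed above. The names reordered, by.position.before and by.position.after are made up for this example; copy() and setcolorder() are functions from the data.table package.

```r
library(data.table)

# Assumption: rebuild the example table from the values printed above
my.data.table = data.table(
  person = c("Linus", "Snoopy", "Lucy", "Woodstock"),
  age    = c(5, 8, 6, 2),
  weight = c(40, 25, 50, 1)
)

# Make a copy and move weight to the front
reordered = copy(my.data.table)
setcolorder(reordered, c("weight", "person", "age"))

# The same position now refers to a different column
by.position.before = my.data.table[ , 2]
by.position.after = reordered[ , 2]
names(by.position.before)
names(by.position.after)

# Accessing by name is unaffected by the reordering
identical(my.data.table[["age"]], reordered[["age"]])
```

Here position 2 means age in the original table but person in the reordered copy, while the name age refers to the same data in both.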

In some cases it’s easier to access columns of a data.table by name and in others it’s easier to access them by position.

Part 3: Exercises

If you can’t get R and RStudio to work on your computer, you can do the exercises on the R Fiddle website

http://www.r-fiddle.org/#/

  1. Calculate how many minutes there are in January
#60 minutes per hour, 24 hours per day, 31 days in January
60 * 24 * 31
## [1] 44640
  1. Add up the numbers 3 1 4 1 5 9 2 6 without using any plus signs
sum(c(3,1,4,1,5,9,2,6))
## [1] 31
  1. Load the help file for the function summary, and use summary on an object.
?summary
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
  1. Suppose I ran the following R commands in order. What result would I get after the fourth command? Do not use R to answer this: think it through and then check your answer.
x = 5
y = 7
z = x + y 
z + 3 == 15
## [1] TRUE
  1. How can I get R to print out "Go Penn!" thirty times without repeatedly typing this by hand?
rep("Go Penn!", times = 30)
##  [1] "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!"
##  [7] "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!"
## [13] "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!"
## [19] "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!"
## [25] "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!" "Go Penn!"
  1. Create a vector called x containing the sequence -1, -0.9, …, 0, 0.1, …, 0.9, 1 and then display the result
x = seq(-1, 1, 0.1)
x
##  [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3
## [15]  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  1. Create two vectors: wizards and ranking. The vector wizards should contain the following names: Harry, Ron, Fred, George, Sirius. The vector ranking should contain the following numbers: 4, 2, 5, 1, 3. Make sure to enter both in the order given.
#Remember that the elements of character vectors need to be enclosed in quotation marks. Either single or double quotes will work.
wizards = c("Harry", "Ron", "Fred", "George", "Sirius")
ranking = c(4, 2, 5, 1, 3)
  1. Extract the second element of the vector wizards.
wizards[2]
## [1] "Ron"
  1. Replace the names Fred, George and Sirius in the vector wizards with Hermione, Ginny and Malfoy, respectively.
#There are several different ways to do this. Here are two possibilities.
wizards[c(3, 4, 5)] = c("Hermione", "Ginny", "Malfoy")
wizards[3:5] = c("Hermione", "Ginny", "Malfoy")
  1. Someone who hasn’t read Harry Potter needs labels to determine who these characters are. Assign names to the elements of the vector wizards: Lead, Friend, Friend, Wife, Rival. Display the result.
names(wizards) = c("Lead", "Friend", "Friend", "Wife", "Rival")
wizards
##       Lead     Friend     Friend       Wife      Rival 
##    "Harry"      "Ron" "Hermione"    "Ginny"   "Malfoy"
  1. An avid reader of Harry Potter argues that Malfoy is not Harry’s rival by the end of the series. Change Rival to Ex-Rival.
names(wizards)[5] = "Ex-Rival"
names(wizards)
## [1] "Lead"     "Friend"   "Friend"   "Wife"     "Ex-Rival"
  1. In 2009 Steve’s income was $50,000 and his total expenses were $35,000. In 2010 his income was $52,000 and his expenses were $34,000. In 2011, his income was $52,500 and his expenses were $38,000. Finally, in 2012 Steve’s earnings were $48,000 and his expenses were $40,000. Create three vectors to store this information in parallel: years, income and expenses.
years = c(2009, 2010, 2011, 2012)
income = c(50000, 52000, 52500, 48000)
expenses = c(35000, 34000, 38000, 40000)
  1. Following on from the previous question, calculate Steve’s annual savings and store this in a vector called savings.
savings = income - expenses
  1. Assuming zero interest on bank deposits (roughly accurate at the moment), calculate the total amount that Steve has saved over all the years for which we have data.
sum(savings)
## [1] 55500
  1. Create a vector called z that lists the numbers from 12 to 23 in descending order.
z = 23:12
z
##  [1] 23 22 21 20 19 18 17 16 15 14 13 12
  1. Replace the number 13 with the number 7 in z.
z[z == 13] = 7
z
##  [1] 23 22 21 20 19 18 17 16 15 14  7 12
  1. Twenty-six students took the midterm. Here are their scores: 18, 95, 76, 90, 84, 83, 80, 79, 63, 76, 55, 78, 90, 81, 88, 89, 92, 73, 83, 72, 85, 66, 77, 82, 99, 87. Assign these values to a vector called scores.
scores = c(18, 95, 76, 90, 84, 83, 80, 79, 63, 76, 55, 78, 90, 81, 88, 89, 92, 73, 83, 72, 85, 66, 77, 82, 99, 87)
  1. Calculate the mean, median, and range of the scores.
mean(scores)
## [1] 78.5
median(scores)
## [1] 81.5
range(scores)
## [1] 18 99
  1. Create three vectors. First, store the numeric values 21, 26, 51, 22, 160, 160, 160 in a vector called age. Next, store the names Achilles, Hector, Priam, Paris, Apollo, Athena, Aphrodite in a character vector called person. Finally, store the words Aggressive, Loyal, Regal, Cowardly, Proud, Wise, Conniving in a character vector called description.
age = c(21, 26, 51, 22, 160, 160, 160)
person = c("Achilles", "Hector", "Priam", "Paris", "Apollo", "Athena", "Aphrodite")
description = c("Aggressive", "Loyal", "Regal", "Cowardly", "Proud", "Wise", "Conniving")
  1. Create a data.table called trojan.war whose columns contain the vectors from the previous question.
trojan.war = data.table(person, age, description)
  1. Suppose you wanted to display only the column of trojan.war that contains each person’s description. What command would you use?
#There are many different ways to do this:
trojan.war[, 3] 
##    description
## 1:  Aggressive
## 2:       Loyal
## 3:       Regal
## 4:    Cowardly
## 5:       Proud
## 6:        Wise
## 7:   Conniving
trojan.war$description 
## [1] "Aggressive" "Loyal"      "Regal"      "Cowardly"   "Proud"     
## [6] "Wise"       "Conniving"
trojan.war[ , "description"]
##    description
## 1:  Aggressive
## 2:       Loyal
## 3:       Regal
## 4:    Cowardly
## 5:       Proud
## 6:        Wise
## 7:   Conniving
trojan.war[["description"]]
## [1] "Aggressive" "Loyal"      "Regal"      "Cowardly"   "Proud"     
## [6] "Wise"       "Conniving"
  1. What command would you use to show information for Achilles and Hector only?
#There are several ways to do this. Here are a few:
trojan.war[c(1,2)]
##      person age description
## 1: Achilles  21  Aggressive
## 2:   Hector  26       Loyal
trojan.war[1:2]
##      person age description
## 1: Achilles  21  Aggressive
## 2:   Hector  26       Loyal
#A more advanced way that doesn't require knowing the order of the rows:
trojan.war[person %in% c("Achilles", "Hector")]
##      person age description
## 1: Achilles  21  Aggressive
## 2:   Hector  26       Loyal
  1. What command would you use to display the person and description columns for Apollo, Athena and Aphrodite only?
#There are many ways to do this. Here are a few:
trojan.war[c(5, 6, 7), c(1, 3)]
##       person description
## 1:    Apollo       Proud
## 2:    Athena        Wise
## 3: Aphrodite   Conniving
trojan.war[5:7, c("person", "description")]
##       person description
## 1:    Apollo       Proud
## 2:    Athena        Wise
## 3: Aphrodite   Conniving
#advanced method
trojan.war[person %in% c("Apollo", "Athena", "Aphrodite"),
           c("person", "description")]
##       person description
## 1:    Apollo       Proud
## 2:    Athena        Wise
## 3: Aphrodite   Conniving
  1. By now you’re probably tired of this data set. A passenger manifest for the Titanic is stored at http://www.ditraglia.com/econ103/titanic3.csv. Read this file and store it in a data.table called titanic.
titanic = fread("http://www.ditraglia.com/econ103/titanic3.csv")
  1. Calculate the product of all the even numbers between 2 and 18, inclusive.
x = seq(2, 18, 2)
x
## [1]  2  4  6  8 10 12 14 16 18
prod(x)
## [1] 185794560
  1. The column survived in the titanic data has a value of “1” to indicate that the passenger in that row survived the disaster. Display only the rows of titanic corresponding to passengers that survived.
titanic[survived == 1]
##      pclass survived                                            name
##   1:      1        1                   Allen, Miss. Elisabeth Walton
##   2:      1        1                  Allison, Master. Hudson Trevor
##   3:      1        1                             Anderson, Mr. Harry
##   4:      1        1               Andrews, Miss. Kornelia Theodosia
##   5:      1        1   Appleton, Mrs. Edward Dale (Charlotte Lamson)
##  ---                                                                
## 496:      3        1                          Turkula, Mrs. (Hedwig)
## 497:      3        1                            Vartanian, Mr. David
## 498:      3        1 Whabee, Mrs. George Joseph (Shawneene Abi-Saab)
## 499:      3        1                Wilkes, Mrs. James (Ellen Needs)
## 500:      3        1         Yasbeck, Mrs. Antoni (Selini Alexander)
##         sex   age sibsp parch ticket     fare   cabin embarked  boat body
##   1: female 29.00     0     0  24160 211.3375      B5        S     2     
##   2:   male  0.92     1     2 113781 151.5500 C22 C26        S    11     
##   3:   male 48.00     0     0  19952  26.5500     E12        S     3     
##   4: female 63.00     1     0  13502  77.9583      D7        S    10     
##   5: female 53.00     2     0  11769  51.4792    C101        S     D     
##  ---                                                                     
## 496: female 63.00     0     0   4134   9.5875                S    15     
## 497:   male 22.00     0     0   2658   7.2250                C 13 15     
## 498: female 38.00     0     0   2688   7.2292                C     C     
## 499: female 47.00     1     0 363272   7.0000                S           
## 500: female 15.00     1     0   2659  14.4542                C           
##                            home.dest
##   1:                    St Louis, MO
##   2: Montreal, PQ / Chesterville, ON
##   3:                    New York, NY
##   4:                      Hudson, NY
##   5:             Bayside, Queens, NY
##  ---                                
## 496:                                
## 497:                                
## 498:                                
## 499:                                
## 500:
  1. The column sex in the titanic data indicates each passenger’s sex. Display only the rows of titanic corresponding to men.
titanic[sex == 'male']
##      pclass survived                                 name  sex   age sibsp
##   1:      1        1       Allison, Master. Hudson Trevor male  0.92     1
##   2:      1        0 Allison, Mr. Hudson Joshua Creighton male 30.00     1
##   3:      1        1                  Anderson, Mr. Harry male 48.00     0
##   4:      1        0               Andrews, Mr. Thomas Jr male 39.00     0
##   5:      1        0              Artagaveytia, Mr. Ramon male 71.00     0
##  ---                                                                      
## 839:      3        0                    Yousif, Mr. Wazli male    NA     0
## 840:      3        0                Yousseff, Mr. Gerious male    NA     0
## 841:      3        0            Zakarian, Mr. Mapriededer male 26.50     0
## 842:      3        0                  Zakarian, Mr. Ortin male 27.00     0
## 843:      3        0                   Zimmerman, Mr. Leo male 29.00     0
##      parch   ticket     fare   cabin embarked boat body
##   1:     2   113781 151.5500 C22 C26        S   11     
##   2:     2   113781 151.5500 C22 C26        S       135
##   3:     0    19952  26.5500     E12        S    3     
##   4:     0   112050   0.0000     A36        S          
##   5:     0 PC 17609  49.5042                C        22
##  ---                                                   
## 839:     0     2647   7.2250                C          
## 840:     0     2627  14.4583                C          
## 841:     0     2656   7.2250                C       304
## 842:     0     2670   7.2250                C          
## 843:     0   315082   7.8750                S          
##                            home.dest
##   1: Montreal, PQ / Chesterville, ON
##   2: Montreal, PQ / Chesterville, ON
##   3:                    New York, NY
##   4:                     Belfast, NI
##   5:             Montevideo, Uruguay
##  ---                                
## 839:                                
## 840:                                
## 841:                                
## 842:                                
## 843: