This is a long tutorial, but the material is fairly straightforward. If you run into any trouble feel free to post on Piazza.
The most crucial piece of advice for learning a programming language is to recognize that it requires the same approach as learning a foreign language: you’ll benefit most from being actively engaged. That means not just reading along with these tutorials, but actively processing what they say and running the code yourself.
Carry out the following two steps in order:
Go to http://cran.r-project.org/ and install the version of R for your operating system.
Go to http://rstudio.org/download/desktop and click the link listed under “Recommended for Your System”. Follow the instructions to install RStudio.
To make sure this worked, open the program RStudio and go to File > New > R Script. This will open a blank text document. In the document, type the text given in the box below, then click and drag to highlight both lines of code and click the button marked “Run.” If everything is working correctly, the console should display TRUE.
x = 5
x == 5
Congratulations: you’ve just written your first R script! To save it, go to File > Save As, and choose a name. NOTE: Always save your scripts as .R files so they’ll open in RStudio by default.
Note that you can run one line of your script at a time by moving your cursor to that line and pressing CONTROL-ENTER (on Windows or Linux) or COMMAND-RETURN (on Mac OS X). Another helpful shortcut is CONTROL-A (COMMAND-A on Mac), which highlights all of the lines of code in the text editor.
Here are some of the most fundamental things you can do with R.
#add numbers
1 + 1
## [1] 2
#subtract them
8 - 4
## [1] 4
#divide
13/2
## [1] 6.5
#multiply
4*pi
## [1] 12.56637
#exponentiate
2^10
## [1] 1024
3 < 4
## [1] TRUE
3 > 4
## [1] FALSE
#contrast with 3 = 4; see section about variables below
3 == 4
## [1] FALSE
#!= means "not equal to"
3 != 4
## [1] TRUE
4 >= 5
## [1] FALSE
4 <= 5
## [1] TRUE
2 + 2 == 5
## [1] FALSE
10 - 6 == 4
## [1] TRUE
Numbers are bread and butter for computers, but text is what will facilitate understanding for us mere mortals.
'Econometrics is awesome'
## [1] "Econometrics is awesome"
#R delimits strings with EITHER double or single quotes.
# There is only a very minimal difference
"Econometrics is still awesome"
## [1] "Econometrics is still awesome"
Just like in algebra, variables are a great form of shorthand. Instead of writing 3.1415926… all the time, we can just write pi.
Assignment to a variable happens from right to left: the value on the right side gets assigned to the name on the left side. You can use nearly anything as a variable name in R. The only rules are:
Letters, digits, . and _ are OK, but no other symbols.
A name cannot begin with a digit or an underscore (2squared and _one are illegal).
[A note for those of you who have programming experience: while R supports object-oriented programming, periods (.) do not have a special meaning in the language. For historical reasons, R programmers often use periods in place of underscores in variable names, but either works. Just be consistent to keep your code readable.]
x = 42
x / 2
## [1] 21
#if we assign something else to x,
# the old value is deleted
x = "Melody to Funkytown!"
x
## [1] "Melody to Funkytown!"
x = 5
x == 5
## [1] TRUE
foo = 3
bar = 5
foo.bar = foo + bar
foo.bar
## [1] 8
foo.bar2 = 2 * foo.bar
foo.bar2
## [1] 16
foo_bar = foo - bar
foo_bar
## [1] -2
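If you're ever unsure whether a name is legal, one way to check is base R's make.names function, which repairs illegal names and leaves legal ones alone. This is only a small sketch: the candidate names are made up for illustration, and make.names is not otherwise used in this tutorial.

```r
# some candidate variable names; the first two are legal, the last two are not
candidates = c("foo.bar", "foo_bar2", "2squared", "_one")

# make.names() rewrites illegal names into legal ones, so a name
#   is legal exactly when make.names() returns it unchanged
is_legal = make.names(candidates) == candidates
is_legal
## [1] TRUE TRUE FALSE FALSE
```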
Note: In programmer speak, = here is an “assignment operator”: it’s the thing used to assign values to a variable name. R also has a second assignment operator that you’re bound to see sooner or later, <-. So x <- 42 and x = 42 do the same thing: both assign the value 42 to the name x. We’ll try to stick with using = since it’s easier to type and in some ways more intuitive. See this wonderful post for some more history and a very subtle difference between the two operators that you needn’t concern yourself with for now.
In R, a vector is just an (ordered) set of related things. You should basically think of it like a column in Excel.
x = c(4, 7, 9)
x
## [1] 4 7 9
y = c('a', 'b', 'c')
y
## [1] "a" "b" "c"
4, 7, and 9 are “related” because they’re all numbers; a, b and c are all letters. Having variables is becoming more convenient: instead of having to write c(4, 7, 9) all the time, we can just write x.
What happens when we try and combine things that aren’t so obviously related?
x = c(1, TRUE, "three")
x
## [1] "1" "TRUE" "three"
Note the quotation marks. R has converted 1 and TRUE into text representations. That’s because 1 and TRUE are different types than "three". There are four basic types of variables you’re likely to encounter in this class, listed here in hierarchical order:
logical: TRUE or FALSE
integer: 0L, -1L, 1L, etc. A (real) number without a decimal part. Technical note: they take up less space in the computer than numbers with decimals.
numeric: pi, 0.34, 1.4043, etc. A real number.
character: "some words", "more words", etc.
Vectors are converted to the highest type on this list that is present: x above contains "three", so the whole vector becomes a character vector.
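To see this hierarchy in action, here is a short sketch that mixes two types at a time and asks R which type it settled on. The particular values are arbitrary; class is the same function introduced later in this tutorial.

```r
# each vector mixes two types; class() reports the type R chose
class(c(TRUE, 2L))       # logical + integer
## [1] "integer"
class(c(2L, 0.5))        # integer + numeric
## [1] "numeric"
class(c(0.5, "half"))    # numeric + character
## [1] "character"
```

In each case the result is whichever of the two types sits further down the list.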
Vectors make it easy to do many computations all at once – adding one to a list of numbers, dividing all of them by 3, etc. And as long as two vectors are the same length, we can combine them in natural ways:
x = c(1, 2, 3)
x + 4
## [1] 5 6 7
x/3
## [1] 0.3333333 0.6666667 1.0000000
-x
## [1] -1 -2 -3
x^3
## [1] 1 8 27
y = c(3, 2, 1)
x - y
## [1] -2 0 2
x * y
## [1] 3 4 3
x/y
## [1] 0.3333333 1.0000000 3.0000000
x > 2
## [1] FALSE FALSE TRUE
x >= 2
## [1] FALSE TRUE TRUE
Just like in math, a function is a way of mapping input to output, and just like in most math classes, you can spot functions since they use parentheses: (). We’ve already seen the concatenate function c used (for example) to create vectors.
We can also apply any number of ubiquitous functions to our vector input. Just a small taste:
x = c(1, 2, 3)
#sum: add up the elements of a vector
sum(x)
## [1] 6
#Just like you can use the command sum to add up the
# elements of a numeric vector, you can use
# prod to take their product:
prod(x)
## [1] 6
sqrt(x)
## [1] 1.000000 1.414214 1.732051
y = c(-1, 2, 4)
#abs: absolute value
abs(y)
## [1] 1 2 4
#exp: exponential. exp(x) is e^x
exp(y)
## [1] 0.3678794 7.3890561 54.5981500
#log: _natural_ logarithm (base e)
log(x)
## [1] 0.0000000 0.6931472 1.0986123
#sin/cos: trigonometric functions. Note that these
# interpret their input as *radians* rather than degrees.
sin(x) + cos(y)
## [1] 1.3817733 0.4931506 -0.5125236
max(y)
## [1] 4
min(y)
## [1] -1
range(y)
## [1] -1 4
mean(x)
## [1] 2
median(x)
## [1] 2
Another thing that we will do all the time is use regularly-spaced sequences of numbers. These are created in R with : or seq:
x = 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
y = 10:1
y
## [1] 10 9 8 7 6 5 4 3 2 1
#sometimes the gap is not 1
z = seq(0, 1, by = .02)
z
## [1] 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26
## [15] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54
## [29] 0.56 0.58 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78 0.80 0.82
## [43] 0.84 0.86 0.88 0.90 0.92 0.94 0.96 0.98 1.00
#other times we care less about the gap and
# more about how many points we get out
w = seq(0, 1, length.out = 20)
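The output of that last command isn't displayed above, so here is a quick sketch of how you might check what length.out did. It only uses w as defined above plus functions introduced elsewhere in this tutorial.

```r
w = seq(0, 1, length.out = 20)

# we asked for 20 points, and that's what we got
length(w)
## [1] 20

# the points still run from 0 to 1
range(w)
## [1] 0 1

# R worked out the gap for us: 19 equal steps of size 1/19
w[2] - w[1]
## [1] 0.05263158
```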
In addition to math/arithmetic functions, there is a litany of basic programming functions that you’re likely to use all of the time:
x = 99:32
#length: how many elements (items) are there in x?
length(x)
## [1] 68
y = c("hey you!", "out there in the cold")
#what TYPE of variable does R think this is?
class(y)
## [1] "character"
#rep: repeat/reproduce
rep(y, 4)
## [1] "hey you!" "out there in the cold" "hey you!"
## [4] "out there in the cold" "hey you!" "out there in the cold"
## [7] "hey you!" "out there in the cold"
#head/tail: display only the beginning/end
# of an object -- very useful for very
# large objects
x = 1:100000
head(x)
## [1] 1 2 3 4 5 6
tail(x)
## [1] 99995 99996 99997 99998 99999 100000
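By default head and tail show six elements. Both accept a second argument saying how many elements you want instead; a small sketch:

```r
x = 1:100000

# only the first 3 elements
head(x, 3)
## [1] 1 2 3

# only the last 2 elements
tail(x, 2)
## [1] 99999 100000
```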
[
Often we want to examine only part of a vector, most commonly the part that satisfies some condition, but sometimes just the first or last few elements. To do this we extract or subset those elements by using [:
x = c(5, 4, 1)
x[1]
## [1] 5
x[3]
## [1] 1
x[1:2]
## [1] 5 4
x[2:3]
## [1] 4 1
In the syntax x[something], note that something is itself a vector! So the above is all shorthand for more complicated types of subsets:
x = 20:30
x
## [1] 20 21 22 23 24 25 26 27 28 29 30
x[c(1, 3, 5)]
## [1] 20 22 24
x[c(5, 9)]
## [1] 24 28
x[seq(1, 10, by = 2)]
## [1] 20 22 24 26 28
Besides being an integer vector, something can be a logical vector of the same length as the vector itself:
x = c(5, 6, 7)
x[c(TRUE, TRUE, FALSE)]
## [1] 5 6
x[c(FALSE, TRUE, FALSE)]
## [1] 6
x[c(FALSE, FALSE, TRUE)]
## [1] 7
Most commonly we’ll do something that’s identical to the above but reads more naturally:
x = c(-1, 0, 1)
x > 0
## [1] FALSE FALSE TRUE
x[x > 0]
## [1] 1
x[x <= 0]
## [1] -1 0
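It may help to see that x[x > 0] is really two steps rolled into one. The sketch below splits them apart; the name keep is just an illustrative choice.

```r
x = c(-1, 0, 1)

# step 1: the comparison produces a logical vector, one entry per element of x
keep = x > 0
keep
## [1] FALSE FALSE TRUE

# step 2: subsetting with that logical vector keeps only the TRUE positions
x[keep]
## [1] 1

# the one-line version gives exactly the same answer
identical(x[keep], x[x > 0])
## [1] TRUE
```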
We can also replace parts of a vector by subsetting:
x = c(-1, 5, 10)
x[3] = 4
x
## [1] -1 5 4
x[x < 0] = 0
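The result of that last replacement isn't shown above. Running both replacements from scratch and printing x gives:

```r
x = c(-1, 5, 10)
x[3] = 4        # replace the third element with 4
x[x < 0] = 0    # replace every negative element with 0
x
## [1] 0 5 4
```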
It’s also often useful to name our vectors to help organize the information. Suppose we were keeping track of the ages of the Trumps:
trump_ages = c(70, 46, 38, 34, 32, 22, 9)
This is nice, but much more useful if we keep track of who each element represents:
trump_ages = c(Donald = 70, Melania = 46, Donald_Jr = 38, Ivanka = 34,
Eric = 32, Tiffany = 22, Barron = 9)
trump_ages
## Donald Melania Donald_Jr Ivanka Eric Tiffany Barron
## 70 46 38 34 32 22 9
We can also use the names function to assign names; this is sometimes easier, e.g., if the names have spaces:
names(trump_ages) = c("Donald", "Melania", "Donald, Jr.", "Ivanka", "Eric", "Tiffany", "Barron")
trump_ages
## Donald Melania Donald, Jr. Ivanka Eric Tiffany
## 70 46 38 34 32 22
## Barron
## 9
This also makes code for subsetting much easier to read, since we can subset by the names:
trump_ages["Donald"]
## Donald
## 70
trump_ages[c("Donald", "Barron")]
## Donald Barron
## 70 9
If you’re unsure of how something works in R – what the arguments are to a function, how it works, etc. – your first step is to check the documentation:
?sum
?cos
?"="
We saw above that R doesn’t like vectors to have different types: c(TRUE, 1, "Frank") becomes c("TRUE", "1", "Frank"). But storing objects with different types is absolutely fundamental to data analysis.
R has a different type of object besides a vector used to store data of different types side-by-side: a list:
x = list(TRUE, 1, "Frank")
x
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] "Frank"
Note how different the output looks, as compared to using c! The quotation marks are gone except for the last component. You can ignore the mess of [[ and [ for now, but as an intimation, consider some more complicated lists:
x = list(c(1, 2), c("a", "b"), c(TRUE, FALSE), c(5L, 6L))
x
## [[1]]
## [1] 1 2
##
## [[2]]
## [1] "a" "b"
##
## [[3]]
## [1] TRUE FALSE
##
## [[4]]
## [1] 5 6
y = list(list(1, 2, 3), list(4:5), 6)
y
## [[1]]
## [[1]][[1]]
## [1] 1
##
## [[1]][[2]]
## [1] 2
##
## [[1]][[3]]
## [1] 3
##
##
## [[2]]
## [[2]][[1]]
## [1] 4 5
##
##
## [[3]]
## [1] 6
x is a list which has 4 components, each of which is a vector with 2 components. This gives the first hint at how R treats a dataset with many variables of different types: at core, R stores a data set in a list!
y is a nested list: it’s a list that has lists for some of its components. This is very useful for more advanced operations, but probably won’t come up for quite some time, so don’t worry if you haven’t wrapped your head around this yet.
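If you'd like to confirm the description of x without reading through the printed output, the following sketch asks R directly. Note that lengths and sapply are base R functions that aren't covered elsewhere in this tutorial; they're used here only for illustration.

```r
x = list(c(1, 2), c("a", "b"), c(TRUE, FALSE), c(5L, 6L))

# how many components does the list have?
length(x)
## [1] 4

# how many elements are in each component?
lengths(x)
## [1] 2 2 2 2

# what type is each component? sapply applies class to every component
sapply(x, class)
## [1] "numeric" "character" "logical" "integer"
```

Each component kept its own type, which is exactly what c would not have allowed.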
One of the things that makes R truly exceptional is its vast library of user-contributed packages.
R comes pre-loaded with a boat-load of the most common functions / methods of analysis. But in no way is this built-in library complete.
Complementing this core of the most common operations are external packages, which are basically sets of functions designed to accomplish specific tasks.
Best of all, unlike some super-expensive programming languages, all of the thousands of packages available to R users (most importantly through CRAN, the Comprehensive R Archive Network) are completely free of charge.
The three most important things to know about packages for now are where to find them, how to install them, and how to load them.
We’ll work extensively with the data.table package, which was built for working with huge data sets.
Where to find packages? Long story short: Google. Got a particular statistical technique in mind? The best R package for it is almost always the top Google result if asked correctly.
How to install them? Just use install.packages!
install.packages("data.table")
This will download the code for the package to your computer, to a place that R understands.
We do not yet have access to the functions in the package. We have to load it first.
How to load a package? Just add it to your library!
library(data.table)
Et voila! You’ll now have access to all of the awesome functions in the data.table package. You can also Google “tutorial data.table” (or in general “tutorial [package name]”) and you’re very likely to find a trove of sites trying to help you learn the package.
data.tables
Data sets are the lifeblood of a data lover!
As mentioned above, data sets in R are basically lists where every element has the same length. In basic R, this is done with a data.frame, but it’ll be easier for a beginner to understand the syntax of a data.table, so you can forget about data.frames for now.
We can build a data.table from scratch with the data.table command. This command lets you build up a data.table from several vectors of the same length:
foo = 1:5
bar = 2 * foo
foo.bar = data.table(foo, bar)
foo.bar
## foo bar
## 1: 1 2
## 2: 2 4
## 3: 3 6
## 4: 4 8
## 5: 5 10
In the preceding example I built a data.table with only two columns, but you can add as many as you like. Just separate them by commas:
y = -4:0
data.table(foo, bar, y)
## foo bar y
## 1: 1 2 -4
## 2: 2 4 -3
## 3: 3 6 -2
## 4: 4 8 -1
## 5: 5 10 0
When you’re working with data, you’ll often want to look at subsets that satisfy a particular condition. First we’ll set up a simple data.table:
location = c("New York", "Chicago", "Boston", "Boston", "New York")
salary = c(70000, 80000, 60000, 50000, 45000)
title = c("Office Manager", "Research Assistant", "Analyst", "Office Manager", "Analyst")
hours = c(50, 56, 65, 40, 50)
jobsearch = data.table(location, salary, title, hours)
jobsearch
## location salary title hours
## 1: New York 70000 Office Manager 50
## 2: Chicago 80000 Research Assistant 56
## 3: Boston 60000 Analyst 65
## 4: Boston 50000 Office Manager 40
## 5: New York 45000 Analyst 50
Now, suppose you wanted to see only the jobs in New York. You could select them as follows:
jobsearch[location == 'New York']
## location salary title hours
## 1: New York 70000 Office Manager 50
## 2: New York 45000 Analyst 50
Notice the use of the double equal sign. This command is testing a logical condition. If you use a single equals sign, this won’t work since = is what is used to name the arguments to a function in R. The preceding command looks at the data.table jobsearch and then the column location and checks which entries satisfy the condition that the location is "New York". Finally, the function returns only these rows of the data.table.
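To see what the condition inside the brackets is doing, it can help to evaluate it on its own. Here is a minimal sketch using the plain location vector from above. The which function is base R but isn't covered elsewhere in this tutorial.

```r
location = c("New York", "Chicago", "Boston", "Boston", "New York")

# == compares every entry with "New York" and answers TRUE or FALSE for each
location == "New York"
## [1] TRUE FALSE FALSE FALSE TRUE

# which() reports the positions of the TRUE entries
which(location == "New York")
## [1] 1 5
```

Positions 1 and 5 are the two New York rows that were returned from jobsearch.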
Now suppose you wanted to extract only those jobs that pay more than $50,000. The command for this is as follows:
jobsearch[salary > 50000]
## location salary title hours
## 1: New York 70000 Office Manager 50
## 2: Chicago 80000 Research Assistant 56
## 3: Boston 60000 Analyst 65
Finally, suppose the most you’re willing to work per week is 50 hours. Here are the jobs you should consider:
jobsearch[hours <= 50]
## location salary title hours
## 1: New York 70000 Office Manager 50
## 2: Boston 50000 Office Manager 40
## 3: New York 45000 Analyst 50
The vast majority of the time, you won’t be using data that you type in by hand; you’ll be importing data from external sources. One of the most common formats for such data is comma-separated values: each line of the file represents a row of data, and columns are separated by commas (actually, any separating character is possible), e.g., like this:
name,age,company
Mike,24,BCG
Rodrigo,25,Uber
Frank,28,FMC
Ethan,22,AirBnB
It’s very easy to read files like this into R very quickly using fread. The weather site Weather Underground offers lots of historical data in such tabular format. E.g., the data on this page about the weather recorded at Philadelphia International Airport is available as a .csv online here.
We can read this into R like so:
weather = fread('http://michaelchirico.github.io/philly_weather_data.csv')
weather
## EST Max TemperatureF Mean TemperatureF Min TemperatureF
## 1: 2017-1-1 51 42 34
## 2: 2017-1-2 43 38 32
## 3: 2017-1-3 48 44 41
## 4: 2017-1-4 54 45 35
## 5: 2017-1-5 34 31 28
## 6: 2017-1-6 32 29 26
## 7: 2017-1-7 24 22 19
## 8: 2017-1-8 24 20 15
## 9: 2017-1-9 23 18 12
## 10: 2017-1-10 42 28 14
## 11: 2017-1-11 54 46 37
## 12: 2017-1-12 66 56 45
## 13: 2017-1-13 61 48 35
## 14: 2017-1-14 36 33 30
## Max Dew PointF MeanDew PointF Min DewpointF Max Humidity Mean Humidity
## 1: 29 26 24 75 48
## 2: 41 37 25 96 88
## 3: 46 42 39 97 93
## 4: 46 36 10 96 74
## 5: 27 17 8 92 58
## 6: 28 21 9 93 70
## 7: 18 13 4 89 68
## 8: 7 3 -1 57 46
## 9: 9 6 4 73 56
## 10: 32 18 9 84 62
## 11: 46 40 33 93 80
## 12: 52 50 46 90 74
## 13: 50 30 14 89 63
## 14: 23 17 11 61 50
## Min Humidity Max Sea Level PressureIn Mean Sea Level PressureIn
## 1: 30 30.42 30.19
## 2: 66 30.45 30.35
## 3: 89 30.18 29.79
## 4: 14 29.83 29.58
## 5: 23 30.05 29.98
## 6: 31 30.31 30.05
## 7: 33 30.33 30.24
## 8: 29 30.69 30.44
## 9: 36 30.81 30.72
## 10: 38 30.66 30.48
## 11: 59 30.37 30.26
## 12: 50 30.19 30.11
## 13: 36 30.71 30.45
## 14: 30 30.71 30.62
## Min Sea Level PressureIn Max VisibilityMiles Mean VisibilityMiles
## 1: 29.94 10 10
## 2: 30.19 10 6
## 3: 29.53 10 4
## 4: 29.48 10 6
## 5: 29.85 10 8
## 6: 29.92 10 7
## 7: 30.18 10 5
## 8: 30.25 10 10
## 9: 30.67 10 10
## 10: 30.20 10 10
## 11: 30.19 10 8
## 12: 30.02 10 10
## 13: 30.14 10 10
## 14: 30.49 10 10
## Min VisibilityMiles Max Wind SpeedMPH Mean Wind SpeedMPH
## 1: 10 18 12
## 2: 2 15 8
## 3: 2 15 11
## 4: 2 24 12
## 5: 1 21 11
## 6: 0 13 5
## 7: 0 17 13
## 8: 10 24 15
## 9: 10 10 7
## 10: 10 17 7
## 11: 5 22 12
## 12: 6 21 15
## 13: 10 24 13
## 14: 6 10 5
## Max Gust SpeedMPH PrecipitationIn CloudCover Events WindDirDegrees
## 1: 25 0.00 4 274
## 2: 25 0.16 8 Rain 52
## 3: 28 0.20 8 Rain 31
## 4: 34 0.01 6 Rain 250
## 5: 28 0.03 7 Snow 253
## 6: 17 0.04 8 Snow 336
## 7: 25 0.08 7 Fog-Snow 1
## 8: 31 0.00 1 294
## 9: NA 0.00 4 228
## 10: 29 0.01 7 175
## 11: 30 0.24 7 Rain 212
## 12: 29 0.03 7 Rain 218
## 13: 30 T 6 326
## 14: NA 0.00 8 Snow 67
summary(weather)
## EST Max TemperatureF Mean TemperatureF Min TemperatureF
## Length:14 Min. :23.00 Min. :18.00 Min. :12.00
## Class :character 1st Qu.:32.50 1st Qu.:28.25 1st Qu.:20.75
## Mode :character Median :42.50 Median :35.50 Median :31.00
## Mean :42.29 Mean :35.71 Mean :28.79
## 3rd Qu.:53.25 3rd Qu.:44.75 3rd Qu.:35.00
## Max. :66.00 Max. :56.00 Max. :45.00
##
## Max Dew PointF MeanDew PointF Min DewpointF Max Humidity
## Min. : 7.00 Min. : 3.00 Min. :-1.00 Min. :57.00
## 1st Qu.:24.00 1st Qu.:17.00 1st Qu.: 8.25 1st Qu.:77.25
## Median :30.50 Median :23.50 Median :10.50 Median :89.50
## Mean :32.43 Mean :25.43 Mean :16.79 Mean :84.64
## 3rd Qu.:46.00 3rd Qu.:36.75 3rd Qu.:24.75 3rd Qu.:93.00
## Max. :52.00 Max. :50.00 Max. :46.00 Max. :97.00
##
## Mean Humidity Min Humidity Max Sea Level PressureIn
## Min. :46.00 Min. :14.00 Min. :29.83
## 1st Qu.:56.50 1st Qu.:30.00 1st Qu.:30.22
## Median :65.50 Median :34.50 Median :30.39
## Mean :66.43 Mean :40.29 Mean :30.41
## 3rd Qu.:74.00 3rd Qu.:47.00 3rd Qu.:30.68
## Max. :93.00 Max. :89.00 Max. :30.81
##
## Mean Sea Level PressureIn Min Sea Level PressureIn Max VisibilityMiles
## Min. :29.58 Min. :29.48 Min. :10
## 1st Qu.:30.07 1st Qu.:29.93 1st Qu.:10
## Median :30.25 Median :30.16 Median :10
## Mean :30.23 Mean :30.07 Mean :10
## 3rd Qu.:30.45 3rd Qu.:30.20 3rd Qu.:10
## Max. :30.72 Max. :30.67 Max. :10
##
## Mean VisibilityMiles Min VisibilityMiles Max Wind SpeedMPH
## Min. : 4.000 Min. : 0.000 Min. :10.00
## 1st Qu.: 6.250 1st Qu.: 2.000 1st Qu.:15.00
## Median : 9.000 Median : 5.500 Median :17.50
## Mean : 8.143 Mean : 5.286 Mean :17.93
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:21.75
## Max. :10.000 Max. :10.000 Max. :24.00
##
## Mean Wind SpeedMPH Max Gust SpeedMPH PrecipitationIn CloudCover
## Min. : 5.00 Min. :17.00 Length:14 Min. :1.000
## 1st Qu.: 7.25 1st Qu.:25.00 Class :character 1st Qu.:6.000
## Median :11.50 Median :28.50 Mode :character Median :7.000
## Mean :10.43 Mean :27.58 Mean :6.286
## 3rd Qu.:12.75 3rd Qu.:30.00 3rd Qu.:7.750
## Max. :15.00 Max. :34.00 Max. :8.000
## NA's :2
## Events WindDirDegrees
## Length:14 Min. : 1.0
## Class :character 1st Qu.: 94.0
## Mode :character Median :223.0
## Mean :194.1
## 3rd Qu.:268.8
## Max. :336.0
##
names(weather)
## [1] "EST" "Max TemperatureF"
## [3] "Mean TemperatureF" "Min TemperatureF"
## [5] "Max Dew PointF" "MeanDew PointF"
## [7] "Min DewpointF" "Max Humidity"
## [9] "Mean Humidity" "Min Humidity"
## [11] "Max Sea Level PressureIn" "Mean Sea Level PressureIn"
## [13] "Min Sea Level PressureIn" "Max VisibilityMiles"
## [15] "Mean VisibilityMiles" "Min VisibilityMiles"
## [17] "Max Wind SpeedMPH" "Mean Wind SpeedMPH"
## [19] "Max Gust SpeedMPH" "PrecipitationIn"
## [21] "CloudCover" "Events"
## [23] "WindDirDegrees"
NB: More typically, instead of a URL, you’ll give fread the path to where a .csv file is stored on your computer.
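As a rough sketch of reading from disk, the example below first writes the small name/age/company file shown earlier to a temporary location and then reads it back with fread. The names csv_path and people are made up for illustration, and the example assumes the data.table package is installed. With your own data you would skip the writing step and pass the path of your file.

```r
library(data.table)

# write the small example file to a temporary location
#   (csv_path is a hypothetical name; use your own file's path instead)
csv_path = tempfile(fileext = ".csv")
writeLines(c("name,age,company",
             "Mike,24,BCG",
             "Rodrigo,25,Uber",
             "Frank,28,FMC",
             "Ethan,22,AirBnB"),
           csv_path)

# read the file from disk, just as we did with the URL
people = fread(csv_path)

# the first line of the file became the column names
names(people)
## [1] "name" "age" "company"

# and each remaining line became a row
nrow(people)
## [1] 4
```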
data.table Access
One of the most fundamental R skills you’ll need to learn is how to access parts of a data.table or vector. This can be a little confusing at first since there are usually several different ways to accomplish the same thing. This section is intended to add more clarity to some of the material on data.tables from above. The best way to learn this is to play around with different commands and see what happens. There are some exercises at the end of the tutorial to do just that. If you don’t get the result you were expecting, try to think about why.
First I’ll build a simple data.table from the following vectors:
person = c("Linus", "Snoopy", "Lucy", "Woodstock")
age = c(5, 8, 6, 2)
weight = c(40, 25, 50, 1)
my.data.table = data.table(person, age, weight)
my.data.table
## person age weight
## 1: Linus 5 40
## 2: Snoopy 8 25
## 3: Lucy 6 50
## 4: Woodstock 2 1
data.tables by position
The only real difference here is that vectors are one-dimensional
age[1:2]
## [1] 5 8
age[c(1,3)]
## [1] 5 6
whereas data.tables are two-dimensional; the first dimension is rows:
my.data.table[1:2]
## person age weight
## 1: Linus 5 40
## 2: Snoopy 8 25
my.data.table[c(1, 3)]
## person age weight
## 1: Linus 5 40
## 2: Lucy 6 50
The second dimension is columns; we can specify rows and columns by giving two numbers inside []:
#what is the first row of the third column?
my.data.table[1, 3]
## weight
## 1: 40
#what are the first three rows of the third column?
my.data.table[1:3, 3]
## weight
## 1: 40
## 2: 25
## 3: 50
If you leave the part before the comma blank, you get all the rows:
my.data.table[ , 2:3]
## age weight
## 1: 5 40
## 2: 8 25
## 3: 6 50
## 4: 2 1
If you leave the part after the comma blank (or don’t include it at all), you get all the columns:
my.data.table[c(1,3), ]
## person age weight
## 1: Linus 5 40
## 2: Lucy 6 50
my.data.table[c(2,4)]
## person age weight
## 1: Snoopy 8 25
## 2: Woodstock 2 1
data.table by name
The first way to access a column by name is to use [["COLUMN NAME GOES HERE"]]:
my.data.table[["weight"]]
## [1] 40 25 50 1
The second is to use $, which is often faster to type since it doesn’t require the use of quotation marks:
my.data.table$weight
## [1] 40 25 50 1
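Notice that both of these print like a vector rather than like a table. The sketch below checks this directly, rebuilding my.data.table so that it runs on its own. It assumes data.table is installed; is.vector and is.data.table aren't covered elsewhere in this tutorial.

```r
library(data.table)
my.data.table = data.table(person = c("Linus", "Snoopy", "Lucy", "Woodstock"),
                           age = c(5, 8, 6, 2),
                           weight = c(40, 25, 50, 1))

# [[ and $ both hand back the column itself, an ordinary vector
is.vector(my.data.table[["weight"]])
## [1] TRUE
identical(my.data.table[["weight"]], my.data.table$weight)
## [1] TRUE

# by contrast, [ , "weight"] returns a data.table with a single column
is.data.table(my.data.table[ , "weight"])
## [1] TRUE
```

This matters in practice: functions like mean or sum want the vector, so you would use the [[ or $ form with them.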
Both of the preceding methods are limited in that they only allow us to reference a single column. We can reference multiple columns as follows:
my.data.table[ , c("person", "weight")]
## person weight
## 1: Linus 40
## 2: Snoopy 25
## 3: Lucy 50
## 4: Woodstock 1
Since we left the part before the comma blank, this gave us all the rows. We can select columns in the same way by position (though this is generally not recommended):
my.data.table[ , 2]
## age
## 1: 5
## 2: 8
## 3: 6
## 4: 2
my.data.table[ , c(1,2)]
## person age
## 1: Linus 5
## 2: Snoopy 8
## 3: Lucy 6
## 4: Woodstock 2
my.data.table[ , 1:2]
## person age
## 1: Linus 5
## 2: Snoopy 8
## 3: Lucy 6
## 4: Woodstock 2
In some cases it’s easier to access columns of a data.table by name and in others it’s easier to access them by position.
If you can’t get R and RStudio to work on your computer, you can do the exercises on the R Fiddle website.
60 * 24 * 7 * 31
## [1] 312480
sum(c(3,1,4,1,5,9,2,6))
## [1] 31
Look up the documentation for summary, and use summary on an object.
?summary
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
x = 5
y = 7
z = x + y
z + 3 == 15
x = 5
y = 7
z = x + y
z + 3 == 15
## [1] TRUE
How would you print "Go Penn!" thirty times without repeatedly typing this by hand?
rep("Go Penn", times = 30)
## [1] "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn"
## [8] "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn"
## [15] "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn"
## [22] "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn" "Go Penn"
## [29] "Go Penn" "Go Penn"
Create a vector x containing the sequence -1, -0.9, …, 0, 0.1, …, 0.9, 1 and then display the result.
x = seq(-1, 1, 0.1)
x
## [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3
## [15] 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Create two vectors called wizards and ranking. The vector wizards should contain the following names: Harry, Ron, Fred, George, Sirius. The vector ranking should contain the following numbers: 4, 2, 5, 1, 3. Make sure to put these in order.
#Remember that the elements of character vectors need to be enclosed in quotation marks. Either single or double quotes will work.
wizards = c("Harry", "Ron", "Fred", "George", "Sirius")
ranking = c(4, 2, 5, 1, 3)
Extract the second element of wizards.
wizards[2]
## [1] "Ron"
#There are several different ways to do this. Here are two possibilities.
wizards[c(3, 4, 5)] = c("Hermione", "Ginny", "Malfoy")
wizards[3:5] = c("Hermione", "Ginny", "Malfoy")
Assign the following names to the elements of wizards: Lead, Friend, Friend, Wife, Rival. Display the result.
names(wizards) = c("Lead", "Friend", "Friend", "Wife", "Rival")
wizards
## Lead Friend Friend Wife Rival
## "Harry" "Ron" "Hermione" "Ginny" "Malfoy"
names(wizards)[5] = "Ex-Rival"
names(wizards)
## [1] "Lead" "Friend" "Friend" "Wife" "Ex-Rival"
Create three vectors called years, income and expenses.
years = c(2009, 2010, 2011, 2012)
income = c(50000, 52000, 52500, 48000)
expenses = c(35000, 34000, 38000, 40000)
Compute the savings in each year (income minus expenses) and store them in a vector called savings.
savings = income - expenses
sum(savings)
## [1] 55500
Create a vector z that lists the numbers from 12 to 23 in descending order.
z = 23:12
z
## [1] 23 22 21 20 19 18 17 16 15 14 13 12
Replace the element equal to 13 with 7 in z.
z[z == 13] = 7
z
## [1] 23 22 21 20 19 18 17 16 15 14 7 12
Store the following values in a vector called scores.
scores = c(18, 95, 76, 90, 84, 83, 80, 79, 63, 76, 55, 78, 90, 81, 88, 89, 92, 73, 83, 72, 85, 66, 77, 82, 99, 87)
mean(scores)
## [1] 78.5
median(scores)
## [1] 81.5
range(scores)
## [1] 18 99
Store the ages 21, 26, 51, 22, 160, 160, 160 in a vector called age. Next, store the names Achilles, Hector, Priam, Paris, Apollo, Athena, Aphrodite in a character vector called person. Finally, store the words Aggressive, Loyal, Regal, Cowardly, Proud, Wise, Conniving in a vector called description.
age = c(21, 26, 51, 22, 160, 160, 160)
person = c("Achilles", "Hector", "Priam", "Paris", "Apollo", "Athena", "Aphrodite")
description = c("Aggressive", "Loyal", "Regal", "Cowardly", "Proud", "Wise", "Conniving")
Create a data.table called trojan.war whose columns contain the vectors from the previous question.
trojan.war = data.table(person, age, description)
Suppose you wanted the column of trojan.war that contains each person’s description. What command would you use?
#There are many different ways to do this:
trojan.war[, 3]
## description
## 1: Aggressive
## 2: Loyal
## 3: Regal
## 4: Cowardly
## 5: Proud
## 6: Wise
## 7: Conniving
trojan.war$description
## [1] "Aggressive" "Loyal" "Regal" "Cowardly" "Proud"
## [6] "Wise" "Conniving"
trojan.war[ , "description"]
## description
## 1: Aggressive
## 2: Loyal
## 3: Regal
## 4: Cowardly
## 5: Proud
## 6: Wise
## 7: Conniving
trojan.war[["description"]]
## [1] "Aggressive" "Loyal" "Regal" "Cowardly" "Proud"
## [6] "Wise" "Conniving"
#There are several ways to do this. Here are a few:
trojan.war[c(1,2)]
## person age description
## 1: Achilles 21 Aggressive
## 2: Hector 26 Loyal
trojan.war[1:2]
## person age description
## 1: Achilles 21 Aggressive
## 2: Hector 26 Loyal
#A more advanced way that doesn't require knowing the order of the rows:
trojan.war[person %in% c("Achilles", "Hector")]
## person age description
## 1: Achilles 21 Aggressive
## 2: Hector 26 Loyal
How would you select the person and description columns for Apollo, Athena and Aphrodite only?
#There are many ways to do this. Here are a few:
trojan.war[c(5, 6, 7), c(1, 3)]
## person description
## 1: Apollo Proud
## 2: Athena Wise
## 3: Aphrodite Conniving
trojan.war[5:7, c("person", "description")]
## person description
## 1: Apollo Proud
## 2: Athena Wise
## 3: Aphrodite Conniving
#advanced method
trojan.war[person %in% c("Apollo", "Athena", "Aphrodite"),
c("person", "description")]
## person description
## 1: Apollo Proud
## 2: Athena Wise
## 3: Aphrodite Conniving
Read the file at the URL below into a data.table called titanic.
titanic = fread("http://www.ditraglia.com/econ103/titanic3.csv")
x = seq(2, 18, 2)
x
## [1] 2 4 6 8 10 12 14 16 18
prod(x)
## [1] 185794560
The column survived in the titanic data has a value of “1” to indicate that the passenger in that row survived the disaster. Display only the rows of titanic corresponding to passengers that survived.
titanic[survived == 1]
## pclass survived name
## 1: 1 1 Allen, Miss. Elisabeth Walton
## 2: 1 1 Allison, Master. Hudson Trevor
## 3: 1 1 Anderson, Mr. Harry
## 4: 1 1 Andrews, Miss. Kornelia Theodosia
## 5: 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson)
## ---
## 496: 3 1 Turkula, Mrs. (Hedwig)
## 497: 3 1 Vartanian, Mr. David
## 498: 3 1 Whabee, Mrs. George Joseph (Shawneene Abi-Saab)
## 499: 3 1 Wilkes, Mrs. James (Ellen Needs)
## 500: 3 1 Yasbeck, Mrs. Antoni (Selini Alexander)
## sex age sibsp parch ticket fare cabin embarked boat body
## 1: female 29.00 0 0 24160 211.3375 B5 S 2
## 2: male 0.92 1 2 113781 151.5500 C22 C26 S 11
## 3: male 48.00 0 0 19952 26.5500 E12 S 3
## 4: female 63.00 1 0 13502 77.9583 D7 S 10
## 5: female 53.00 2 0 11769 51.4792 C101 S D
## ---
## 496: female 63.00 0 0 4134 9.5875 S 15
## 497: male 22.00 0 0 2658 7.2250 C 13 15
## 498: female 38.00 0 0 2688 7.2292 C C
## 499: female 47.00 1 0 363272 7.0000 S
## 500: female 15.00 1 0 2659 14.4542 C
## home.dest
## 1: St Louis, MO
## 2: Montreal, PQ / Chesterville, ON
## 3: New York, NY
## 4: Hudson, NY
## 5: Bayside, Queens, NY
## ---
## 496:
## 497:
## 498:
## 499:
## 500:
The column sex in the titanic data indicates each passenger’s sex. Display only the rows of titanic corresponding to men.
titanic[sex == 'male']
## pclass survived name sex age sibsp
## 1: 1 1 Allison, Master. Hudson Trevor male 0.92 1
## 2: 1 0 Allison, Mr. Hudson Joshua Creighton male 30.00 1
## 3: 1 1 Anderson, Mr. Harry male 48.00 0
## 4: 1 0 Andrews, Mr. Thomas Jr male 39.00 0
## 5: 1 0 Artagaveytia, Mr. Ramon male 71.00 0
## ---
## 839: 3 0 Yousif, Mr. Wazli male NA 0
## 840: 3 0 Yousseff, Mr. Gerious male NA 0
## 841: 3 0 Zakarian, Mr. Mapriededer male 26.50 0
## 842: 3 0 Zakarian, Mr. Ortin male 27.00 0
## 843: 3 0 Zimmerman, Mr. Leo male 29.00 0
## parch ticket fare cabin embarked boat body
## 1: 2 113781 151.5500 C22 C26 S 11
## 2: 2 113781 151.5500 C22 C26 S 135
## 3: 0 19952 26.5500 E12 S 3
## 4: 0 112050 0.0000 A36 S
## 5: 0 PC 17609 49.5042 C 22
## ---
## 839: 0 2647 7.2250 C
## 840: 0 2627 14.4583 C
## 841: 0 2656 7.2250 C 304
## 842: 0 2670 7.2250 C
## 843: 0 315082 7.8750 S
## home.dest
## 1: Montreal, PQ / Chesterville, ON
## 2: Montreal, PQ / Chesterville, ON
## 3: New York, NY
## 4: Belfast, NI
## 5: Montevideo, Uruguay
## ---
## 839:
## 840:
## 841:
## 842:
## 843: