hist
plot
pairs
boxplot
table
summary
NA
mean
var
sd
median
quantile
IQR
max
, min
, range
, which.max
and which.min
factor
by
cut
This tutorial has three parts. In the first part, you’ll learn how to implement the summary statistics and graphics from class using R. In the second part, you’ll use what you’ve learned to explore a real-world dataset: the passenger manifest from the Titanic.
In this section we’ll work with data from the student survey. Since results for this semester haven’t been collected yet, we’ll start with data from a previous semester:
#You should have data.table installed from the first R tutorial.
# if not, run the following line of code before library:
# install.packages("data.table")
library(data.table)
survey = fread("http://www.ditraglia.com/econ103/old_survey.csv")
Adding and subtracting columns from a data.table
is done with the special assignment operator :=
, which I read as “defined as”. There are other ways to assign and remove columns from a data.table
or data.frame
, but those approaches have drawbacks; if you’re interested, see here, here and here for more background on this issue.
Let’s add a column to survey
that is the ratio of height
to handspan
:
survey[ , height_handspan_ratio := height/handspan]
As always with data.table
s, when we’re doing an operation specifying something about columns, we do so in the second argument (often referred to as simply j
) within the square brackets []
. The first argument (similarly, i
) tells R which rows we’re interested in; by leaving it blank, we’re telling R to operate on all rows.
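To make the roles of i and j concrete, here is a minimal sketch on a tiny made-up table. The name toy and its values are invented for illustration and are not part of the survey data.

```r
library(data.table)

# A small made-up table; 'toy' and its values are hypothetical
toy = data.table(height = c(60, 66, 72), handspan = c(18, 20, 24))

# j only: i is left blank, so the computation uses all rows
toy[ , mean(height)]

# i and j: keep only rows where height exceeds 62, then compute in j
toy[height > 62, mean(handspan)]

# := in j adds a column to toy itself; no reassignment is needed
toy[ , ratio := height / handspan]
names(toy)
```

Here the first call averages all three heights, while the second averages handspan over the two taller rows only.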
Removing a column from a data.table
shares basically the same syntax as adding a column – the crux is the :=
operator. We’ll exclude columns 7 through 10:
survey[ , 7:10 := NULL]
There are two new things going on in this compact line:
On the left of :=, we’re adding/subtracting more than one column. We can tell this by knowing that 7:10 is really c(7, 8, 9, 10), which means := is being done to the 7th, 8th, 9th, and 10th columns.
On the right of :=, we use NULL to indicate we’d like to delete the left-hand columns. You should read this as “Columns 7:10 are defined as NULL”, which here is equivalent to deleting them. Note that we only need to say NULL once and it applies to all four columns on the left-hand side.
Note: in general, it is not recommended to refer to columns by number, since the order might change unexpectedly. It’s safer to refer to columns by their name.
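As a sketch of the safer, name-based approach, the same := NULL syntax accepts a character vector of column names on the left. The table toy and its column names below are made up for illustration.

```r
library(data.table)

# Hypothetical table standing in for a survey-like dataset
toy = data.table(id = 1:3,
                 a = c(1, 2, 3),
                 b = c("x", "y", "z"),
                 keep = c(TRUE, FALSE, TRUE))

# Delete columns a and b by name; a single NULL applies to both
toy[ , c("a", "b") := NULL]
names(toy)
```

Because the columns are named explicitly, this line keeps working even if the column order of the table changes.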
It’s always a good idea to take a look at your data after loading it.
The functions head
and tail
, introduced in R Tutorial #1, make this easy:
head(survey)
## sex credits eye.color handedness height handspan
## 1: Male 5 Brown 1.0 67 20.0
## 2: Female 5 Brown 0.4 63 19.5
## 3: Female NA Brown 0.6 62 19.0
## 4: Female 5 Brown 0.6 65 19.5
## 5: Female 4 Brown 1.0 62 18.5
## 6: Female NA Brown 1.0 68 18.5
## height_handspan_ratio
## 1: 3.350000
## 2: 3.230769
## 3: 3.263158
## 4: 3.333333
## 5: 3.351351
## 6: 3.675676
tail(survey)
## sex credits eye.color handedness height handspan
## 1: Male 4 Hazel 1.00 73 21.0
## 2: Male 5 Brown 0.56 66 20.0
## 3: Female 5 Brown 0.80 63 18.0
## 4: Female 6 Brown 0.80 64 21.0
## 5: Female 6 Brown 1.00 61 17.5
## 6: Male 5 Brown 0.70 71 24.5
## height_handspan_ratio
## 1: 3.476190
## 2: 3.300000
## 3: 3.500000
## 4: 3.047619
## 5: 3.485714
## 6: 2.897959
hist
The command for a histogram in R is hist
. Its input has to be a vector, not a data.table
. Conveniently, we can use this function in the j
argument. For example, we can make a histogram of the column handedness
as follows:
survey[ , hist(handedness)]
The default title and axis labels aren’t very nice. We can change them using the following options:
survey[ , hist(handedness, xlab = 'Handedness Score',
main = 'Histogram of Handedness Scores',
ylab = '# of Students')]
See ?hist
or ?par
for a full list of options available during plotting; see also the exercises for a chance to explore some of these.
Recall that R doesn’t care if you use single or double quotes, as long as you’re consistent (the symbol you start with is the symbol you end with).
You can change the number of bins via the argument breaks
survey[ , hist(handedness, breaks = 20, xlab = 'Handedness Score',
main = 'Histogram of Handedness Scores')]
By default, R produces histograms in terms of frequencies. That is, it counts the number of observations in each bin. To rescale the bars so that their total area equals one (each bar’s height is then its relative frequency, that is, its proportion, divided by the bin width), we set the argument freq
to FALSE
:
survey[ , hist(handedness, breaks = 20, freq = FALSE,
xlab = 'Handedness Score', main = 'Histogram of Handedness')]
Notice that if you don’t supply a label for the y-axis, R defaults to "Frequency"
unless you set freq = FALSE
, in which case it uses "Density"
. We’ll learn about densities later in the course.
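The difference between counts, proportions, and densities can be checked numerically. The sketch below uses made-up scores and asks hist for its numbers without drawing anything, via the plot = FALSE argument.

```r
# Made-up scores, for illustration only
x = c(0.1, 0.2, 0.25, 0.6, 0.7, 0.75, 0.8, 0.9)

# plot = FALSE returns the histogram's numbers without drawing it
h = hist(x, breaks = c(0, 0.5, 1), plot = FALSE)

h$counts                          # observations per bin
h$counts / sum(h$counts)          # relative frequencies (proportions)
h$density                         # relative frequency divided by bin width
sum(h$density * diff(h$breaks))   # total area of the bars
```

With two bins of width 0.5, each density is twice the corresponding proportion, and the bar areas add up to one.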
plot
The general-purpose command for plotting in R is rather unimaginatively called plot
. This is an incredibly powerful command, but we’ll mostly use it to make simple scatter plots. To plot height
on the x-axis and handspan
on the y-axis, we use the following command:
survey[ , plot(height, handspan)]
If you change the order, you swap the axes:
survey[ , plot(handspan, height)]
You can use the same arguments for plot
as you did for hist
to change the title and axis labels and make the plot look cleaner:
survey[ , plot(height, handspan, xlab = "height (in)", ylab = "handspan (cm)")]
Plot has many different arguments (see ?plot
) that you can use to customize the appearance of your plot. We can change the color of the points using the col
argument:
survey[ , plot(height, handspan, xlab = "height (in)",
ylab = "handspan (cm)", col = "red")]
and the shape of the points using the pch
argument:
survey[ , plot(height, handspan, xlab = "height (in)",
ylab = "handspan (cm)", col = "red", pch = 3)]
See ?points
for the full list of possible choices for pch
.
We can even plot connected line segments rather than points using the type
argument, although this isn’t really useful for the current example:
survey[ , plot(height, handspan, xlab = "height (in)",
ylab = "handspan (cm)", col = "red", pch = 3, type = 'l')]
(In type = 'l', the 'l' stands for “line”.)
pairs
The function pairs
is a great way to visualize many columns of a data.table
at once. Unlike plot
, whose first two arguments need to be vectors, pairs
can take a whole data.table as its argument. It produces a kind of scatter plot on steroids, in which every column in the data.table is plotted against every other column. Let’s try it out with some of the measurements from the survey:
pairs(survey[ , c("handedness", "handspan", "height")])
We’re including the three columns "handedness"
, "handspan"
, "height"
since they’re all continuous, as opposed to some of the others.
There seems to be a strong positive relationship between handspan and height but not much of a relationship between handedness and either of the other columns.
It takes a bit of practice to be able to read a pairs
plot, but the payoff can be huge for concisely understanding the relationships present in a data set. Spend a little time looking at this one to figure it out and look at the help file (?pairs
) if needed.
boxplot
A boxplot is an alternative way to visualize the distribution of a dataset. The command in R is just what you’d expect, and you can use the same labels as above. The first argument must be a vector:
boxplot(survey$handspan, ylab = "Handspan(cm)")
The “box” in a box plot shows the middle 50% of the data. The thick line in the middle is the median, and the lines immediately below and above it are the 25th and 75th percentiles. This means that the height of the box equals the interquartile range of the data.
Traditionally, the “whiskers” in a boxplot show the maximum and minimum of the data, but R has an interesting special feature. If there are any observations that are extremely unusual (very big or very small compared to all other observations), it leaves them out when deciding where to put the whiskers and just plots them directly. Such points are called “outliers,” like the book by Malcolm Gladwell. We see two such outliers in the plot above.
In case you’re interested, R considers an outlier to be any point that is more than 1.5 times the interquartile range away from the box. You don’t have to memorize this rule.
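If you’d like to see the rule in action, the base R function boxplot.stats returns the numbers a boxplot is drawn from. The data below are made up, with one value deliberately far from the rest. Note that R builds the box from “hinges”, which can differ slightly from the quartiles reported by quantile.

```r
# Made-up data with one value far from the rest
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# boxplot.stats returns the numbers a boxplot is drawn from
b = boxplot.stats(x)

b$stats   # lower whisker, bottom of box, median, top of box, upper whisker
b$out     # points plotted individually as outliers

# Reproduce the rule by hand from the edges of the box
box = b$stats[c(2, 4)]
fence = 1.5 * diff(box)
x[x < box[1] - fence | x > box[2] + fence]
```

Here the box runs from 3 to 8, so anything above 8 + 1.5 * 5 = 15.5 is flagged, and the upper whisker stops at 9, the largest value that is not an outlier.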
One of the main advantages of boxplots over histograms is that they are simple enough to plot side-by-side. This lets us see how the distribution of a numerical variable is related to a categorical variable. For example, we could see how the distribution of handspan differs for men and women as follows:
boxplot(handspan ~ sex, data = survey,
ylab= "Handspan (cm)", main = "Handspan by Sex")
The syntax used above is different from what we’ve seen so far but since it will appear again when we look at regression it’s worth spending some time to explain. The syntax handspan ~ sex
indicates that we want handspan
as a function of sex
. The argument data = survey
tells R where to find the variables handspan
and sex
: they’re stored as columns in the data.table
survey.
table
The command table
is used for making cross-tabs, aka two-way tables, a useful way of summarizing the relationship between two categorical variables. Both arguments of table
must be vectors. For example, we can make a cross-tab of eye color and sex as follows:
survey[ , table(eye.color, sex)]
## sex
## eye.color Female Male
## Black 2 5
## Blue 4 6
## Brown 32 26
## Copper 0 1
## Green 1 4
## Hazel 2 2
## Maroon 0 1
You can also use table
to make a one-way crosstab:
survey[ , table(eye.color)]
## eye.color
## Black Blue Brown Copper Green Hazel Maroon
## 7 10 59 1 5 4 1
If you want the row/column totals to be printed alongside your table, the approach in R is a little clumsy: you first need to give your table a name and store it, then use the function addmargins
:
my.table = survey[ , table(eye.color, sex)]
addmargins(my.table)
## sex
## eye.color Female Male Sum
## Black 2 5 7
## Blue 4 6 10
## Brown 32 26 58
## Copper 0 1 1
## Green 1 4 5
## Hazel 2 2 4
## Maroon 0 1 1
## Sum 41 45 86
To convert a cross-tab in R into percents rather than counts, we use the function prop.table
. This is a little clumsy as well. First, store the original table, then pass it as the argument to prop.table
as you did with addmargins:
my.table = survey[ , table(eye.color, sex)]
prop.table(my.table)
## sex
## eye.color Female Male
## Black 0.02325581 0.05813953
## Blue 0.04651163 0.06976744
## Brown 0.37209302 0.30232558
## Copper 0.00000000 0.01162791
## Green 0.01162791 0.04651163
## Hazel 0.02325581 0.02325581
## Maroon 0.00000000 0.01162791
Now you have a table in which percents are expressed as decimals (i.e., proportions). For example, the value of 0.37209 in the row Brown
and the column Female
indicates that about 37% of the people in the dataset are women who have brown eyes.
To get ordinary percents, we multiply by 100:
100 * prop.table(my.table)
## sex
## eye.color Female Male
## Black 2.325581 5.813953
## Blue 4.651163 6.976744
## Brown 37.209302 30.232558
## Copper 0.000000 1.162791
## Green 1.162791 4.651163
## Hazel 2.325581 2.325581
## Maroon 0.000000 1.162791
To make this cleaner we can round the result to a desired number of decimal places:
round(100 * prop.table(my.table), digits = 1)
## sex
## eye.color Female Male
## Black 2.3 5.8
## Blue 4.7 7.0
## Brown 37.2 30.2
## Copper 0.0 1.2
## Green 1.2 4.7
## Hazel 2.3 2.3
## Maroon 0.0 1.2
round(100 * prop.table(my.table), digits = 0)
## sex
## eye.color Female Male
## Black 2 6
## Blue 5 7
## Brown 37 30
## Copper 0 1
## Green 1 5
## Hazel 2 2
## Maroon 0 1
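By default prop.table divides every cell by the grand total. Its margin argument instead computes proportions within each row (margin = 1) or each column (margin = 2). The sketch below uses a small made-up table rather than the survey data.

```r
# Made-up categorical data, for illustration only
eye = c("Brown", "Brown", "Blue", "Blue", "Brown", "Green")
sex = c("F", "M", "F", "M", "F", "M")

tab = table(eye, sex)

# Proportions of the grand total (the default)
prop.table(tab)

# Proportions within each column: each sex's column sums to 1
col_props = prop.table(tab, margin = 2)
col_props
colSums(col_props)
```

Column proportions answer a different question from the default: “among women, what share have brown eyes?” rather than “what share of everyone are brown-eyed women?”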
You’ve met several of the summary statistics from class in R Tutorial #1. I’ll remind you of them here and introduce some others.
summary
This command takes a whole data.table
as its argument, unlike the other commands for summary statistics. For each numeric column, it gives the mean along with the five-number summary. It also indicates if there are any missing observations (NA
’s – more on this below) in the columns:
summary(survey)
## sex credits eye.color handedness
## Length:89 Min. :4.000 Length:89 Min. :-0.8800
## Class :character 1st Qu.:5.000 Class :character 1st Qu.: 0.6000
## Mode :character Median :5.000 Mode :character Median : 0.7317
## Mean :5.012 Mean : 0.6579
## 3rd Qu.:5.000 3rd Qu.: 0.8955
## Max. :6.500 Max. : 1.0000
## NA's :3 NA's :1
## height handspan height_handspan_ratio
## Min. :58.00 Min. :14.0 Min. :2.739
## 1st Qu.:64.00 1st Qu.:19.0 1st Qu.:3.091
## Median :67.00 Median :20.5 Median :3.289
## Mean :67.55 Mean :20.6 Mean :3.304
## 3rd Qu.:71.00 3rd Qu.:22.0 3rd Qu.:3.455
## Max. :78.00 Max. :27.0 Max. :4.643
## NA's :1 NA's :1 NA's :2
NA
One ubiquitous fact of real-world data is missingness. For a gallimaufry of reasons, we’re unable to get correct measurements/information about all variables that we observe in all cases.
Being a language for statistics, R handles missing data with aplomb, representing it with the special value NA
. However, any user must understand that trying to perform calculations with missing observations may lead to unexpected outcomes, e.g.:
sum(survey$height)
## [1] NA
Any sum involving a missing observation is missing.
To tell R to ignore missing observations, set the argument na.rm
to TRUE
. This will work with pretty much any of the commands you know for performing mathematical operations on a vector:
sum(survey$height, na.rm = TRUE)
## [1] 5944
mean
Calculates the sample mean of a numeric vector
mean(survey$height, na.rm = TRUE)
## [1] 67.54545
Note that the effective sample size decreases by the number of missing observations!!
Compare:
#3 observations
mean(1:3)
## [1] 2
#3 observations, but one missing observation --
# so the denominator in the mean is 2
mean(c(1, 2, NA), na.rm = TRUE)
## [1] 1.5
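To see how many observations a calculation actually uses, you can count the missing values directly. This sketch uses a made-up vector and the function is.na, which flags missing elements.

```r
x = c(1, 2, NA, 3, NA, 4)   # made-up vector with two missing values

n_missing = sum(is.na(x))   # TRUE counts as 1, so this counts the NAs
n_used = length(x) - n_missing

n_missing
n_used

# Dividing the NA-free sum by n_used matches mean(x, na.rm = TRUE)
sum(x, na.rm = TRUE) / n_used
mean(x, na.rm = TRUE)
```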
var
Calculates the sample variance of a numeric vector
var(survey$height, na.rm = TRUE)
## [1] 19.74504
sd
Calculates the sample standard deviation of a numeric vector
sd(survey$height, na.rm = TRUE)
## [1] 4.443539
This is identical to the (positive) square root of the sample variance:
sqrt(var(survey$height, na.rm = TRUE))
## [1] 4.443539
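As a check on what var and sd compute, the sketch below reproduces them by hand on a made-up vector. The sample variance divides the sum of squared deviations by n - 1 rather than n.

```r
x = c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data with mean 5

n = length(x)
by_hand = sum((x - mean(x))^2) / (n - 1)   # sample variance, n - 1 denominator

by_hand
var(x)          # agrees with the hand calculation
sqrt(by_hand)
sd(x)           # agrees with the square root of the variance
```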
median
Calculates the sample median of a numeric vector
median(survey$height, na.rm = TRUE)
## [1] 67
quantile
This function calculates sample quantiles, aka percentiles, of a numeric vector. If you simply pass it a numeric vector with no other arguments, it will give you the five-number summary:
quantile(survey$height, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 58 64 67 71 78
You can ask for a specific quantile by using the argument probs:
quantile(survey$height, na.rm = TRUE, probs = 0.3)
## 30%
## 65
Note that quantiles are specified as probabilities so the preceding command gives the 30th percentile. You can also ask for multiple quantiles at once:
quantile(survey$height, na.rm = TRUE, probs = c(0.1, 0.3, 0.7, 0.9))
## 10% 30% 70% 90%
## 62.0 65.0 70.0 73.3
IQR
Calculates the interquartile range of a numeric vector (the 75th percentile minus the 25th percentile)
IQR(survey$height, na.rm = TRUE)
## [1] 7
We could also use the quantile
function to produce this like so:
x = quantile(survey$height, na.rm = TRUE, probs = c(.25, .75))
x[2] - x[1]
## 75%
## 7
max
, min
, range
, which.max
and which.min
max
and min
do exactly what they say:
max(survey$height, na.rm = TRUE)
## [1] 78
min(survey$height, na.rm = TRUE)
## [1] 58
and can be used to calculate the range
max(survey$height, na.rm = TRUE) - min(survey$height, na.rm = TRUE)
## [1] 20
To get both the maximum and minimum at once, you can use the function range
:
range(survey$height, na.rm = TRUE)
## [1] 58 78
Note that this returns the minimum and maximum themselves; it does not compute the summary statistic called the range, which is their difference.
Sometimes we’re interested not just in what the max is, but who obtained the max. For example, we might want to know some other information about the tallest and shortest people in the survey. which.max
and which.min
return the row numbers of the maximum and minimum, respectively; we can use these to subset the data and display the full corresponding rows like so:
which.max(survey$height)
## [1] 67
survey[which.max(height)]
## sex credits eye.color handedness height handspan height_handspan_ratio
## 1: Male 5 Green 0.894 78 24 3.25
survey[which.min(height)]
## sex credits eye.color handedness height handspan height_handspan_ratio
## 1: Male 5 Brown 0.6 58 21 2.761905
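One caveat worth knowing: when several observations tie for the maximum, which.max reports only the first. A minimal sketch with made-up heights shows one way to find all of them, using which together with max.

```r
heights = c(64, 78, 70, 78, 58)   # made-up heights with a tie for the maximum

which.max(heights)                # position of the first maximum only
which(heights == max(heights))    # positions of all maxima
```

The same pattern with min in place of max finds every observation tied for the minimum.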
factor
R has a somewhat peculiar way of handling categorical variables. It does so by using what are called factor
variables. First, an example:
school = c("SAS", "Wharton", "Wharton", "SEAS", "SAS", "Nursing", "SAS")
school_factor = factor(school)
class(school)
## [1] "character"
class(school_factor)
## [1] "factor"
school
## [1] "SAS" "Wharton" "Wharton" "SEAS" "SAS" "Nursing" "SAS"
school_factor
## [1] SAS Wharton Wharton SEAS SAS Nursing SAS
## Levels: Nursing SAS SEAS Wharton
Note that when we pass the school
character
vector through the factor
function, the output, which I’ve called school_factor
, is now a "factor"
. Note the subtle difference in how school
and school_factor
are printed; the factor
is most easily distinguished by the extra line starting with Levels:
directly below the vector itself.
What’s going on here? First, one more thing to observe: the dput
output of school_factor
. dput
is an excellent tool to have in your repertoire for dealing with more complicated/unknown objects in R. It simply prints out the most fundamental representation of the object that’s possible; for the things we’ve seen so far, this would not be very illuminating (although you should try looking at the dput
output of a data.table
on your own). What happens when we dput
a factor
?
dput(school_factor)
## structure(c(2L, 4L, 4L, 3L, 2L, 1L, 2L), .Label = c("Nursing",
## "SAS", "SEAS", "Wharton"), class = "factor")
Recall that school
has 7 elements. The core of what we see here is that underneath the hood, the thing that has 7 elements in school_factor
is now a vector of integers: c(2L, 4L, 4L, 3L, 2L, 1L, 2L)
.
A factor
in R is a labelled integer vector. The “labels” are actually called “levels”, as we can see from the levels
function:
levels(school_factor)
## [1] "Nursing" "SAS" "SEAS" "Wharton"
Note that the levels are in alphabetical order. The integer vector tells which label corresponds to a given element – for example, the first 2L
corresponds to "SAS"
, which is the second element of school_factor
’s levels – levels(school_factor)[2]
.
There are some historical reasons why categorical variables are handled like this; interested souls can read more here. Most of the time, you’ll be well-served just to keep variables as character
instead of factor
; the primary limitation is that categorical variables stored as character
must have their levels in alphabetical order, whereas there are ways to re-order the levels of a factor
variable to be in whatever order you’d like, which is often very helpful in regression settings.
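One way to choose the order of the levels is to pass the levels argument when creating the factor. The sketch below reuses the school vector from above; the particular order is arbitrary and chosen only for illustration.

```r
school = c("SAS", "Wharton", "Wharton", "SEAS", "SAS", "Nursing", "SAS")

# The order given to 'levels' is arbitrary, chosen only for illustration
school_ordered = factor(school, levels = c("Wharton", "SAS", "SEAS", "Nursing"))

levels(school_ordered)
as.integer(school_ordered)   # the underlying integer codes follow the new order
table(school_ordered)        # tables are printed in level order
```

Compare the integer codes with the dput output above: "SAS" is still the second level, but "Wharton" is now coded 1 instead of 4.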
Note that eye.color
is currently stored as a character
variable. How can we convert it to a factor
? We have to use :=
again:
summary(survey$eye.color)
## Length Class Mode
## 89 character character
survey[ , eye.color := factor(eye.color)]
summary(survey$eye.color)
## Black Blue Brown Copper Green Hazel Maroon NA's
## 7 10 59 1 5 4 1 2
Note what’s happening in the :=
assignment line – we’re overwriting eye.color
with a factor
version of itself.
by
Often, we want to calculate summary statistics for a data.table
broken down by the values of a categorical variable. data.table
has a very simple way to perform such group-wise summaries – the by
argument inside []
; let’s see some examples.
First we’ll use it to compare the average height of men in the class to that of women:
survey[ , mean(height, na.rm = TRUE), by = sex]
## sex V1
## 1: Male 70.37778
## 2: Female 64.50000
## 3: NA 68.00000
If we wanted to compare the variance of height by sex, we would just change mean
to var
survey[ , var(height, na.rm = TRUE), by = sex]
## sex V1
## 1: Male 14.240404
## 2: Female 8.304878
## 3: NA NA
Note that the name of the outcome variable was automatically chosen as V1
in both cases, which might make it hard to distinguish. If we want the name to be more meaningful, we use the following syntax:
survey[ , .(variance = var(height, na.rm = TRUE)), by = sex]
## sex variance
## 1: Male 14.240404
## 2: Female 8.304878
## 3: NA NA
We 1) wrap the computation in the j
argument with .()
and 2) assign the computation in j
to a name, here var(height, na.rm = TRUE)
is being assigned to the name variance
.
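The same .() wrapper can hold several named computations at once, giving one output column per name. This is a sketch on a made-up table; toy, avg_height and n_measured are names invented for illustration.

```r
library(data.table)

# Made-up table with one missing height
toy = data.table(sex = c("Male", "Female", "Male", "Female"),
                 height = c(70, 64, 72, NA))

# Two named summaries per group: the mean and the number of non-missing heights
res = toy[ , .(avg_height = mean(height, na.rm = TRUE),
               n_measured = sum(!is.na(height))),
           by = sex]
res
```

Reporting the number of non-missing observations alongside each mean makes it easy to spot groups whose average rests on very few measurements.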
Now, suppose we wanted to compare average height by eye color. To do this, simply change the by
variable to eye.color
:
survey[ , mean(height, na.rm = TRUE), by = eye.color]
## eye.color V1
## 1: Brown 66.77966
## 2: Blue 69.00000
## 3: Black 68.71429
## 4: Hazel 67.75000
## 5: Green 71.50000
## 6: NA 64.00000
## 7: Maroon 74.00000
## 8: Copper 74.00000
It’s not much harder to find summaries grouped by two or more variables. Suppose we wanted to calculate average height by sex and credits. We simply surround the argument to by
with .()
, just like above when we wanted to name the output:
survey[ , .(avg_height = mean(height, na.rm = TRUE)), by = .(sex, credits)]
## sex credits avg_height
## 1: Male 5.0 70.00000
## 2: Female 5.0 64.79167
## 3: Female NA 64.66667
## 4: Female 4.0 65.00000
## 5: Female 6.0 63.33333
## 6: Male 4.0 71.11111
## 7: Male 5.5 69.60000
## 8: Female 6.5 65.00000
## 9: Male 6.0 71.00000
## 10: Male 4.5 72.00000
## 11: NA 5.5 68.00000
This contains all the information we want, but it’s not displayed in a particularly convenient format. First, we may want to ignore the students for whom sex or credits is missing. To do so, we use the function is.na and the logical operators & and !.
The function is.na
performed on a vector returns whether each element is missing or not:
x = c(1, 2, NA, 3, NA, 4)
is.na(x)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
!
is the negation operator – it turns TRUE
to FALSE
and vice-versa:
!is.na(x)
## [1] TRUE TRUE FALSE TRUE FALSE TRUE
&
is the element-wise AND operator – logical1 & logical2
is TRUE
only on those elements for which both logical1
and logical2
are TRUE
:
y = c(NA, 1, NA, 2, 3, NA)
is.na(y)
## [1] TRUE FALSE TRUE FALSE FALSE TRUE
!is.na(y)
## [1] FALSE TRUE FALSE TRUE TRUE FALSE
!is.na(x) & !is.na(y)
## [1] FALSE TRUE FALSE TRUE FALSE FALSE
So excluding observations for which sex or credits is missing can be accomplished like so:
survey[!is.na(sex) & !is.na(credits),
.(avg_height = mean(height, na.rm = TRUE)),
by = .(sex, credits)]
## sex credits avg_height
## 1: Male 5.0 70.00000
## 2: Female 5.0 64.79167
## 3: Female 4.0 65.00000
## 4: Female 6.0 63.33333
## 5: Male 4.0 71.11111
## 6: Male 5.5 69.60000
## 7: Female 6.5 65.00000
## 8: Male 6.0 71.00000
## 9: Male 4.5 72.00000
Lastly, it would be nice to have the results in better order; to do this, we use the argument keyby
instead of by
. keyby
is exactly the same as by
, except that it sorts the output:
survey[!is.na(sex) & !is.na(credits),
.(avg_height = mean(height, na.rm = TRUE)),
keyby = .(sex, credits)]
## sex credits avg_height
## 1: Female 4.0 65.00000
## 2: Female 5.0 64.79167
## 3: Female 6.0 63.33333
## 4: Female 6.5 65.00000
## 5: Male 4.0 71.11111
## 6: Male 4.5 72.00000
## 7: Male 5.0 70.00000
## 8: Male 5.5 69.60000
## 9: Male 6.0 71.00000
cut
In class we talked about the idea of putting numerical data into “bins” to make a histogram. More generally, it’s sometimes necessary to convert numerical data into categorical data. A classic example is converting scores from a course into letter grades. You can do this using the command cut
. (In fact, this is the function I’ll use to determine your grades at the end of the semester!) First I’ll create some fake grade data for us to play with:
grades = c(67, 93, 85, 82, 88, 86, 78, 97, 74, 77, 81)
The cut
function is a little tricky, so it’s worth opening the help file (?cut
) to consult as you read this. Its first argument is a numeric vector: the vector we want to “cut”. Its second argument gives the “cut points.” Let’s try this out with some simple grade boundaries:
cut(grades, c(60, 70, 80, 90, 100))
## [1] (60,70] (90,100] (80,90] (80,90] (80,90] (80,90] (70,80]
## [8] (90,100] (70,80] (70,80] (80,90]
## Levels: (60,70] (70,80] (80,90] (90,100]
cut
has created a factor
, which we can tell because of the line "Levels: (60,70] ..."
that appears in the output. Notice that by default the intervals used in cut are “open on the left, closed on the right.” You can reverse this by setting right = FALSE
cut(grades, c(60, 70, 80, 90, 100), right = FALSE)
## [1] [60,70) [90,100) [80,90) [80,90) [80,90) [80,90) [70,80)
## [8] [90,100) [70,80) [70,80) [80,90)
## Levels: [60,70) [70,80) [80,90) [90,100)
This is what we’d use for grade cut-offs since they take the form “X or higher.”
The last step is to add an additional argument to cut
that gives the factor levels meaningful names: in this case, letter grades:
cut(grades, c(60, 70, 80, 90, 100),
labels = c("D", "C", "B", "A"), right = FALSE)
## [1] D A B B B B C A C C B
## Levels: D C B A
Notice that if you used \(n\) breakpoints, you need \(n-1\) labels since they correspond to the “gaps” between the breaks.
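One edge case is worth checking before relying on these cutoffs: with right = FALSE the top interval is [90,100), so a score of exactly 100 falls outside every bin. The sketch below uses made-up scores and the include.lowest argument of cut, which for right = FALSE closes the last interval; the names cutoffs and letter are invented for illustration.

```r
grades = c(59, 60, 89.9, 90, 100)   # made-up scores, including the edge cases

cutoffs = c(60, 70, 80, 90, 100)
letter = c("D", "C", "B", "A")

# With right = FALSE alone, the top interval is [90,100), so 100 becomes NA
cut(grades, cutoffs, labels = letter, right = FALSE)

# include.lowest = TRUE closes the last interval, making it [90,100]
cut(grades, cutoffs, labels = letter, right = FALSE, include.lowest = TRUE)
```

In both calls the score of 59 is NA, since it lies below the lowest cut point.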
Now it’s your turn: using what you’ve learned above, you’ll take a closer look at the titanic dataset introduced in R Tutorial #1. You already know enough R to answer some interesting questions, so let’s dive in!
Read the documentation file for the titanic dataset so that you know what each column in the data represents: http://www.ditraglia.com/econ103/titanic3.txt
Download the data, store it in a data.table
called titanic
, and display the first few rows.
The URL where the data is stored is:
http://www.ditraglia.com/econ103/titanic3.csv
We’ll only be using the following columns of titanic for this example: pclass
, survived
, sex
, age
, and fare
. Delete the other columns from the data set. (Note that I did the same thing above with the survey
data.) Display the first few rows to make sure your command worked.
Convert the variables sex
and pclass
to factor
.
Using summary to get an overview of the dataset, answer the following questions:
Make a boxplot of fare by pclass and interpret your results. Hint: consider the horizontal and las arguments. See ?boxplot and ?par.
Who paid more, on average, for passage on the titanic: men or women?
Create a table of mean fares broken down by sex
and pclass
. How does this relate to your answer to the preceding question?
How did the fraction of survivors vary by sex
and pclass
? Can you think of a possible explanation for the pattern you see in the data?
Add to titanic a categorical variable called age.group, created from the numerical variable age using the following cutoffs (http://www.statcan.gc.ca/concepts/definitions/age2-eng.htm):
How did the fraction of survivors vary by sex, pclass and age.group? [Hint: by = .() can accept more than two arguments] What do your results suggest? Is there anything wrong with the way we defined the "Senior" age group in the previous part?