This tutorial has three parts. In the first part, you’ll learn how to implement the summary statistics and graphics from class using R. In the second part, you’ll use what you’ve learned to explore a real-world dataset: the passenger manifest from the Titanic.

Summary Stats and Graphics

Load the Data

In this section we’ll work with data from the student survey. Since results for this semester haven’t been collected yet, we’ll start with data from a previous semester:

#You should have data.table installed from the first R tutorial.
#  if not, run the following line of code before library:
#  install.packages("data.table")
library(data.table)
survey = fread("http://www.ditraglia.com/econ103/old_survey.csv")

Adding Columns

Adding and subtracting columns from a data.table is done with the special assignment operator :=, which I read as “defined as”. There are other ways to assign and remove columns from a data.table or data.frame, but there are issues with this approach; if you’re interested, see here, here and here for more background on this issue.

Let’s add a column to survey that is the ratio of height to handspan:

survey[ , height_handspan_ratio := height/handspan]

As always with data.tables, when we’re doing an operation specifying something about columns, we do so in the second argument (often referred to as simply j) within the square brackets []. The first argument (similarly, i) tells R which rows we’re interested in; by leaving it blank, we’re telling R to operate on all rows.

Removing Columns

Removing a column from a data.table shares basically the same syntax as adding a columns – the crux is the := operator. We’ll exclude columns 7 through 10:

survey[ , 7:10 := NULL]

There are two new things going on in this compact line:

  • On the left-hand side of :=, we’re adding/subtract more than one column. We can tell this by knowing that 7:10 is really c(7, 8, 9, 10), which means := is being done to the 7th, 8th, 9th, and 10th columns.
  • On the right-hand side of :=, we use NULL to indicated we’d like to delete the left-hand columns. You should read this as “Columns 7:10 are defined as NULL”, which here is equivalent to deleting them. Note that we only need to say NULL once and it applies to all four columns on the right-hand side.

Note: in general, it is not recommended to refer to columns by number, since the order might change unexpectedly. It’s safer to refer to columns by their name.

Previewing Data

It’s always a good idea to take a look at your data after loading it.

The functions head and tail, introduced in R Tutorial #1, make this easy:

head(survey)
##       sex credits eye.color handedness height handspan
## 1:   Male       5     Brown        1.0     67     20.0
## 2: Female       5     Brown        0.4     63     19.5
## 3: Female      NA     Brown        0.6     62     19.0
## 4: Female       5     Brown        0.6     65     19.5
## 5: Female       4     Brown        1.0     62     18.5
## 6: Female      NA     Brown        1.0     68     18.5
##    height_handspan_ratio
## 1:              3.350000
## 2:              3.230769
## 3:              3.263158
## 4:              3.333333
## 5:              3.351351
## 6:              3.675676
tail(survey)
##       sex credits eye.color handedness height handspan
## 1:   Male       4     Hazel       1.00     73     21.0
## 2:   Male       5     Brown       0.56     66     20.0
## 3: Female       5     Brown       0.80     63     18.0
## 4: Female       6     Brown       0.80     64     21.0
## 5: Female       6     Brown       1.00     61     17.5
## 6:   Male       5     Brown       0.70     71     24.5
##    height_handspan_ratio
## 1:              3.476190
## 2:              3.300000
## 3:              3.500000
## 4:              3.047619
## 5:              3.485714
## 6:              2.897959

Histograms: hist

The command for a histogram in R is hist. Its input has to be a vector, not a data.table. Conveniently, we can use this function in the j argument. For example, we can make a histogram of the column handedness as follows:

survey[ , hist(handedness)]

The default title and axis labels aren’t very nice. We can change them using the following options:

survey[ , hist(handedness, xlab = 'Handedness Score',
               main = 'Histogram of Handedness Scores',
               ylab = '# of Students')]

See ?hist or ?par for a full list of options available during plotting; see also the exercises for a chance to explore some of these.

Recall that R doesn’t care if you use single or double quotes, as long are you’re consistent (the symbol you start with is the symbol you end with).

You can change the number of bins via the argument breaks

survey[ , hist(handedness, breaks = 20, xlab = 'Handedness Score',
               main = 'Histogram of Handedness Scores')]

By default, R produces histograms in terms of frequencies. That is, it counts the numbers of observations in each bin. To change this to relative frequencies, that is, proportions, we set the argument freq to FALSE:

survey[ , hist(handedness, breaks = 20, freq = FALSE,
               xlab = 'Handedness Score', main = 'Histogram of Handedness')]

Notice that if you don’t supply a label for the y-axis, R defaults to "Frequency" unless you set freq = FALSE, in which case it uses "Density". We’ll learn about densities later in the course.

Scatter/Line Plots: plot

The general-purpose command for plotting in R is rather unimaginatively called plot. This is an incredibly powerful command, but we’ll mostly use it to make simple scatter plots. To plot height on the the x-axis and handspan on the y-axis, we use the following command:

survey[ , plot(height, handspan)]

If you change the order, you reverse the plot:

survey[ , plot(handspan, height)]

You can use the same arguments for plot as you did for hist to change the title and axis labels and make the plot look cleaner:

survey[ , plot(height, handspan, xlab = "height (in)", ylab = "handspan (cm)")]

Plot has many different arguments (see ?plot) that you can use to customize the appearance of your plot. We can change the color of the points using the col argument:

survey[ , plot(height, handspan, xlab = "height (in)",
               ylab = "handspan (cm)", col = "red")]

the shape of the points using the pch argument:

survey[ , plot(height, handspan, xlab = "height (in)",
               ylab = "handspan (cm)", col = "red", pch = 3)]

See ?points for the full list of possible choices for pch.

We can even plot connected line segments rather than points using the type argument, although this isn’t really useful for the current example:

survey[ , plot(height, handspan, xlab = "height (in)",
               ylab = "handspan (cm)", col = "red", pch = 3, type = 'l')]

(in type = 'l', 'l' means _l_ine)

Scatterplot Matrix: pairs

pairs

The function pairs is a great way to visualize many columns of a data.table at once. Unlike plot, whose two first arguments need to be vectors, pairs can take a whole data.table as its argument. It produces a kind of scatter plot on steroids, in which every column in the dataframe is plotted against every other column. Let’s try it out with some of the measurements from the survey:

pairs(survey[ , c("handedness", "handspan", "height")])

We’re including the three columns "handedness", "handspan", "height" since they’re all continuous, as opposed to some of the others.

There seems to be a strong positive relationship between handspan and height but not much of a relationship between handeness and either of the other columns.

It takes a bit of practice to be able to read a pairs plot, but the payoff can be huge for concisely understanding the relationships present in a data set. Spend a little time looking at this one to figure it out and look at the help file (?pairs) if needed.

Box Plots: boxplot

A boxplot is an alternative way to visualize the distribution of a dataset. The command in R is just what you’d expect, and you can use the same labels as above. The first argument must be a vector:

boxplot(survey$handspan, ylab = "Handspan(cm)")

The “box” in a box plot shows the middle 50% of the data. The thick line in the middle is the median, and the lines immediately above and below are the 25th and 75th percentiles. This means that the width of the box equals the interquartile range of the data.

Traditionally, the “whiskers” in a boxplot show the maximum and minimum of the data, but R has an interesting special feature. If there are any observations that are extremely unusual (very big or very small compared to all other observations), it leaves them out when deciding where to put the whiskers and just plots them directly. Such points are called “outliers,” like the book by Malcom Gladwell. We see two such outliers in the plot above.

In case you’re interested, R considers an outlier to be any point that is more than 1.5 times the interquartile range away from the box. You don’t have to memorize this rule.

One of the main advantages of boxplots over histograms is that they are simple enough to plot side-by-side. This lets us see how the distribution of a numerical variable is related to a categorical variable. For example, we could see how the distribution of handspan differs for men and women as follows:

boxplot(handspan ~ sex, data = survey, 
        ylab= "Handspan (cm)", main = "Handspan by Sex")

The syntax used above is different from what we’ve seen so far but since it will appear again when we look at regression it’s worth spending some time to explain. The syntax handspan ~ sex indicates that we want handspan as a function of sex. The argument data = survey tells R where to find the variables handspan and sex: they’re stored as columns in the data.table survey.

One- and Two-way Tables: table

The command table is used for making cross-tabs, aka two-way tables, a useful way of summarizing the relationship between two categorical variables. Both arguments of table must be vectors. For example, we can make a cross-tab of eye color and sex as follows:

survey[ , table(eye.color, sex)]
##          sex
## eye.color Female Male
##    Black       2    5
##    Blue        4    6
##    Brown      32   26
##    Copper      0    1
##    Green       1    4
##    Hazel       2    2
##    Maroon      0    1

You can also use table to make a one-way crosstab:

survey[ , table(eye.color)]
## eye.color
##  Black   Blue  Brown Copper  Green  Hazel Maroon 
##      7     10     59      1      5      4      1

If you want the row/column totals to be printed alongside your table, the approach in R is a little clumsy: you first need to give your table a name and store it, then use the function addmargins:

my.table = survey[ , table(eye.color, sex)]
addmargins(my.table)
##          sex
## eye.color Female Male Sum
##    Black       2    5   7
##    Blue        4    6  10
##    Brown      32   26  58
##    Copper      0    1   1
##    Green       1    4   5
##    Hazel       2    2   4
##    Maroon      0    1   1
##    Sum        41   45  86

To convert a cross-tab in R into percents rather than counts, we use the function prop.table. This is a little clumsy as well. First, store the original table, then pass it as the argument to prop.table as you did with addmargins

my.table = survey[ , table(eye.color, sex)]
prop.table(my.table)
##          sex
## eye.color     Female       Male
##    Black  0.02325581 0.05813953
##    Blue   0.04651163 0.06976744
##    Brown  0.37209302 0.30232558
##    Copper 0.00000000 0.01162791
##    Green  0.01162791 0.04651163
##    Hazel  0.02325581 0.02325581
##    Maroon 0.00000000 0.01162791

Now you have a table in which percents are expressed as decimals (i.e., proportions). For example, the value of 0.37209 in the row Brown and the column Female indicates that about 37% of the people in the dataset are women who have brown eyes.

To get ordinary percents, we multiply by 100:

100 * prop.table(my.table)
##          sex
## eye.color    Female      Male
##    Black   2.325581  5.813953
##    Blue    4.651163  6.976744
##    Brown  37.209302 30.232558
##    Copper  0.000000  1.162791
##    Green   1.162791  4.651163
##    Hazel   2.325581  2.325581
##    Maroon  0.000000  1.162791

To make this cleaner we can round the result to a desired number of decimal places:

round(100 * prop.table(my.table), digits = 1)
##          sex
## eye.color Female Male
##    Black     2.3  5.8
##    Blue      4.7  7.0
##    Brown    37.2 30.2
##    Copper    0.0  1.2
##    Green     1.2  4.7
##    Hazel     2.3  2.3
##    Maroon    0.0  1.2
round(100 * prop.table(my.table), digits = 0)
##          sex
## eye.color Female Male
##    Black       2    6
##    Blue        5    7
##    Brown      37   30
##    Copper      0    1
##    Green       1    5
##    Hazel       2    2
##    Maroon      0    1

You’ve met several of the summary statistics from class in R Tutorial #1. I’ll remind you of them here and introduce some others.

summary

This command takes a whole data.table as its argument, unlike the other commands for summary statistics. It gives us the mean of each column along with the five-number summary. It also indicates if there are any missing observations (NA’s – more on this below) in the columns:

summary(survey)
##      sex               credits       eye.color           handedness     
##  Length:89          Min.   :4.000   Length:89          Min.   :-0.8800  
##  Class :character   1st Qu.:5.000   Class :character   1st Qu.: 0.6000  
##  Mode  :character   Median :5.000   Mode  :character   Median : 0.7317  
##                     Mean   :5.012                      Mean   : 0.6579  
##                     3rd Qu.:5.000                      3rd Qu.: 0.8955  
##                     Max.   :6.500                      Max.   : 1.0000  
##                     NA's   :3                          NA's   :1        
##      height         handspan    height_handspan_ratio
##  Min.   :58.00   Min.   :14.0   Min.   :2.739        
##  1st Qu.:64.00   1st Qu.:19.0   1st Qu.:3.091        
##  Median :67.00   Median :20.5   Median :3.289        
##  Mean   :67.55   Mean   :20.6   Mean   :3.304        
##  3rd Qu.:71.00   3rd Qu.:22.0   3rd Qu.:3.455        
##  Max.   :78.00   Max.   :27.0   Max.   :4.643        
##  NA's   :1       NA's   :1      NA's   :2

Missing Data: NA

One ubiquitous fact of real-world data is missingness. For a gallimaufry of reasons, we’re unable to get correct measurements/information about all variables that we observe in all cases.

Being a language for statistics, R handles this missing data with aplomb by handling it under the appelation NA. However, any user must understand that trying to perform calculations with missing observations may lead to unexpected outcomes, e.g.:

sum(survey$height)
## [1] NA

Any sum involving a missing observation is missing.

To tell R to ignore missing observations, set the argument na.rm to TRUE. This will work with pretty much any of the commands you know for performing mathematical operations on a vector:

sum(survey$height, na.rm = TRUE)
## [1] 5944

Average: mean

Calculates the sample mean of a numeric vector

mean(survey$height, na.rm = TRUE)
## [1] 67.54545

Note that the effective sample size decreases by the number of missing observations!!

Compare:

#3 observations
mean(1:3)
## [1] 2
#3 observations, but one missing observation --
#  so the denominator in the mean is 2
mean(c(1, 2, NA), na.rm = TRUE)
## [1] 1.5

Variance: var

Calculates the sample variance of a numeric vector

var(survey$height, na.rm = TRUE)
## [1] 19.74504

Standard Deviation: sd

Calculates the sample standard deviation of a numeric vector

sd(survey$height, na.rm = TRUE)
## [1] 4.443539

This is identical to the (positive) square root of the sample variance:

sqrt(var(survey$height, na.rm = TRUE))
## [1] 4.443539

Median: median

Calculates the sample median of a numeric vector

median(survey$height, na.rm = TRUE)
## [1] 67

Other Quantiles: quantile

This function calculates sample quantiles, aka percentiles, of a numeric vector. If you simply pass it a numeric vector with no other arguments, it will give you the five-number summary:

quantile(survey$height, na.rm = TRUE)
##   0%  25%  50%  75% 100% 
##   58   64   67   71   78

You can ask for specific quantile by using the argument probs

quantile(survey$height, na.rm = TRUE, probs = 0.3)
## 30% 
##  65

Note that quantiles are specified as probabilities so the preceding command gives the 30th percentile. You can also ask for multiple quantiles at once:

quantile(survey$height, na.rm = TRUE, probs = c(0.1, 0.3, 0.7, 0.9))
##  10%  30%  70%  90% 
## 62.0 65.0 70.0 73.3

Inter-quartile Range: IQR

Calculates the interquartile range of a numeric vector (the 75th percentile minus the 25th percentile)

IQR(survey$height, na.rm = TRUE)
## [1] 7

We could also use the quantile argument to produce this like so:

x = quantile(survey$height, na.rm = TRUE, probs = c(.25, .75))
x[2] - x[1]
## 75% 
##   7

Extrema: max, min, range, which.max and which.min

max and min do exactly what they say:

max(survey$height, na.rm = TRUE)
## [1] 78
min(survey$height, na.rm = TRUE)
## [1] 58

and can be used to calculate the range

max(survey$height, na.rm = TRUE) - min(survey$height, na.rm = TRUE)
## [1] 20

To get both the maximum and minimum at once, you can use the function range:

range(survey$height, na.rm = TRUE)
## [1] 58 78

Note that this does not compute the summary statistic called the range.

Sometimes we’re interested not just in what the max is, but who obtained the max. For example, we might want to know some other information about who the tallest/shortest person in the survey are. which.max and which.min return the row number corresponding to the maximum; we can use this to subset the data to display the full row corresponding to the maximum like so:

which.max(survey$height)
## [1] 67
survey[which.max(height)]
##     sex credits eye.color handedness height handspan height_handspan_ratio
## 1: Male       5     Green      0.894     78       24                  3.25
survey[which.min(height)]
##     sex credits eye.color handedness height handspan height_handspan_ratio
## 1: Male       5     Brown        0.6     58       21              2.761905

Categorical Variables in R: factors

R has a somewhat peculiar way of handling categorical variables. It does so by using what are called factor variables. First, an example:

school = c("SAS", "Wharton", "Wharton", "SEAS", "SAS", "Nursing", "SAS")
school_factor = factor(school)
class(school)
## [1] "character"
class(school_factor)
## [1] "factor"
school
## [1] "SAS"     "Wharton" "Wharton" "SEAS"    "SAS"     "Nursing" "SAS"
school_factor
## [1] SAS     Wharton Wharton SEAS    SAS     Nursing SAS    
## Levels: Nursing SAS SEAS Wharton

Note that when we pass the school character vector through the factor function, the output, which I’ve called school_factor, is now a "factor". Note the subtle difference in how school and school_factor are printed; the factor is most easily distinguished by the extra line starting with Levels: directly below the vector itself.

What’s going on here? First, another thing to observe, the dput output of school_factor. dput is an excellent tool to have in your repertoire for dealing with more complicated/unknown objects in R. It simply prints out the most fundamental representation of the object that’s possible; for the things we’ve seen so far, this would not be very illuminating (although you should try looking at the dput output of a data.table on your own). What happens when we dput a factor?

dput(school_factor)
## structure(c(2L, 4L, 4L, 3L, 2L, 1L, 2L), .Label = c("Nursing", 
## "SAS", "SEAS", "Wharton"), class = "factor")

Recall that school has 7 elements. The core of what we see here is that underneath the hood, the thing that has 7 elements in school_factor is now a vector of integers: c(2L, 4L, 4L, 3L, 2L, 1L, 2L).

A factor in R is a labelled integer vector. The “labels” are actually called “levels”, as we can see from the levels function:

levels(school_factor)
## [1] "Nursing" "SAS"     "SEAS"    "Wharton"

Note that the levels are in alphabetical order. The integer vector tells which label corresponds to a given element – for example, the first 2L corresponds to "SAS", which is the second element of school_factor’s levels – levels(school_factor)[2].

There are some historical reasons why categorical variables are handled like this; interested souls can read more here. Most of the time, you’ll be well-served just to keep variables as character instead of factor; the primary limitation is that categorical variables stored as character must have their levels in alphabetical order, whereas there are ways to re-order the levels of a factor variable to be in whatever order you’d like, which is often very helpful in regression settings.

Note that eye.color is currently stored as a character variable. How can we convert it to a factor? We have to use := again:

summary(survey$eye.color)
##    Length     Class      Mode 
##        89 character character
survey[ , eye.color := factor(eye.color)]
summary(survey$eye.color)
##  Black   Blue  Brown Copper  Green  Hazel Maroon   NA's 
##      7     10     59      1      5      4      1      2

Note what’s happening in the := assignment line – we’re overwriting eye.color with a factor version of itself.

Averages by Category: by

Often, we want to calculate summary statistics for a data.table broken down by the values of a categorical variable. data.table has a very simple way to perform such group-wise summaries – the by argument inside []; let’s see some examples.

First we’ll use it to compare the average height of men in the class to that of women:

survey[ , mean(height, na.rm = TRUE), by = sex]
##       sex       V1
## 1:   Male 70.37778
## 2: Female 64.50000
## 3:     NA 68.00000

If we wanted to compare the variance of height by sex, we would just change mean to var

survey[ , var(height, na.rm = TRUE), by = sex]
##       sex        V1
## 1:   Male 14.240404
## 2: Female  8.304878
## 3:     NA        NA

Note that the name of the outcome variable was automatically chosen as V1 in both cases, which might make it hard to distinguish. If we want the name to be more meaningful, we use the following syntax:

survey[ , .(variance = var(height, na.rm = TRUE)), by = sex]
##       sex  variance
## 1:   Male 14.240404
## 2: Female  8.304878
## 3:     NA        NA

We 1) wrap the computation in the j argument with .() and 2) assign the computation in j to a name, here var(height, na.rm = TRUE) is being assigned to the name variance.

Now, suppose we wanted to compare average height by eye color. To do this, simply change the by variable to eye.color:

survey[ , mean(height, na.rm = TRUE), by = eye.color]
##    eye.color       V1
## 1:     Brown 66.77966
## 2:      Blue 69.00000
## 3:     Black 68.71429
## 4:     Hazel 67.75000
## 5:     Green 71.50000
## 6:        NA 64.00000
## 7:    Maroon 74.00000
## 8:    Copper 74.00000

It’s not much harder to find summaries grouped by two or more variables. Suppose we wanted to calculate average height by sex and credits. We simply surround the argument to by with .(), just like above when we wanted to name the output:

survey[ , .(avg_height = mean(height, na.rm = TRUE)), by = .(sex, credits)]
##        sex credits avg_height
##  1:   Male     5.0   70.00000
##  2: Female     5.0   64.79167
##  3: Female      NA   64.66667
##  4: Female     4.0   65.00000
##  5: Female     6.0   63.33333
##  6:   Male     4.0   71.11111
##  7:   Male     5.5   69.60000
##  8: Female     6.5   65.00000
##  9:   Male     6.0   71.00000
## 10:   Male     4.5   72.00000
## 11:     NA     5.5   68.00000

This contains all the information we want, but it’s not displayed in a particularly convenient format. Firstly, we may like to ignore the students with sex or credits missing like so using the function is.na and the logical operators & and !:

The function is.na performed on a vector returns whether each element is missing or not:

x = c(1, 2, NA, 3, NA, 4)
is.na(x)
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE

! is the negation operator – it turns TRUE to FALSE and vice-versa:

!is.na(x)
## [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE

& is the element-wise compound operator – logical1 & logical2 is TRUE only on those elements for which both logical1 and logical2 are TRUE:

y = c(NA, 1, NA, 2, 3, NA)
is.na(y)
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
!is.na(y)
## [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE
!is.na(x) & !is.na(y)
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE

So to exclude observations for which sex or credits are missing can be accomplished like so:

survey[!is.na(sex) & !is.na(credits),
       .(avg_height = mean(height, na.rm = TRUE)),
       by = .(sex, credits)]
##       sex credits avg_height
## 1:   Male     5.0   70.00000
## 2: Female     5.0   64.79167
## 3: Female     4.0   65.00000
## 4: Female     6.0   63.33333
## 5:   Male     4.0   71.11111
## 6:   Male     5.5   69.60000
## 7: Female     6.5   65.00000
## 8:   Male     6.0   71.00000
## 9:   Male     4.5   72.00000

Lastly, it would be nice to have the results in better order; to do this, we use the argument keyby instead of by. keyby is exactly the same as by, except that it sorts the output:

survey[!is.na(sex) & !is.na(credits),
       .(avg_height = mean(height, na.rm = TRUE)),
       keyby = .(sex, credits)]
##       sex credits avg_height
## 1: Female     4.0   65.00000
## 2: Female     5.0   64.79167
## 3: Female     6.0   63.33333
## 4: Female     6.5   65.00000
## 5:   Male     4.0   71.11111
## 6:   Male     4.5   72.00000
## 7:   Male     5.0   70.00000
## 8:   Male     5.5   69.60000
## 9:   Male     6.0   71.00000

cut

In class we talked about the idea of putting numerical data into “bins” to make a histogram. More generally, it’s sometimes necessary to convert numerical data into categorical data. A classic example is converting scores from a course into letter grades. You can do this using the command cut. (In fact, this is the function I’ll use to determine your grades at the end of the semester!) First I’ll create some fake grade data for us to play with:

grades = c(67, 93, 85, 82, 88, 86, 78, 97, 74, 77, 81)

The cut function is a little tricky, so it’s worth opening the help file (?cut) to consult as you read this. Its first argument is a numeric vector: the vector we want to “cut”. It’s second argument gives the “cut points.” Let’s try this out with some simple grade boundaries:

cut(grades, c(60, 70, 80, 90, 100))
##  [1] (60,70]  (90,100] (80,90]  (80,90]  (80,90]  (80,90]  (70,80] 
##  [8] (90,100] (70,80]  (70,80]  (80,90] 
## Levels: (60,70] (70,80] (80,90] (90,100]

cut has created a factor, which we can tell because of the line "Levels: (60,70] ..." that appears in the output. Notice that by default the intervals used in cut are “open on the left, closed on the right.” You can reverse this by setting right = FALSE

cut(grades, c(60, 70, 80, 90, 100), right = FALSE)
##  [1] [60,70)  [90,100) [80,90)  [80,90)  [80,90)  [80,90)  [70,80) 
##  [8] [90,100) [70,80)  [70,80)  [80,90) 
## Levels: [60,70) [70,80) [80,90) [90,100)

This is what we’d use for grade cut-offs since they take the form “X or higher.”

The last step is to add an additional argument to cut that gives the factor levels meangingful names: in this case, letter grades:

cut(grades, c(60, 70, 80, 90, 100), 
    labels = c("D", "C", "B", "A"), right = FALSE)
##  [1] D A B B B B C A C C B
## Levels: D C B A

Notice that if you used \(n\) breakpoints, you need \(n-1\) labels since they correspond to the “gaps” between the breaks.

Exploring the Titanic

Now it’s your turn: using what you’ve learned above, you’ll take a closer look at the titanic dataset introduced in R Tutorial #1. You already know enough R to answer some interesting questions, so let’s dive in!

A Helpful Hint: Taking the mean of a vector of ones and zeros is the same thing as calculating the proportion of ones.

  1. Read the documentation file for the titanic dataset so that you know what each column in the data represents: http://www.ditraglia.com/econ103/titanic3.txt

  2. Download the data, store it in a data.table called titanic, and display the first few rows.

The URL where the data is stored is:

http://www.ditraglia.com/econ103/titanic3.csv

  1. We’ll only be using the following columns of titanic for this example: pclass, survived, sex, age, and fare. Delete the other columns from the data set. (Note that I did the same thing above with the survey data.) Display the first few rows to make sure your command worked.

  2. Convert the variables sex and pclass to factor.

  3. Using the command summary to get an overview of the dataset, answer the following questions:
  • Are there any missing values in this dataset?
  • What was the average age of passengers on the Titanic?
  • What was the average ticket price for passengers on the ship?
  • Do the ticket prices show evidence of skewness? If so, are they positively or negatively skewed?
  • Were there more men or women aboard the Titanic?
  • What proportion of the passengers survived the disaster?
  • To which social class did most of the passengers belong?
  1. Being sure to allow for missing values if necessary, calculate (or plot) and interpret the following:
  • The standard deviation of fares
  • The 90th Percentile of fares
  • Histogram of fares, colored red
  1. Create a boxplot of fare by pclass and interpret your results.
  • Explore using the horizontal and las arguments. See ?boxplot and ?par.
  1. Answer the following related to those passengers who survived the disaster:
  • What was the average age of the survivors?
  • Among the survivors, were there more men or women?
  • Do you notice anything different about social class for survivors compared to all passengers?
  1. Who paid more, on average, for passage on the titanic: men or women?

  2. Create a table of mean fares broken down by sex and pclass. How does this relate to your answer to the preceding question?

  3. How did the fraction of survivors vary by sex pclass? Can you think of a possible explanation for the pattern you see in the data?

  4. Add to titanic a categorical variable called age.group from the numerical variable age using the following cutoffs (http://www.statcan.gc.ca/concepts/definitions/age2-eng.htm):
  • Child = [0, 15)
  • Youth = [15, 25)
  • Adult = [25, 65)
  • Senior = [65, 200)
  1. Create two tables: one for men and one for women. Each table should break down survival proportions by pclass and age.group. [Hint: by = .() can accept more than two arguments] What do your results suggest? Is there any thing wrong with the way we defined the "Senior" age group in the previous part?