- Summary Stats and Graphics
- Load the Data
- Adding Columns
- Removing Columns
- Previewing Data
- Histograms: `hist`
- Scatter/Line Plots: `plot`
- Scatterplot Matrix: `pairs`
- Box Plots: `boxplot`
- One- and Two-way Tables: `table`
- `summary`
- Missing Data: `NA`
- Average: `mean`
- Variance: `var`
- Standard Deviation: `sd`
- Median: `median`
- Other Quantiles: `quantile`
- Inter-quartile Range: `IQR`
- Extrema: `max`, `min`, `range`, `which.max` and `which.min`
- Categorical Variables in R: `factor`s
- Averages by Category: `by`
- `cut`
- Exploring the Titanic
- Philadelphia Employee Wages

This tutorial has three parts. In the first part, you’ll learn how to implement the summary statistics and graphics from class using R. In the second part, you’ll use what you’ve learned to explore a real-world dataset: the passenger manifest from the Titanic. In the third (and shortest) part, you’ll explore some data relevant to the city you live in – wage data for all Philadelphia government employees.

In this section we’ll work with data from the student survey. Since results for this semester haven’t been collected yet, we’ll start with data from a previous semester:

```
#You should have data.table installed from the first R tutorial.
# if not, run the following line of code before library:
# install.packages("data.table")
library(data.table)
survey = fread("http://www.ditraglia.com/econ103/old_survey.csv")
```

Adding and removing columns from a `data.table` is done with the special assignment operator `:=`, which I read as “defined as”. There are other ways to assign and remove columns from a `data.table` or `data.frame`, but those approaches have their own issues; if you’re interested, see here, here and here for more background on this issue.

Let’s add a column to `survey` that is the ratio of `height` to `handspan`:

`survey[ , height_handspan_ratio := height/handspan]`

As always with `data.table`s, when we’re doing an operation specifying something about *columns*, we do so in the *second* argument (often referred to as simply `j`) within the square brackets `[]`. The *first* argument (similarly, `i`) tells R which *rows* we’re interested in; by leaving it blank, we’re telling R to operate on *all* rows.
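To make the roles of `i` and `j` concrete, here is a minimal sketch on a small made-up table. The name `toy` and all of its values are hypothetical stand-ins for the survey, chosen only so the results are easy to check by hand.

```r
library(data.table)

# Hypothetical toy data standing in for the survey; the values are made up
toy = data.table(sex = c("Male", "Female", "Female", "Male"),
                 height = c(70, 64, 62, 68),
                 handspan = c(22, 19, 18, 21))

# i left blank: j is computed over all rows
toy[ , mean(height)]

# i filters the rows first, then j is computed on the rows that remain
toy[sex == "Female", mean(height)]

# i alone: return the matching rows with every column
toy[height > 65]
```

Reading each line as "take these rows, then do this with the columns" is usually enough to work out what a `data.table` expression will return.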

Removing a column from a `data.table` shares basically the same syntax as adding a column; the crux is the `:=` operator. We’ll exclude columns 7 through 10:

`survey[ , 7:10 := NULL]`

There are two new things going on in this compact line:

- On the left-hand side of `:=`, we’re adding/subtracting *more than one* column. We can tell this by knowing that `7:10` is really `c(7, 8, 9, 10)`, which means `:=` is being done to the 7th, 8th, 9th, and 10th columns.
- On the right-hand side of `:=`, we use `NULL` to indicate we’d like to delete the left-hand columns. You should read this as “Columns 7:10 are defined as NULL”, which here is equivalent to deleting them. Note that we only need to say `NULL` once and it applies to all four columns on the left-hand side.

*Note: in general, it is not recommended to refer to columns by number, since the order might change unexpectedly. It’s safer to refer to columns by their name.*
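As a sketch of the name-based alternative, the same `:= NULL` pattern accepts a character vector of column names on the left-hand side. The table `toy` and its column names are hypothetical, since the names of the survey's columns 7 through 10 are not shown here.

```r
library(data.table)

# Hypothetical toy table; the column names and values are made up
toy = data.table(id = 1:3, x = c(1, 2, 3), y = c(4, 5, 6), z = c(7, 8, 9))

# Delete two columns by name instead of by position
toy[ , c("x", "y") := NULL]

# Only the untouched columns remain
names(toy)
```

Because `:=` changes the table in place, there is no need to assign the result back to `toy`.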

It’s always a good idea to take a look at your data after loading it.

The functions `head` and `tail`, introduced in R Tutorial #1, make this easy:

`head(survey)`

```
##       sex credits eye.color handedness height handspan
## 1:   Male       5     Brown        1.0     67     20.0
## 2: Female       5     Brown        0.4     63     19.5
## 3: Female      NA     Brown        0.6     62     19.0
## 4: Female       5     Brown        0.6     65     19.5
## 5: Female       4     Brown        1.0     62     18.5
## 6: Female      NA     Brown        1.0     68     18.5
##    height_handspan_ratio
## 1:              3.350000
## 2:              3.230769
## 3:              3.263158
## 4:              3.333333
## 5:              3.351351
## 6:              3.675676
```

`tail(survey)`

```
##       sex credits eye.color handedness height handspan
## 1:   Male       4     Hazel       1.00     73     21.0
## 2:   Male       5     Brown       0.56     66     20.0
## 3: Female       5     Brown       0.80     63     18.0
## 4: Female       6     Brown       0.80     64     21.0
## 5: Female       6     Brown       1.00     61     17.5
## 6:   Male       5     Brown       0.70     71     24.5
##    height_handspan_ratio
## 1:              3.476190
## 2:              3.300000
## 3:              3.500000
## 4:              3.047619
## 5:              3.485714
## 6:              2.897959
```
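Both functions show six rows by default, and both take an optional second argument for the number of rows. A quick sketch on a made-up table (`toy` and its values are hypothetical):

```r
library(data.table)

# Hypothetical toy table with ten rows; the values are made up
toy = data.table(id = 1:10, score = (1:10) / 10)

# First three rows instead of the default six
head(toy, 3)

# Last two rows
tail(toy, 2)
```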

Histograms: `hist`

The command for a histogram in R is `hist`. Its input has to be a *vector*, not a `data.table`. Conveniently, we can call this function in the `j` argument, where a column name refers to that column as a vector. For example, we can make a histogram of the column `handedness` as follows:

`survey[ , hist(handedness)]`
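The sketch below illustrates the vector requirement on a small made-up table (`toy` and its scores are hypothetical). It uses `plot = FALSE`, an argument of `hist` that returns the bin counts without drawing anything, so the two ways of calling `hist` can be compared directly.

```r
library(data.table)

# Hypothetical toy data standing in for the survey; the scores are made up
toy = data.table(handedness = c(1, 0.4, 0.6, 0.6, 1, 1, 0.8, -0.5))

# Handing hist() the whole table fails, since a table is not a numeric vector
failed = inherits(try(hist(toy), silent = TRUE), "try-error")
failed

# Inside j, the column name refers to the column as a plain vector
toy[ , is.numeric(handedness)]

# So counting inside j matches counting on the extracted column
counts_in_j = toy[ , hist(handedness, plot = FALSE)$counts]
counts_direct = hist(toy$handedness, plot = FALSE)$counts
identical(counts_in_j, counts_direct)
```

In other words, `toy[ , hist(handedness)]` and `hist(toy$handedness)` bin the same data; the `j` form just saves repeating the table name.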

The default title and axis labels aren’t very nice. We can change them using the following options:

```
survey[ , hist(handedness, xlab = 'Handedness Score',
               main = 'Histogram of Handedness Scores',
               ylab = '# of Students')]
```

See `?hist` or `?par` for a full list of options available during plotting; see also the exercises for a chance to explore some of these.

Recall that R doesn’t care if you use single or double quotes, as long as you’re consistent (the symbol you start with is the symbol you end with).

You can change the number of bins via the argument `breaks`:

```
survey[ , hist(handedness, breaks = 20, xlab = 'Handedness Score',
               main = 'Histogram of Handedness Scores')]
```
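One detail worth knowing: when `breaks` is a single number, R treats it as a suggestion and rounds to "pretty" breakpoints, so the number of bins you get may not match exactly. If you need exact bins, `breaks` also accepts a vector of breakpoints. The sketch below uses simulated scores (the vector `scores` is made up, not the survey data) and `plot = FALSE` to inspect the bins without drawing.

```r
# Hypothetical scores between -1 and 1; simulated, not the survey data
set.seed(1)
scores = runif(200, min = -1, max = 1)

# A single number is only a suggestion, so the bin count may differ from 7
h_suggested = hist(scores, breaks = 7, plot = FALSE)
length(h_suggested$counts)

# A vector of breakpoints is used exactly as given: 9 points make 8 bins
h_exact = hist(scores, breaks = seq(-1, 1, by = 0.25), plot = FALSE)
length(h_exact$counts)
```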

By default, R produces histograms in terms of *frequencies*. That is, it counts the *number* of observations in each bin. To switch to a *relative* scale, on which the bars reflect proportions of the data rather than counts and have a total area of one, we set the argument `freq` to `FALSE`:

```
survey[ , hist(handedness, breaks = 20, freq = FALSE,
               xlab = 'Handedness Score', main = 'Histogram of Handedness')]
```

Notice that if you don’t supply a label for the y-axis, R defaults to `"Frequency"` unless you set `freq = FALSE`, in which case it uses `"Density"`. We’ll learn about densities later in the course.
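As a preview, here is a small sketch of how the two scales relate, using simulated scores (the vector `scores` is made up). The object returned by `hist` stores both the counts and the densities for each bin, and `plot = FALSE` lets us look at them without drawing.

```r
# Hypothetical scores between -1 and 1; simulated, not the survey data
set.seed(1)
scores = runif(200, min = -1, max = 1)

h = hist(scores, breaks = 20, plot = FALSE)

# Counts: the number of observations per bin, summing to the sample size
sum(h$counts)

# Densities: proportion per bin divided by bin width, so the bar areas
# (height times width) sum to one
sum(h$density * diff(h$breaks))
```

So with `freq = FALSE` the *area* of each bar, not its height, is the proportion of observations in that bin.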

Scatter/Line Plots: `plot`

The general-purpose command for plotting in R is rather unimaginatively called `plot`. This is an incredibly powerful command, but we’ll mostly use it to make simple scatter plots. To plot `height` on the x-axis and `handspan` on the y-axis, we use the following command:

`survey[ , plot(height, handspan)]`

If you swap the order of the two arguments, the axes swap as well:

`survey[ , plot(handspan, height)]`
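If keeping track of argument order feels error-prone, two alternatives may help: naming the arguments, or using R's formula notation, which reads as "y against x". The sketch below uses a few made-up measurements rather than the survey itself.

```r
# Hypothetical measurements; the values are made up
height = c(67, 63, 62, 65, 68, 73)
handspan = c(20, 19.5, 19, 19.5, 18.5, 21)

# Positional: the first argument goes on the x-axis, the second on the y-axis
plot(height, handspan)

# Named arguments: the order no longer matters
plot(y = handspan, x = height)

# Formula notation: handspan (y) against height (x)
plot(handspan ~ height)
```

All three calls put `height` on the x-axis and `handspan` on the y-axis.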