Research Plumbing I

Things you never worry about until they break.

Francis J. DiTraglia

University of Oxford

🔧 What is this about?

Plumbing isn’t exciting but you sure notice when it breaks!
Many “boring” but important tasks in empirical work:
- Loading, merging, reshaping, and cleaning data
- Creating tables
- Writing up results
Working the smart way: faster, less error prone, reproducible
💻 Welcome to the computer age!
Split across this lecture and another later in the term

`R Markdown` / `Quarto`

Dynamic Documents

I write these slides using a mix of markdown, R, and LaTeX.
Behind the scenes, pandoc and knitr dynamically convert everything into reveal.js slides.
I could just have easily generated beamer slides like these ones or a LaTeX Document like this one.
If I were less virtuous, I could even have generated a Microsoft Word Document!
I can do all this, and more, with hardly any effort using the wonderful tools Quarto and R Markdown.

---
title: "An Example Document"
---

## This is a Slide 
- Markdown is fun! You can write in *italics* or even in **boldface**! 
- Including [links](https://ditraglia.com/erm) is easy too. 

$$
\int_{-\infty}^\infty f(x)\, dx = 1
$$

Look it's some R code!
```{r}
message <- 'hello world'
print(message)
```

This is a Slide

Markdown is fun! You can write in italics or even in boldface!
Including links is easy too.

\[ \int_{-\infty}^\infty f(x)\, dx = 1 \]

Look it’s some R code!

message <- 'hello world'
print(message)

[1] "hello world"

Quarto versus R Markdown

They are extremely similar.
Quarto is a newer and more powerful, and has first-class support for languages besides R, including Julia and Python
R Markdown is older, better-documented, and less buggy.
Code chunks, Markdown, and LaTeX work the same.
Configuration details are a bit different.
Use whichever you prefer
Note that Quarto supports emoji 😍

💪 Exercise (\(\infty\) minutes)

Create a new R Markdown or Quarto document: your choice. I suggest .html output. Use this document to take notes and complete exercises in all future lectures!

Warning

As part of this course you will need to learn how to use either R Markdown or Quarto by reading the documentation linked above. This will include learning how to typeset equations using LaTeX syntax. We will help you if you get stuck!

Loading Data

How should you store your data?

Always use flat text, .csv or .tsv, unless you have extremely specialized needs (e.g. GIS)
Never use a proprietary binary format like STATA’s .dta
Never use Excel

Note

Did you know that STATA .dta files aren’t backwards compatible? A .dta file created with a more recent version of STATA literally cannot be opened with an older version.

👼 Loading the data of the virtuous

Some angelic beings store their data in a reasonable way!
readr package loads .csv, .tsv and related files.
readr is in tidyverse so you already installed it!
readr functions are superior to similar base R functions
References:
- Data Import Chapter from R for Data Science
- readr cheat sheet
- readr website

👹 Hell is other people’s data.

tidyverse has two helpful packages for importing the data of damned souls who writhe in eternal torment:

readxl imports .xls and xlsx files
- readxl website
- readxl cheatsheet
haven imports SAS, SPSS, and Stata files
- haven website

“Core” Tidyverse Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

This shows you which tidyverse pacakges are loaded automatically when you run library(tidyverse).
Some others, e.g. haven have to be loaded individually.

Example: `read_csv()`

url <- 'https://ditraglia.com/data/height-handspan.csv'
height_handspan <- read_csv(url)

height_handspan

# A tibble: 326 × 2
   height handspan
    <dbl>    <dbl>
 1     73     22.5
 2     65     17  
 3     69     21  
 4     71     25.5
 5     78     25  
 6     68     20  
 7     75     20.5
 8     72     24.5
 9     63     18  
10     67     21.5
# ℹ 316 more rows

Example - `read_dta()`

library(haven)
url <- 'https://ditraglia.com/data/lakisha_aer.dta'
lakisha <- read_dta(url)
lakisha

# A tibble: 4,870 × 65
   id    ad    education ofjobs yearsexp honors volunteer military empholes
   <chr> <chr>     <dbl>  <dbl>    <dbl>  <dbl>     <dbl>    <dbl>    <dbl>
 1 b     1             4      2        6      0         0        0        1
 2 b     1             3      3        6      0         1        1        0
 3 b     1             4      1        6      0         0        0        0
 4 b     1             3      4        6      0         1        0        1
 5 b     1             3      3       22      0         0        0        0
 6 b     1             4      2        6      1         0        0        0
 7 b     1             4      2        5      0         1        0        0
 8 b     1             3      4       21      0         1        0        1
 9 b     1             4      3        3      0         0        0        0
10 b     1             4      2        6      0         1        0        0
# ℹ 4,860 more rows
# ℹ 56 more variables: occupspecific <dbl>, occupbroad <dbl>,
#   workinschool <dbl>, email <dbl>, computerskills <dbl>, specialskills <dbl>,
#   firstname <chr>, sex <chr>, race <chr>, h <dbl>, l <dbl>, call <dbl>,
#   city <chr>, kind <chr>, adid <dbl>, fracblack <dbl>, fracwhite <dbl>,
#   lmedhhinc <dbl>, fracdropout <dbl>, fraccolp <dbl>, linc <dbl>, col <dbl>,
#   expminreq <chr>, schoolreq <chr>, eoe <dbl>, parent_sales <dbl>, …

Getting and Setting File Paths

getwd() returns the current working directory
setwd() changes the current working directory
On Windows you have two options:
1. Replace backslashes with forward slashes: "C:/Users/username/Documents/data.csv"
2. Use double backslashes: "C:\\Users\\username\\Documents\\data.csv"
Better plan: use R Studio Projects and relative paths.

💪 Exercise A - (8 min)

Download https://ditraglia.com/data/STAR.csv and save this file on your local machine. Then load it with read_csv(). Note that this will require you to specify the path to this file on your local machine.
The file final5.dta from the Angrist data archive contains data from the article “Using Maimonides Rule to estimate the Effect of Class Size on Student Achievement” by Angrist & Lavy. Locate and download this file. Then try to load it with read_dta(). You may get an error. If so, see the section “Character encoding” in the associated R help file and follow the instructions given there.

Wrangling Data

`gradebook`

library(tidyverse)
set.seed(92815)
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561), 
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 
           'Ethelburga', 'Felix'),
  quiz1 = round(rnorm(6, 65, 15)),
  quiz2 = round(rnorm(6, 88, 5)),
  quiz3 = round(rnorm(6, 75, 10)),
  midterm1 = round(rnorm(6, 75, 10)),
  midterm2 = round(rnorm(6, 80, 8)), 
  final = round(rnorm(6, 78, 11)))
gradebook

# A tibble: 6 × 8
  student_id name       quiz1 quiz2 quiz3 midterm1 midterm2 final
       <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>
1     192297 Alice         64    96    68       81       90    99
2     291857 Bob           58    91    91       75       75    79
3     500286 Charlotte     70    94    71       81       70    74
4     449192 Dante         57    85    84       83       94    83
5     372152 Ethelburga    74    91    70       63       73    96
6     627561 Felix         77    86    68       78       83    75

`tidyselect` in `dplyr`

So far used select() by writing column names in full
This is cumbersome when working with large and complicated tibbles.
tidyselect provides a helpful collection of functions and operators to make selection easier, faster and more flexible.
For more details see the documentation.

`tidyselect` with `gradebook`

Selecting the quiz columns:

gradebook |> 
  select(quiz1, quiz2, quiz3)

# A tibble: 6 × 3
  quiz1 quiz2 quiz3
  <dbl> <dbl> <dbl>
1    64    96    68
2    58    91    91
3    70    94    71
4    57    85    84
5    74    91    70
6    77    86    68

Easy enough in this case, but once I taught a course that had eleven quizzes!
Tedious to type them all out, and easy to make a mistake and forget one.

`starts_with()`

Select columns based on a common prefix

gradebook |> 
  select(starts_with('quiz'))

# A tibble: 6 × 3
  quiz1 quiz2 quiz3
  <dbl> <dbl> <dbl>
1    64    96    68
2    58    91    91
3    70    94    71
4    57    85    84
5    74    91    70
6    77    86    68

`ends_with()`

Select columns based on a common suffix

gradebook |> 
  select(ends_with('2'))

# A tibble: 6 × 2
  quiz2 midterm2
  <dbl>    <dbl>
1    96       90
2    91       75
3    94       70
4    85       94
5    91       73
6    86       83

`contains()`

Select columns that contain a particular string

gradebook |> 
  select(contains('iz'))

# A tibble: 6 × 3
  quiz1 quiz2 quiz3
  <dbl> <dbl> <dbl>
1    64    96    68
2    58    91    91
3    70    94    71
4    57    85    84
5    74    91    70
6    77    86    68

`num_range()`

Select based on both a prefix and a numeric range

gradebook |> 
  select(num_range('quiz', 2:3))

# A tibble: 6 × 2
  quiz2 quiz3
  <dbl> <dbl>
1    96    68
2    91    91
3    94    71
4    85    84
5    91    70
6    86    68

Note

If you know regular expressions, check out the selection helper matches().

`where()`

Accepts a function as input; applies to every column of the tibble and returns those where the function returns TRUE

gradebook |> 
  select(where(is.numeric))

# A tibble: 6 × 7
  student_id quiz1 quiz2 quiz3 midterm1 midterm2 final
       <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>
1     192297    64    96    68       81       90    99
2     291857    58    91    91       75       75    79
3     500286    70    94    71       81       70    74
4     449192    57    85    84       83       94    83
5     372152    74    91    70       63       73    96
6     627561    77    86    68       78       83    75

💪 Exercise B - (8 min)

Use ends_with() to select the columns quiz2 and midterm2 from gradebook with a minimum of typing.
Use contains() to select the columns whose names contain the abbreviation for “Empirical Research Methods.”
May the 4th be with you (belatedly)! The dplyr package includes a built-in dataset called starwars. Use the glimpse() function to get a quick overview of this dataset, and then read the associated help file before completing the following:
1. Select only the columns of starwars that contain character data.
2. Select only the columns whose names contain an underscore.
3. Select only the columns that are either numeric or whose names end with “color.”

Computing the average on each quiz

It’s easy to compute the average on each quiz:

gradebook |> 
  summarize(quiz1_avg = mean(quiz1),  
            quiz2_avg = mean(quiz2),  
            quiz3_avg = mean(quiz3))

# A tibble: 1 × 3
  quiz1_avg quiz2_avg quiz3_avg
      <dbl>     <dbl>     <dbl>
1      66.7      90.5      75.3

But would you really want to type this out eleven times in a course with that many quizzes?!

Column-wise Operations: `across()`

gradebook |> 
  summarize(across(starts_with('quiz'), mean, .names = '{.col}_avg'))

# A tibble: 1 × 3
  quiz1_avg quiz2_avg quiz3_avg
      <dbl>     <dbl>     <dbl>
1      66.7      90.5      75.3

1st Argument: .cols specifies columns to work with
- Specify explicitly e.g. c('col1name', 'col2name')
- Or use use tidyselect
2nd Argument: .fns specifies function(s) to apply
3rd Argument (optional): .names sets rule for naming transformed columns, using syntax from glue package

Supplying a Custom Function

zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

gradebook |> 
  mutate(across(starts_with('quiz'), zscore, 
                .names = '{.col}_zscore')) |> 
  select(ends_with('zscore'))

# A tibble: 6 × 3
  quiz1_zscore quiz2_zscore quiz3_zscore
         <dbl>        <dbl>        <dbl>
1       -0.320        1.27        -0.752
2       -1.04         0.116        1.61 
3        0.400        0.809       -0.444
4       -1.16        -1.27         0.889
5        0.880        0.116       -0.547
6        1.24        -1.04        -0.752

Supplying a list of functions

mean_var <- list(
  mean = \(x) mean(x, na.rm = TRUE),
  var = \(x) var(x, na.rm = TRUE)
)

gradebook |> 
  summarize(across(starts_with('midterm'), mean_var, 
                   .names = '{.col}_{.fn}'))

# A tibble: 1 × 4
  midterm1_mean midterm1_var midterm2_mean midterm2_var
          <dbl>        <dbl>         <dbl>        <dbl>
1          76.8         53.8          80.8         95.0

💪 Exercise C - (10 min)

Create a table of sample standard deviations for each of the quizzes in gradebook, where the columns are named according to [COLUMN NAME]_sd.
Read the help file for the function n_distinct() in dplyr. Use this function to count up the number of distinct values in each column of starwars that contains character data. Name your results according to n_[COLUMN NAME]s.
Read the help file for the dplyr function n(). Combine it with across() and other dplyr functions you have learned to display the following table. Each row should correspond to a homeworld that occurs at least twice in the starwars tibble. There should be three columns, counting up the number of distinct values of sex, species, and eye_color. What happens to the observations for which homeworld is missing?
For each species with at least two observations, calculate the sample median of all the numeric columns in starwars, dropping any missing observations. Why do we obtain the result that we do for members of the “Kaminoan” species?
Calculate the std. dev. and interquartile range of all numeric columns of starwars, dropping missing observations. Attach meaningful names to your results.

Example: STAR Project

STAR: “Student-Teacher Achievement Ratio”
Four-year longitudinal study (panel data)
11,601 Students in Tennessee from 1985-1989
Randomly assign students to classes of different sizes.
Causal effect of class size in early grades?
Mosteller (1997)

STAR Dataset

Name	Description
`race`	Student’s race
`classtype`	Type of kindergarten class
`g4math`	Math test score in 4th grade
`g4reading`	Reading test score in 4th grade
`yearssmall`	Number of years in small classes
`hsgrad`	High school graduation (did graduate = 1)

STAR Dataset

library(tidyverse)
star <- read_csv('https://ditraglia.com/data/STAR.csv')
star

# A tibble: 6,325 × 6
    race classtype yearssmall hsgrad g4math g4reading
   <dbl>     <dbl>      <dbl>  <dbl>  <dbl>     <dbl>
 1     1         3          0     NA     NA        NA
 2     2         3          0     NA    706       661
 3     1         3          0      1    711       750
 4     2         1          4     NA    672       659
 5     1         2          0     NA     NA        NA
 6     1         3          0     NA     NA        NA
 7     1         1          4     NA    668       657
 8     1         3          0     NA     NA        NA
 9     1         1          4      1    709       725
10     1         2          0      1    698       692
# ℹ 6,315 more rows

How are we supposed to tell what these values mean?!
Recode to make the dataset self-documenting

Variables to Recode

Name	Description
`race`	White = 1, Black = 2, Asian = 3, Hispanic = 4, Native American = 5, Others = 6
`classtype`	Small = 1, Regular = 2, Regular with Aid = 3
`hsgrad`	Did graduate = 1, Did not graduate = 0

Recoding Variables

Use case_match() from dplyr to recode values

star <- star |> 
  mutate(classtype = case_match(classtype,
                                1 ~ 'small',
                                2 ~ 'regular',
                                3 ~ 'regular+aid'))
star

# A tibble: 6,325 × 6
    race classtype   yearssmall hsgrad g4math g4reading
   <dbl> <chr>            <dbl>  <dbl>  <dbl>     <dbl>
 1     1 regular+aid          0     NA     NA        NA
 2     2 regular+aid          0     NA    706       661
 3     1 regular+aid          0      1    711       750
 4     2 small                4     NA    672       659
 5     1 regular              0     NA     NA        NA
 6     1 regular+aid          0     NA     NA        NA
 7     1 small                4     NA    668       657
 8     1 regular+aid          0     NA     NA        NA
 9     1 small                4      1    709       725
10     1 regular              0      1    698       692
# ℹ 6,315 more rows

💪 Exercise D - (3 min)

Recode the race and hsgrad variables from star as indicated above.

Making Tables

Methods for Producing a Table

Ranked from worst to best:

Take a screenshot of STATA output and stretch it so the text is distorted, grainy, and totally unreadable.
Copy-and-paste individual results into Excel table, edit manually, and paste it into a Word document.
Automatically export regression results into Excel, manually edit the table, and paste into a Word document.
Use reproducible tools like knitr::kable, gt and modelsummary to generate and format tables, and include them in any document format.

Publication Quality Tables

Today: datasummary and knitr::kable
Future lectures: modelsummary
Some prefer stargazer. You can use this if you prefer.
For the ultimate in precision table creation, try gt.

`datasummary_skim()`

Just pass it a tibble to get a simple table of summary statistics!
If you don’t want the histograms set histogram = FALSE

library(modelsummary)
datasummary_skim(star)

	Unique (#)	Missing (%)	Mean	SD	Min	Median	Max
yearssmall	5	0	1.0	1.5	0.0	0.0	4.0
g4math	180	62	708.8	43.1	487.0	710.0	821.0
g4reading	190	63	721.2	52.4	528.0	723.0	836.0

Categorical Variables

Notice that datasummary_skim() dropped all the categorical variables.
This is because standard deviation / mean don’t really make sense for these.
If you have string or factor it does the right thing; if you use numbers as the original star it doesn’t know what do to.

datasummary_skim(star, type = 'categorical')

		N	%
race	Asian	14	0.2
	Black	2058	32.5
	Hispanic	5	0.1
	Native American	2	0.0
	Other	9	0.1
	White	4234	66.9
	NA	3	0.0
classtype	regular	2194	34.7
	regular+aid	2231	35.3
	small	1900	30.0
hsgrad	graduate	2539	40.1
	non-graduate	508	8.0
	NA	3278	51.8

`datasummary_balance()`

Compare summary statistics across categories defined by a “grouping” variable
Syntax: datasummary_balance( ~ [GROUPING_VAR], data)

star |> 
  select(yearssmall, g4math, g4reading, classtype) |>  
  datasummary_balance(~ classtype, data = _)

`datasummary_balance()`

	regular (N=2194)		regular+aid (N=2231)		small (N=1900)
	Mean	Std. Dev.	Mean	Std. Dev.	Mean	Std. Dev.
yearssmall	0.2	0.7	0.2	0.7	2.7	1.3
g4math	709.5	41.0	707.6	44.7	709.2	43.6
g4reading	719.9	53.2	720.7	52.4	723.4	51.5

Custom Tables with `knitr::kable`

my_table <- star |> 
  group_by(classtype) |> 
  summarize(across(starts_with('g4'), \(x) mean(x, na.rm = TRUE),
                   .names = '{.col}_avg')) 
my_table

# A tibble: 3 × 3
  classtype   g4math_avg g4reading_avg
  <chr>            <dbl>         <dbl>
1 regular           710.          720.
2 regular+aid       708.          721.
3 small             709.          723.

my_table |> 
  knitr::kable(digits = 1, caption = 'Average grade 4 test scores',
               col.names = c('Class Type', 'Math', 'Reading'))

Average grade 4 test scores
Class Type	Math	Reading
regular	709.5	719.9
regular+aid	707.6	720.7
small	709.2	723.4

Research Plumbing I

🔧 What is this about?

R Markdown / Quarto

Dynamic Documents

This is a Slide

Quarto versus R Markdown

💪 Exercise (\(\infty\) minutes)

Loading Data

How should you store your data?

👼 Loading the data of the virtuous

👹 Hell is other people’s data.

“Core” Tidyverse Packages

Example: read_csv()

Example - read_dta()

Getting and Setting File Paths

💪 Exercise A - (8 min)

Wrangling Data

gradebook

tidyselect in dplyr

tidyselect with gradebook

starts_with()

ends_with()

contains()

num_range()

where()

💪 Exercise B - (8 min)

Computing the average on each quiz

Column-wise Operations: across()

Supplying a Custom Function

Supplying a list of functions

💪 Exercise C - (10 min)

Example: STAR Project

STAR Dataset

STAR Dataset

Variables to Recode

Recoding Variables

💪 Exercise D - (3 min)

Making Tables

Methods for Producing a Table

Publication Quality Tables

datasummary_skim()

Categorical Variables

datasummary_balance()

datasummary_balance()

Custom Tables with knitr::kable

`R Markdown` / `Quarto`

Example: `read_csv()`

Example - `read_dta()`

`gradebook`

`tidyselect` in `dplyr`

`tidyselect` with `gradebook`

`starts_with()`

`ends_with()`

`contains()`

`num_range()`

`where()`

Column-wise Operations: `across()`

`datasummary_skim()`

`datasummary_balance()`

`datasummary_balance()`

Custom Tables with `knitr::kable`