Research Plumbing I

Things you never worry about until they break.

Francis J. DiTraglia

University of Oxford

🔧 What is this about?

  • Plumbing isn’t exciting but you sure notice when it breaks!
  • Many “boring” but important tasks in empirical work:
    • Loading, merging, reshaping, and cleaning data
    • Creating tables
    • Writing up results
  • Working the smart way: faster, less error prone, reproducible
  • 💻 Welcome to the computer age!
  • Split across this lecture and another later in the term

R Markdown / Quarto

Dynamic Documents

---
title: "An Example Document"
---

## This is a Slide 
- Markdown is fun! You can write in *italics* or even in **boldface**! 
- Including [links](https://ditraglia.com/erm) is easy too. 

$$
\int_{-\infty}^\infty f(x)\, dx = 1
$$

Look it's some R code!
```{r}
message <- 'hello world'
print(message)
```

This is a Slide

  • Markdown is fun! You can write in italics or even in boldface!
  • Including links is easy too.

\[ \int_{-\infty}^\infty f(x)\, dx = 1 \]

Look it’s some R code!

message <- 'hello world'
print(message)
[1] "hello world"

Quarto versus R Markdown

  • They are extremely similar.
  • Quarto is a newer and more powerful, and has first-class support for languages besides R, including Julia and Python
  • R Markdown is older, better-documented, and less buggy.
  • Code chunks, Markdown, and LaTeX work the same.
  • Configuration details are a bit different.
  • Use whichever you prefer
  • Note that Quarto supports emoji 😍

💪 Exercise (\(\infty\) minutes)

Create a new R Markdown or Quarto document: your choice. I suggest .html output. Use this document to take notes and complete exercises in all future lectures!

Warning

As part of this course you will need to learn how to use either R Markdown or Quarto by reading the documentation linked above. This will include learning how to typeset equations using LaTeX syntax. We will help you if you get stuck!

Loading Data

How should you store your data?

  • Always use flat text, .csv or .tsv, unless you have extremely specialized needs (e.g. GIS)
  • Never use a proprietary binary format like STATA’s .dta
  • Never use Excel

Note

Did you know that STATA .dta files aren’t backwards compatible? A .dta file created with a more recent version of STATA literally cannot be opened with an older version.

👼 Loading the data of the virtuous

  • Some angelic beings store their data in a reasonable way!
  • readr package loads .csv, .tsv and related files.
  • readr is in tidyverse so you already installed it!
  • readr functions are superior to similar base R functions
  • References:

👹 Hell is other people’s data.

tidyverse has two helpful packages for importing the data of damned souls who writhe in eternal torment:

“Core” Tidyverse Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  • This shows you which tidyverse pacakges are loaded automatically when you run library(tidyverse).
  • Some others, e.g. haven have to be loaded individually.

Example: read_csv()

url <- 'https://ditraglia.com/data/height-handspan.csv'
height_handspan <- read_csv(url)

height_handspan
# A tibble: 326 × 2
   height handspan
    <dbl>    <dbl>
 1     73     22.5
 2     65     17  
 3     69     21  
 4     71     25.5
 5     78     25  
 6     68     20  
 7     75     20.5
 8     72     24.5
 9     63     18  
10     67     21.5
# ℹ 316 more rows

Example - read_dta()

library(haven)
url <- 'https://ditraglia.com/data/lakisha_aer.dta'
lakisha <- read_dta(url)
lakisha
# A tibble: 4,870 × 65
   id    ad    education ofjobs yearsexp honors volunteer military empholes
   <chr> <chr>     <dbl>  <dbl>    <dbl>  <dbl>     <dbl>    <dbl>    <dbl>
 1 b     1             4      2        6      0         0        0        1
 2 b     1             3      3        6      0         1        1        0
 3 b     1             4      1        6      0         0        0        0
 4 b     1             3      4        6      0         1        0        1
 5 b     1             3      3       22      0         0        0        0
 6 b     1             4      2        6      1         0        0        0
 7 b     1             4      2        5      0         1        0        0
 8 b     1             3      4       21      0         1        0        1
 9 b     1             4      3        3      0         0        0        0
10 b     1             4      2        6      0         1        0        0
# ℹ 4,860 more rows
# ℹ 56 more variables: occupspecific <dbl>, occupbroad <dbl>,
#   workinschool <dbl>, email <dbl>, computerskills <dbl>, specialskills <dbl>,
#   firstname <chr>, sex <chr>, race <chr>, h <dbl>, l <dbl>, call <dbl>,
#   city <chr>, kind <chr>, adid <dbl>, fracblack <dbl>, fracwhite <dbl>,
#   lmedhhinc <dbl>, fracdropout <dbl>, fraccolp <dbl>, linc <dbl>, col <dbl>,
#   expminreq <chr>, schoolreq <chr>, eoe <dbl>, parent_sales <dbl>, …

Getting and Setting File Paths

  • getwd() returns the current working directory
  • setwd() changes the current working directory
  • On Windows you have two options:
    1. Replace backslashes with forward slashes: "C:/Users/username/Documents/data.csv"
    2. Use double backslashes: "C:\\Users\\username\\Documents\\data.csv"
  • Better plan: use R Studio Projects and relative paths.

💪 Exercise A - (8 min)

  1. Download https://ditraglia.com/data/STAR.csv and save this file on your local machine. Then load it with read_csv(). Note that this will require you to specify the path to this file on your local machine.
  2. The file final5.dta from the Angrist data archive contains data from the article “Using Maimonides Rule to estimate the Effect of Class Size on Student Achievement” by Angrist & Lavy. Locate and download this file. Then try to load it with read_dta(). You may get an error. If so, see the section “Character encoding” in the associated R help file and follow the instructions given there.

Wrangling Data

gradebook

library(tidyverse)
set.seed(92815)
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561), 
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 
           'Ethelburga', 'Felix'),
  quiz1 = round(rnorm(6, 65, 15)),
  quiz2 = round(rnorm(6, 88, 5)),
  quiz3 = round(rnorm(6, 75, 10)),
  midterm1 = round(rnorm(6, 75, 10)),
  midterm2 = round(rnorm(6, 80, 8)), 
  final = round(rnorm(6, 78, 11)))
gradebook
# A tibble: 6 × 8
  student_id name       quiz1 quiz2 quiz3 midterm1 midterm2 final
       <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>
1     192297 Alice         64    96    68       81       90    99
2     291857 Bob           58    91    91       75       75    79
3     500286 Charlotte     70    94    71       81       70    74
4     449192 Dante         57    85    84       83       94    83
5     372152 Ethelburga    74    91    70       63       73    96
6     627561 Felix         77    86    68       78       83    75

tidyselect in dplyr

  • So far used select() by writing column names in full
  • This is cumbersome when working with large and complicated tibbles.
  • tidyselect provides a helpful collection of functions and operators to make selection easier, faster and more flexible.
  • For more details see the documentation.

tidyselect with gradebook

Selecting the quiz columns:

gradebook |> 
  select(quiz1, quiz2, quiz3)
# A tibble: 6 × 3
  quiz1 quiz2 quiz3
  <dbl> <dbl> <dbl>
1    64    96    68
2    58    91    91
3    70    94    71
4    57    85    84
5    74    91    70
6    77    86    68
  • Easy enough in this case, but once I taught a course that had eleven quizzes!
  • Tedious to type them all out, and easy to make a mistake and forget one.

starts_with()

Select columns based on a common prefix

gradebook |> 
  select(starts_with('quiz'))
# A tibble: 6 × 3
  quiz1 quiz2 quiz3
  <dbl> <dbl> <dbl>
1    64    96    68
2    58    91    91
3    70    94    71
4    57    85    84
5    74    91    70
6    77    86    68

ends_with()

Select columns based on a common suffix

gradebook |> 
  select(ends_with('2'))
# A tibble: 6 × 2
  quiz2 midterm2
  <dbl>    <dbl>
1    96       90
2    91       75
3    94       70
4    85       94
5    91       73
6    86       83

contains()

Select columns that contain a particular string

gradebook |> 
  select(contains('iz'))
# A tibble: 6 × 3
  quiz1 quiz2 quiz3
  <dbl> <dbl> <dbl>
1    64    96    68
2    58    91    91
3    70    94    71
4    57    85    84
5    74    91    70
6    77    86    68

num_range()

Select based on both a prefix and a numeric range

gradebook |> 
  select(num_range('quiz', 2:3))
# A tibble: 6 × 2
  quiz2 quiz3
  <dbl> <dbl>
1    96    68
2    91    91
3    94    71
4    85    84
5    91    70
6    86    68

Note

If you know regular expressions, check out the selection helper matches().

where()

Accepts a function as input; applies to every column of the tibble and returns those where the function returns TRUE

gradebook |> 
  select(where(is.numeric))
# A tibble: 6 × 7
  student_id quiz1 quiz2 quiz3 midterm1 midterm2 final
       <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>
1     192297    64    96    68       81       90    99
2     291857    58    91    91       75       75    79
3     500286    70    94    71       81       70    74
4     449192    57    85    84       83       94    83
5     372152    74    91    70       63       73    96
6     627561    77    86    68       78       83    75

💪 Exercise B - (8 min)

  1. Use ends_with() to select the columns quiz2 and midterm2 from gradebook with a minimum of typing.
  2. Use contains() to select the columns whose names contain the abbreviation for “Empirical Research Methods.”
  3. May the 4th be with you (belatedly)! The dplyr package includes a built-in dataset called starwars. Use the glimpse() function to get a quick overview of this dataset, and then read the associated help file before completing the following:
    1. Select only the columns of starwars that contain character data.
    2. Select only the columns whose names contain an underscore.
    3. Select only the columns that are either numeric or whose names end with “color.”

Computing the average on each quiz

It’s easy to compute the average on each quiz:

gradebook |> 
  summarize(quiz1_avg = mean(quiz1),  
            quiz2_avg = mean(quiz2),  
            quiz3_avg = mean(quiz3)) 
# A tibble: 1 × 3
  quiz1_avg quiz2_avg quiz3_avg
      <dbl>     <dbl>     <dbl>
1      66.7      90.5      75.3

But would you really want to type this out eleven times in a course with that many quizzes?!

Column-wise Operations: across()

gradebook |> 
  summarize(across(starts_with('quiz'), mean, .names = '{.col}_avg'))
# A tibble: 1 × 3
  quiz1_avg quiz2_avg quiz3_avg
      <dbl>     <dbl>     <dbl>
1      66.7      90.5      75.3
  • 1st Argument: .cols specifies columns to work with
    • Specify explicitly e.g. c('col1name', 'col2name')
    • Or use use tidyselect
  • 2nd Argument: .fns specifies function(s) to apply
  • 3rd Argument (optional): .names sets rule for naming transformed columns, using syntax from glue package

Supplying a Custom Function

zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

gradebook |> 
  mutate(across(starts_with('quiz'), zscore, 
                .names = '{.col}_zscore')) |> 
  select(ends_with('zscore'))
# A tibble: 6 × 3
  quiz1_zscore quiz2_zscore quiz3_zscore
         <dbl>        <dbl>        <dbl>
1       -0.320        1.27        -0.752
2       -1.04         0.116        1.61 
3        0.400        0.809       -0.444
4       -1.16        -1.27         0.889
5        0.880        0.116       -0.547
6        1.24        -1.04        -0.752

Supplying a list of functions

mean_var <- list(
  mean = \(x) mean(x, na.rm = TRUE),
  var = \(x) var(x, na.rm = TRUE)
)

gradebook |> 
  summarize(across(starts_with('midterm'), mean_var, 
                   .names = '{.col}_{.fn}'))
# A tibble: 1 × 4
  midterm1_mean midterm1_var midterm2_mean midterm2_var
          <dbl>        <dbl>         <dbl>        <dbl>
1          76.8         53.8          80.8         95.0

💪 Exercise C - (10 min)

  1. Create a table of sample standard deviations for each of the quizzes in gradebook, where the columns are named according to [COLUMN NAME]_sd.
  2. Read the help file for the function n_distinct() in dplyr. Use this function to count up the number of distinct values in each column of starwars that contains character data. Name your results according to n_[COLUMN NAME]s.
  3. Read the help file for the dplyr function n(). Combine it with across() and other dplyr functions you have learned to display the following table. Each row should correspond to a homeworld that occurs at least twice in the starwars tibble. There should be three columns, counting up the number of distinct values of sex, species, and eye_color. What happens to the observations for which homeworld is missing?
  4. For each species with at least two observations, calculate the sample median of all the numeric columns in starwars, dropping any missing observations. Why do we obtain the result that we do for members of the “Kaminoan” species?
  5. Calculate the std. dev. and interquartile range of all numeric columns of starwars, dropping missing observations. Attach meaningful names to your results.

Example: STAR Project

  • STAR: “Student-Teacher Achievement Ratio”
  • Four-year longitudinal study (panel data)
  • 11,601 Students in Tennessee from 1985-1989
  • Randomly assign students to classes of different sizes.
  • Causal effect of class size in early grades?
  • Mosteller (1997)

STAR Dataset

Name Description
race Student’s race
classtype Type of kindergarten class
g4math Math test score in 4th grade
g4reading Reading test score in 4th grade
yearssmall Number of years in small classes
hsgrad High school graduation (did graduate = 1)

STAR Dataset

library(tidyverse)
star <- read_csv('https://ditraglia.com/data/STAR.csv')
star
# A tibble: 6,325 × 6
    race classtype yearssmall hsgrad g4math g4reading
   <dbl>     <dbl>      <dbl>  <dbl>  <dbl>     <dbl>
 1     1         3          0     NA     NA        NA
 2     2         3          0     NA    706       661
 3     1         3          0      1    711       750
 4     2         1          4     NA    672       659
 5     1         2          0     NA     NA        NA
 6     1         3          0     NA     NA        NA
 7     1         1          4     NA    668       657
 8     1         3          0     NA     NA        NA
 9     1         1          4      1    709       725
10     1         2          0      1    698       692
# ℹ 6,315 more rows
  • How are we supposed to tell what these values mean?!
  • Recode to make the dataset self-documenting

Variables to Recode

Name Description
race White = 1, Black = 2, Asian = 3, Hispanic = 4, Native American = 5, Others = 6
classtype Small = 1, Regular = 2, Regular with Aid = 3
hsgrad Did graduate = 1, Did not graduate = 0

Recoding Variables

star <- star |> 
  mutate(classtype = case_match(classtype,
                                1 ~ 'small',
                                2 ~ 'regular',
                                3 ~ 'regular+aid'))
star
# A tibble: 6,325 × 6
    race classtype   yearssmall hsgrad g4math g4reading
   <dbl> <chr>            <dbl>  <dbl>  <dbl>     <dbl>
 1     1 regular+aid          0     NA     NA        NA
 2     2 regular+aid          0     NA    706       661
 3     1 regular+aid          0      1    711       750
 4     2 small                4     NA    672       659
 5     1 regular              0     NA     NA        NA
 6     1 regular+aid          0     NA     NA        NA
 7     1 small                4     NA    668       657
 8     1 regular+aid          0     NA     NA        NA
 9     1 small                4      1    709       725
10     1 regular              0      1    698       692
# ℹ 6,315 more rows

💪 Exercise D - (3 min)

Recode the race and hsgrad variables from star as indicated above.

Making Tables

Methods for Producing a Table

Ranked from worst to best:

  1. Take a screenshot of STATA output and stretch it so the text is distorted, grainy, and totally unreadable.
  2. Copy-and-paste individual results into Excel table, edit manually, and paste it into a Word document.
  3. Automatically export regression results into Excel, manually edit the table, and paste into a Word document.
  4. Use reproducible tools like knitr::kable, gt and modelsummary to generate and format tables, and include them in any document format.

Publication Quality Tables

datasummary_skim()

  • Just pass it a tibble to get a simple table of summary statistics!
  • If you don’t want the histograms set histogram = FALSE
library(modelsummary)
datasummary_skim(star)
Unique (#) Missing (%) Mean SD Min Median Max
yearssmall 5 0 1.0 1.5 0.0 0.0 4.0
g4math 180 62 708.8 43.1 487.0 710.0 821.0
g4reading 190 63 721.2 52.4 528.0 723.0 836.0

Categorical Variables

  • Notice that datasummary_skim() dropped all the categorical variables.
  • This is because standard deviation / mean don’t really make sense for these.
  • If you have string or factor it does the right thing; if you use numbers as the original star it doesn’t know what do to.
datasummary_skim(star, type = 'categorical')

N %
race Asian 14 0.2
Black 2058 32.5
Hispanic 5 0.1
Native American 2 0.0
Other 9 0.1
White 4234 66.9
NA 3 0.0
classtype regular 2194 34.7
regular+aid 2231 35.3
small 1900 30.0
hsgrad graduate 2539 40.1
non-graduate 508 8.0
NA 3278 51.8

datasummary_balance()

  • Compare summary statistics across categories defined by a “grouping” variable
  • Syntax: datasummary_balance( ~ [GROUPING_VAR], data)
star |> 
  select(yearssmall, g4math, g4reading, classtype) |>  
  datasummary_balance(~ classtype, data = _)

datasummary_balance()

regular (N=2194)
regular+aid (N=2231)
small (N=1900)
Mean Std. Dev. Mean Std. Dev. Mean Std. Dev.
yearssmall 0.2 0.7 0.2 0.7 2.7 1.3
g4math 709.5 41.0 707.6 44.7 709.2 43.6
g4reading 719.9 53.2 720.7 52.4 723.4 51.5

Custom Tables with knitr::kable

my_table <- star |> 
  group_by(classtype) |> 
  summarize(across(starts_with('g4'), \(x) mean(x, na.rm = TRUE),
                   .names = '{.col}_avg')) 
my_table
# A tibble: 3 × 3
  classtype   g4math_avg g4reading_avg
  <chr>            <dbl>         <dbl>
1 regular           710.          720.
2 regular+aid       708.          721.
3 small             709.          723.
my_table |> 
  knitr::kable(digits = 1, caption = 'Average grade 4 test scores',
               col.names = c('Class Type', 'Math', 'Reading'))
Average grade 4 test scores
Class Type Math Reading
regular 709.5 719.9
regular+aid 707.6 720.7
small 709.2 723.4