library(tidyverse)
library(haven)
# Part 1
<- read_csv('https://ditraglia.com/data/STAR.csv')
star
# Part 2
<- read_dta('https://ditraglia.com/data/final5.dta',
final5 encoding = 'latin1')
Research Plumbing I - Solutions
Exercise A - (8 min)
- Download https://ditraglia.com/data/STAR.csv and save this file on your local machine. Then load it with
read_csv()
. Note that this will require you to specify the path to this file on your local machine. - The file
final5.dta
from the Angrist data archive contains data from the article “Using Maimonides Rule to estimate the Effect of Class Size on Student Achievement” by Angrist & Lavy. Locate and download this file. Then try to load it withread_dta()
. You will get an error. Consult the section “Character encoding” in the associated R help file and follow the instructions given there.
Solution
I can’t show a solution with the path on your personallocal machine, so I’ll read these datasets directly from the internet instead:
Exercise B - (8 min)
- Use
ends_with()
to select the columnsquiz2
andmidterm2
fromgradebook
with a minimum of typing. - Use
contains()
to select the columns whose names contain the abbreviation for “Empirical Research Methods.” - May the 4th be with you (belatedly)! The
dplyr
package includes a built-in dataset calledstarwars
. Use theglimpse()
function to get a quick overview of this dataset, and then read the associated help file before completing the following:- Select only the columns of
starwars
that contain character data. - Select only the columns whose names contain an underscore.
- Select only the columns that are either numeric or whose names end with “color.”
- Select only the columns of
Solution
set.seed(92815)
<- tibble(
gradebook student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
name = c('Alice', 'Bob', 'Charlotte', 'Dante',
'Ethelburga', 'Felix'),
quiz1 = round(rnorm(6, 65, 15)),
quiz2 = round(rnorm(6, 88, 5)),
quiz3 = round(rnorm(6, 75, 10)),
midterm1 = round(rnorm(6, 75, 10)),
midterm2 = round(rnorm(6, 80, 8)),
final = round(rnorm(6, 78, 11)))
# Part 1
|>
gradebook select(ends_with('2'))
# A tibble: 6 × 2
quiz2 midterm2
<dbl> <dbl>
1 96 90
2 91 75
3 94 70
4 85 94
5 91 73
6 86 83
# Part 2
|>
gradebook select(contains('erm'))
# A tibble: 6 × 2
midterm1 midterm2
<dbl> <dbl>
1 81 90
2 75 75
3 81 70
4 83 94
5 63 73
6 78 83
# Part 3a
|>
starwars select(where(is.character))
# A tibble: 87 × 8
name hair_color skin_color eye_color sex gender homeworld species
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Luke Skywalker blond fair blue male mascu… Tatooine Human
2 C-3PO <NA> gold yellow none mascu… Tatooine Droid
3 R2-D2 <NA> white, bl… red none mascu… Naboo Droid
4 Darth Vader none white yellow male mascu… Tatooine Human
5 Leia Organa brown light brown fema… femin… Alderaan Human
6 Owen Lars brown, gr… light blue male mascu… Tatooine Human
7 Beru Whitesun… brown light blue fema… femin… Tatooine Human
8 R5-D4 <NA> white, red red none mascu… Tatooine Droid
9 Biggs Darklig… black light brown male mascu… Tatooine Human
10 Obi-Wan Kenobi auburn, w… fair blue-gray male mascu… Stewjon Human
# ℹ 77 more rows
# Part 3b
|>
starwars select(contains('_'))
# A tibble: 87 × 4
hair_color skin_color eye_color birth_year
<chr> <chr> <chr> <dbl>
1 blond fair blue 19
2 <NA> gold yellow 112
3 <NA> white, blue red 33
4 none white yellow 41.9
5 brown light brown 19
6 brown, grey light blue 52
7 brown light blue 47
8 <NA> white, red red NA
9 black light brown 24
10 auburn, white fair blue-gray 57
# ℹ 77 more rows
# Part 3c
|>
starwars select(ends_with('color') | where(is.numeric))
# A tibble: 87 × 6
hair_color skin_color eye_color height mass birth_year
<chr> <chr> <chr> <int> <dbl> <dbl>
1 blond fair blue 172 77 19
2 <NA> gold yellow 167 75 112
3 <NA> white, blue red 96 32 33
4 none white yellow 202 136 41.9
5 brown light brown 150 49 19
6 brown, grey light blue 178 120 52
7 brown light blue 165 75 47
8 <NA> white, red red 97 32 NA
9 black light brown 183 84 24
10 auburn, white fair blue-gray 182 77 57
# ℹ 77 more rows
Exercise C - (10 min)
- Create a table of sample standard deviations for each of the quizzes in
gradebook
, where the columns are named according to[COLUMN NAME]_sd
. - Read the help file for the function
n_distinct()
indplyr
. Use this function to count up the number of distinct values in each column ofstarwars
that contains character data. Name your results according ton_[COLUMN NAME]s
. - Read the help file for the
dplyr
functionn()
. Combine it with across() and other dplyr functions you have learned to display the following table. Each row should correspond to ahomeworld
that occurs at least twice in thestarwars
tibble. There should be three columns, counting up the number of distinct values ofsex
,species
, andeye_color
. What happens to the observations for whichhomeworld
is missing? - For each species with at least two observations, calculate the sample median of all the numeric columns in
starwars
, dropping any missing observations. Why do we obtain the result that we do for members of the “Kaminoan” species? - Calculate the std. dev. and interquartile range of all numeric columns of
starwars
, dropping missing observations. Attach meaningful names to your results.
Solution
# Part 1
|>
gradebook summarize(across(starts_with('quiz'), sd, .names = '{.col}_sd'))
# A tibble: 1 × 3
quiz1_sd quiz2_sd quiz3_sd
<dbl> <dbl> <dbl>
1 8.33 4.32 9.75
# Part 2
|>
starwars summarize(across(where(is.character), n_distinct, .names = 'n_{.col}s'))
# A tibble: 1 × 8
n_names n_hair_colors n_skin_colors n_eye_colors n_sexs n_genders n_homeworlds
<int> <int> <int> <int> <int> <int> <int>
1 87 13 31 15 5 3 49
# ℹ 1 more variable: n_speciess <int>
# Part 3
|>
starwars group_by(homeworld) |>
filter(n() > 1) |>
summarize(across(c(sex, species, eye_color), n_distinct))
# A tibble: 10 × 4
homeworld sex species eye_color
<chr> <int> <int> <int>
1 Alderaan 2 1 1
2 Corellia 1 1 2
3 Coruscant 2 2 1
4 Kamino 2 2 2
5 Kashyyyk 1 1 1
6 Mirial 1 1 1
7 Naboo 4 4 5
8 Ryloth 2 1 2
9 Tatooine 3 2 4
10 <NA> 4 4 8
|>
starwars group_by(homeworld) |>
filter(n() > 1) |>
summarize(across(c(sex, species, eye_color), n_distinct))
# A tibble: 10 × 4
homeworld sex species eye_color
<chr> <int> <int> <int>
1 Alderaan 2 1 1
2 Corellia 1 1 2
3 Coruscant 2 2 1
4 Kamino 2 2 2
5 Kashyyyk 1 1 1
6 Mirial 1 1 1
7 Naboo 4 4 5
8 Ryloth 2 1 2
9 Tatooine 3 2 4
10 <NA> 4 4 8
# Part 4
|>
starwars group_by(species) |>
filter(n() > 1) |>
summarize(across(where(is.numeric), \(x) median(x, na.rm = TRUE)))
# A tibble: 9 × 4
species height mass birth_year
<chr> <dbl> <dbl> <dbl>
1 Droid 97 53.5 33
2 Gungan 206 74 52
3 Human 180 79 48
4 Kaminoan 221 88 NA
5 Mirialan 168 53.1 49
6 Twi'lek 179 55 48
7 Wookiee 231 124 200
8 Zabrak 173 80 54
9 <NA> 183 48 62
|>
starwars filter(species == 'Kaminoan')
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Lama Su 229 88 none grey black NA male mascul…
2 Taun We 213 NA none grey black NA female femini…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# Part 5
<- list(
SD_IQR SD = \(x) sd(x, na.rm = TRUE),
IQR = \(x) IQR(x, na.rm = TRUE)
)|>
starwars summarize(across(where(is.numeric), SD_IQR, .names = '{.col}_{.fn}'))
# A tibble: 1 × 6
height_SD height_IQR mass_SD mass_IQR birth_year_SD birth_year_IQR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 34.8 24 169. 28.9 155. 37
Exercise D - (3 min)
Recode the race
and hsgrad
variables from star
as indicated above.
Solution
<- star |>
star mutate(classtype = case_match(classtype,
1 ~ 'small',
2 ~ 'regular',
3 ~ 'regular+aid'),
race = case_match(race,
1 ~ 'White',
2 ~ 'Black',
3 ~ 'Asian',
4 ~ 'Hispanic',
5 ~ 'Native American',
6 ~ 'Other'),
hsgrad = if_else(hsgrad == 1, 'graduate', 'non-graduate'))