[1] "hello world"
Things you never worry about until they break.
University of Oxford
R Markdown / Quarto
---
title: "An Example Document"
---
## This is a Slide
- Markdown is fun! You can write in *italics* or even in **boldface**!
- Including [links](https://ditraglia.com/erm) is easy too.
$$
\int_{-\infty}^\infty f(x)\, dx = 1
$$
Look it's some R code!
```{r}
message <- 'hello world'
print(message)
```
\[ \int_{-\infty}^\infty f(x)\, dx = 1 \]
Look it’s some R code!
Create a new R Markdown or Quarto document: your choice. I suggest .html output. Use this document to take notes and complete exercises in all future lectures!
Warning
As part of this course you will need to learn how to use either R Markdown or Quarto by reading the documentation linked above. This will include learning how to typeset equations using LaTeX syntax. We will help you if you get stuck!
.csv or .tsv, unless you have extremely specialized needs (e.g. GIS).
.dta
Note
Did you know that Stata .dta files aren’t backwards compatible? A .dta file created with a more recent version of Stata cannot be opened with an older version.
The readr package loads .csv, .tsv and related files.
readr is in tidyverse, so you already installed it!
readr functions are superior to similar base R functions.
readr cheat sheet
readr website
tidyverse has two helpful packages for importing the data of damned souls who writhe in eternal torment:
readxl imports .xls and .xlsx files
haven imports SAS, SPSS, and Stata files
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Core tidyverse packages are loaded automatically when you run library(tidyverse).
Packages outside the core, such as haven, have to be loaded individually.
read_csv()
read_dta()
# A tibble: 4,870 × 65
id ad education ofjobs yearsexp honors volunteer military empholes
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 b 1 4 2 6 0 0 0 1
2 b 1 3 3 6 0 1 1 0
3 b 1 4 1 6 0 0 0 0
4 b 1 3 4 6 0 1 0 1
5 b 1 3 3 22 0 0 0 0
6 b 1 4 2 6 1 0 0 0
7 b 1 4 2 5 0 1 0 0
8 b 1 3 4 21 0 1 0 1
9 b 1 4 3 3 0 0 0 0
10 b 1 4 2 6 0 1 0 0
# ℹ 4,860 more rows
# ℹ 56 more variables: occupspecific <dbl>, occupbroad <dbl>,
# workinschool <dbl>, email <dbl>, computerskills <dbl>, specialskills <dbl>,
# firstname <chr>, sex <chr>, race <chr>, h <dbl>, l <dbl>, call <dbl>,
# city <chr>, kind <chr>, adid <dbl>, fracblack <dbl>, fracwhite <dbl>,
# lmedhhinc <dbl>, fracdropout <dbl>, fraccolp <dbl>, linc <dbl>, col <dbl>,
# expminreq <chr>, schoolreq <chr>, eoe <dbl>, parent_sales <dbl>, …
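The tibble above is the kind of object these import functions return. Because the path to a data file depends on your own machine, here is a minimal self-contained sketch that writes a small made-up tibble to temporary .csv and .dta files and reads both back. The tibble `toy` and the temporary file paths are hypothetical and are not the dataset shown above.

```r
library(readr)  # read_csv() and write_csv(); attached by library(tidyverse)
library(haven)  # read_dta() and write_dta(); must be loaded separately

# Hypothetical toy data, for illustration only
toy <- tibble::tibble(id = c('a', 'b', 'c'), education = c(4, 3, 4))

# Temporary file paths, so the example runs on any machine
csv_path <- tempfile(fileext = '.csv')
dta_path <- tempfile(fileext = '.dta')

# Write the toy data in both formats
write_csv(toy, csv_path)
write_dta(toy, dta_path)

# Read each file back in as a tibble
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_dta <- read_dta(dta_path)

from_csv
from_dta
```

With a real file, replace `csv_path` or `dta_path` with the path to the file on your local machine.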
getwd() returns the current working directory
setwd() changes the current working directory
On Windows, write file paths with forward slashes or doubled backslashes:
"C:/Users/username/Documents/data.csv"
"C:\\Users\\username\\Documents\\data.csv"
read_csv(). Note that this will require you to specify the path to this file on your local machine.
final5.dta from the Angrist data archive contains data from the article “Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement” by Angrist & Lavy. Locate and download this file. Then try to load it with read_dta(). You may get an error. If so, see the section “Character encoding” in the associated R help file and follow the instructions given there.
gradebook
library(tidyverse)
set.seed(92815)
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
  name = c('Alice', 'Bob', 'Charlotte', 'Dante',
           'Ethelburga', 'Felix'),
  quiz1 = round(rnorm(6, 65, 15)),
  quiz2 = round(rnorm(6, 88, 5)),
  quiz3 = round(rnorm(6, 75, 10)),
  midterm1 = round(rnorm(6, 75, 10)),
  midterm2 = round(rnorm(6, 80, 8)),
  final = round(rnorm(6, 78, 11)))
gradebook
# A tibble: 6 × 8
student_id name quiz1 quiz2 quiz3 midterm1 midterm2 final
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 192297 Alice 64 96 68 81 90 99
2 291857 Bob 58 91 91 75 75 79
3 500286 Charlotte 70 94 71 81 70 74
4 449192 Dante 57 85 84 83 94 83
5 372152 Ethelburga 74 91 70 63 73 96
6 627561 Felix 77 86 68 78 83 75
tidyselect in dplyr
select() by writing column names in full
tidyselect provides a helpful collection of functions and operators to make selection easier, faster and more flexible.
tidyselect with gradebook
Selecting the quiz columns:
# A tibble: 6 × 3
quiz1 quiz2 quiz3
<dbl> <dbl> <dbl>
1 64 96 68
2 58 91 91
3 70 94 71
4 57 85 84
5 74 91 70
6 77 86 68
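The call that produced this output is not shown, so the following is a sketch under the assumption that it used starts_with(). To keep the example self-contained and deterministic, gradebook is re-entered from the values printed earlier instead of being simulated again.

```r
library(dplyr)

# gradebook re-entered from the printed table (not re-simulated)
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 'Ethelburga', 'Felix'),
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68),
  midterm1 = c(81, 75, 81, 83, 63, 78),
  midterm2 = c(90, 75, 70, 94, 73, 83),
  final = c(99, 79, 74, 83, 96, 75))

# Writing every column name out in full
quizzes_full <- gradebook |> select(quiz1, quiz2, quiz3)

# The same selection using the shared prefix
quizzes <- gradebook |> select(starts_with('quiz'))
quizzes
```

Both calls return the same three columns; the second keeps working unchanged if more quiz columns are added later.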
starts_with(): Select columns based on a common prefix
ends_with(): Select columns based on a common suffix
contains(): Select columns that contain a particular string
num_range(): Select based on both a prefix and a numeric range
# A tibble: 6 × 2
quiz2 quiz3
<dbl> <dbl>
1 96 68
2 91 91
3 94 71
4 85 84
5 91 70
6 86 68
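A sketch of two of these helpers in use, assuming (the original call is not shown) that the output above came from num_range(). The reduced gradebook below contains only the test-score columns, re-entered from the values printed earlier.

```r
library(dplyr)

# Test-score columns of gradebook, re-entered from the printed table
gradebook <- tibble(
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68),
  midterm1 = c(81, 75, 81, 83, 63, 78),
  midterm2 = c(90, 75, 70, 94, 73, 83),
  final = c(99, 79, 74, 83, 96, 75))

# Prefix 'quiz' combined with the numeric range 2 to 3
quiz23 <- gradebook |> select(num_range('quiz', 2:3))

# Any column whose name contains the string 'term'
midterms <- gradebook |> select(contains('term'))

quiz23
midterms
```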
Note
If you know regular expressions, check out the selection helper matches().
where(): Accepts a function as input, applies it to every column of the tibble, and selects the columns for which the function returns TRUE
# A tibble: 6 × 7
student_id quiz1 quiz2 quiz3 midterm1 midterm2 final
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 192297 64 96 68 81 90 99
2 291857 58 91 91 75 75 79
3 500286 70 94 71 81 70 74
4 449192 57 85 84 83 94 83
5 372152 74 91 70 63 73 96
6 627561 77 86 68 78 83 75
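The output above keeps every column except name, which is consistent with selecting the numeric columns. A sketch, assuming the predicate was is.numeric (the original call is not shown), with gradebook re-entered from the printed values:

```r
library(dplyr)

# gradebook re-entered from the printed table (not re-simulated)
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 'Ethelburga', 'Felix'),
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68),
  midterm1 = c(81, 75, 81, 83, 63, 78),
  midterm2 = c(90, 75, 70, 94, 73, 83),
  final = c(99, 79, 74, 83, 96, 75))

# is.numeric is applied to each column; columns returning TRUE are kept
numeric_cols <- gradebook |> select(where(is.numeric))

# The complementary selection: only character columns
character_cols <- gradebook |> select(where(is.character))

numeric_cols
character_cols
```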
Use ends_with() to select the columns quiz2 and midterm2 from gradebook with a minimum of typing.
Use contains() to select the columns whose names contain the abbreviation for “Empirical Research Methods.”
The dplyr package includes a built-in dataset called starwars. Use the glimpse() function to get a quick overview of this dataset, and then read the associated help file before completing the following:
Select the columns of starwars that contain character data.
It’s easy to compute the average on each quiz:
# A tibble: 1 × 3
quiz1_avg quiz2_avg quiz3_avg
<dbl> <dbl> <dbl>
1 66.7 90.5 75.3
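A minimal sketch of the column-by-column approach that yields these averages, with one named argument to summarize() per quiz. Only the quiz columns of gradebook are re-entered here, using the values printed earlier.

```r
library(dplyr)

# Quiz columns of gradebook, re-entered from the printed table
gradebook <- tibble(
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68))

# One line of code per quiz
quiz_avgs <- gradebook |>
  summarize(quiz1_avg = mean(quiz1),
            quiz2_avg = mean(quiz2),
            quiz3_avg = mean(quiz3))
quiz_avgs
```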
But would you really want to type this out eleven times in a course with that many quizzes?!
across()
# A tibble: 1 × 3
quiz1_avg quiz2_avg quiz3_avg
<dbl> <dbl> <dbl>
1 66.7 90.5 75.3
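A sketch of how across() can produce the same table in a single call, assuming the quiz columns are picked with starts_with() (the original call is not shown). Only the quiz columns of gradebook are re-entered here, using the values printed earlier.

```r
library(dplyr)

# Quiz columns of gradebook, re-entered from the printed table
gradebook <- tibble(
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68))

# Apply mean() to every column whose name starts with 'quiz',
# naming each result '<column name>_avg'
quiz_avgs <- gradebook |>
  summarize(across(starts_with('quiz'), mean, .names = '{.col}_avg'))
quiz_avgs
```

The call stays the same length whether there are three quizzes or eleven.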
.cols specifies columns to work with
c('col1name', 'col2name')
tidyselect
.fns specifies function(s) to apply
.names sets the rule for naming transformed columns, using syntax from the glue package
zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
gradebook |>
  mutate(across(starts_with('quiz'), zscore,
                .names = '{.col}_zscore')) |>
  select(ends_with('zscore'))
# A tibble: 6 × 3
quiz1_zscore quiz2_zscore quiz3_zscore
<dbl> <dbl> <dbl>
1 -0.320 1.27 -0.752
2 -1.04 0.116 1.61
3 0.400 0.809 -0.444
4 -1.16 -1.27 0.889
5 0.880 0.116 -0.547
6 1.24 -1.04 -0.752
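The .fns argument also accepts a named list of functions, in which case the glue placeholder {.fn} in .names stands for each function's name in that list. A small sketch, using only the midterm columns of gradebook re-entered from the values printed earlier; the choice of min and max is just for illustration.

```r
library(dplyr)

# Midterm columns of gradebook, re-entered from the printed table
gradebook <- tibble(
  midterm1 = c(81, 75, 81, 83, 63, 78),
  midterm2 = c(90, 75, 70, 94, 73, 83))

# Two functions per column; results are named '<column>_<function name>'
midterm_range <- gradebook |>
  summarize(across(starts_with('midterm'),
                   list(min = min, max = max),
                   .names = '{.col}_{.fn}'))
midterm_range
```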
gradebook, where the columns are named according to [COLUMN NAME]_sd.
n_distinct() in dplyr. Use this function to count up the number of distinct values in each column of starwars that contains character data. Name your results according to n_[COLUMN NAME]s.
dplyr function n(). Combine it with across() and other dplyr functions you have learned to display the following table. Each row should correspond to a homeworld that occurs at least twice in the starwars tibble. There should be three columns, counting up the number of distinct values of sex, species, and eye_color. What happens to the observations for which homeworld is missing?
starwars, dropping any missing observations. Why do we obtain the result that we do for members of the “Kaminoan” species?
starwars, dropping missing observations. Attach meaningful names to your results.
| Name | Description |
|---|---|
| race | Student’s race |
| classtype | Type of kindergarten class |
| g4math | Math test score in 4th grade |
| g4reading | Reading test score in 4th grade |
| yearssmall | Number of years in small classes |
| hsgrad | High school graduation (did graduate = 1) |
# A tibble: 6,325 × 6
race classtype yearssmall hsgrad g4math g4reading
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 0 NA NA NA
2 2 3 0 NA 706 661
3 1 3 0 1 711 750
4 2 1 4 NA 672 659
5 1 2 0 NA NA NA
6 1 3 0 NA NA NA
7 1 1 4 NA 668 657
8 1 3 0 NA NA NA
9 1 1 4 1 709 725
10 1 2 0 1 698 692
# ℹ 6,315 more rows
| Name | Description |
|---|---|
| race | White = 1, Black = 2, Asian = 3, Hispanic = 4, Native American = 5, Others = 6 |
| classtype | Small = 1, Regular = 2, Regular with Aid = 3 |
| hsgrad | Did graduate = 1, Did not graduate = 0 |
case_match() from dplyr to recode values
star <- star |>
  mutate(classtype = case_match(classtype,
                                1 ~ 'small',
                                2 ~ 'regular',
                                3 ~ 'regular+aid'))
star
# A tibble: 6,325 × 6
race classtype yearssmall hsgrad g4math g4reading
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 regular+aid 0 NA NA NA
2 2 regular+aid 0 NA 706 661
3 1 regular+aid 0 1 711 750
4 2 small 4 NA 672 659
5 1 regular 0 NA NA NA
6 1 regular+aid 0 NA NA NA
7 1 small 4 NA 668 657
8 1 regular+aid 0 NA NA NA
9 1 small 4 1 709 725
10 1 regular 0 1 698 692
# ℹ 6,315 more rows
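One detail worth knowing when recoding: values that match none of the listed cases become NA unless a .default is supplied. A minimal sketch on a made-up vector of codes; the vector `codes` and the label 'unknown' are hypothetical and not part of the star data.

```r
library(dplyr)  # case_match() is available from dplyr 1.1.0 onwards

# Hypothetical codes; 4 is deliberately left out of the recoding rules
codes <- c(1, 2, 3, 4)

# Unmatched values become NA
labels_na <- case_match(codes,
                        1 ~ 'small',
                        2 ~ 'regular',
                        3 ~ 'regular+aid')

# Unmatched values take the .default label instead
labels_default <- case_match(codes,
                             1 ~ 'small',
                             2 ~ 'regular',
                             3 ~ 'regular+aid',
                             .default = 'unknown')

labels_na
labels_default
```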
Recode the race and hsgrad variables from star as indicated above.
Ranked from worst to best:
knitr::kable, gt and modelsummary to generate and format tables, and include them in any document format.
datasummary and knitr::kable
modelsummary
stargazer. You can use this if you prefer.
gt.
datasummary_skim()
histogram = FALSE
datasummary_skim() dropped all the categorical variables.
star it doesn’t know what to do.
| | | N | % |
|---|---|---|---|
| race | Asian | 14 | 0.2 |
| | Black | 2058 | 32.5 |
| | Hispanic | 5 | 0.1 |
| | Native American | 2 | 0.0 |
| | Other | 9 | 0.1 |
| | White | 4234 | 66.9 |
| | NA | 3 | 0.0 |
| classtype | regular | 2194 | 34.7 |
| | regular+aid | 2231 | 35.3 |
| | small | 1900 | 30.0 |
| hsgrad | graduate | 2539 | 40.1 |
| | non-graduate | 508 | 8.0 |
| | NA | 3278 | 51.8 |
datasummary_balance()
datasummary_balance( ~ [GROUPING_VAR], data)
| | regular (N=2194) | | regular+aid (N=2231) | | small (N=1900) | |
|---|---|---|---|---|---|---|
| | Mean | Std. Dev. | Mean | Std. Dev. | Mean | Std. Dev. |
| yearssmall | 0.2 | 0.7 | 0.2 | 0.7 | 2.7 | 1.3 |
| g4math | 709.5 | 41.0 | 707.6 | 44.7 | 709.2 | 43.6 |
| g4reading | 719.9 | 53.2 | 720.7 | 52.4 | 723.4 | 51.5 |
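A rough sketch of the two modelsummary calls behind tables like these, run on a small made-up data frame instead of star. The data frame `toy` and its values are hypothetical, and `output = 'data.frame'` is used here only so the results print as ordinary data frames; the slides' own calls and options are not shown, so treat this as one plausible usage and not the original code.

```r
library(modelsummary)

# Hypothetical toy data: a grouping factor and two numeric scores
toy <- data.frame(
  classtype = factor(rep(c('small', 'regular', 'regular+aid'), each = 4)),
  g4math = c(700, 712, 690, 705, 698, 720, 701, 707, 715, 695, 703, 710),
  g4reading = c(720, 731, 705, 718, 709, 740, 722, 715, 728, 711, 719, 725))

# Counts and percentages for the categorical variable(s)
skim_tbl <- datasummary_skim(toy, type = 'categorical',
                             output = 'data.frame')

# Mean and standard deviation of each numeric variable by group
balance_tbl <- datasummary_balance(~ classtype, data = toy,
                                   output = 'data.frame')

skim_tbl
balance_tbl
```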
knitr::kable
my_table <- star |>
  group_by(classtype) |>
  summarize(across(starts_with('g4'), \(x) mean(x, na.rm = TRUE),
                   .names = '{.col}_avg'))
my_table
# A tibble: 3 × 3
classtype g4math_avg g4reading_avg
<chr> <dbl> <dbl>
1 regular 710. 720.
2 regular+aid 708. 721.
3 small 709. 723.
my_table |>
  knitr::kable(digits = 1, caption = 'Average grade 4 test scores',
               col.names = c('Class Type', 'Math', 'Reading'))
| Class Type | Math | Reading |
|---|---|---|
| regular | 709.5 | 719.9 |
| regular+aid | 707.6 | 720.7 |
| small | 709.2 | 723.4 |