Things you never worry about until they break.
University of Oxford
R Markdown / Quarto
---
title: "An Example Document"
---
## This is a Slide
- Markdown is fun! You can write in *italics* or even in **boldface**!
- Including [links](https://ditraglia.com/erm) is easy too.
$$
\int_{-\infty}^\infty f(x)\, dx = 1
$$
Look it's some R code!
```{r}
message <- 'hello world'
print(message)
```
\[ \int_{-\infty}^\infty f(x)\, dx = 1 \]
Look it’s some R code!
Create a new R Markdown or Quarto document: your choice. I suggest `.html` output. Use this document to take notes and complete exercises in all future lectures!
Warning
As part of this course you will need to learn how to use either R Markdown or Quarto by reading the documentation linked above. This will include learning how to typeset equations using LaTeX syntax. We will help you if you get stuck!
- `.csv` or `.tsv`, unless you have extremely specialized needs (e.g. GIS)
- `.dta`
Note
Did you know that Stata `.dta` files aren’t backwards compatible? A `.dta` file created with a more recent version of Stata literally cannot be opened with an older version.
- The `readr` package loads `.csv`, `.tsv` and related files.
- `readr` is in `tidyverse` so you already installed it!
- `readr` functions are superior to similar base R functions
- `readr` cheat sheet
- `readr` website
- `tidyverse` has two helpful packages for importing the data of damned souls who writhe in eternal torment:
  - `readxl` imports `.xls` and `.xlsx` files
  - `haven` imports SAS, SPSS, and Stata files
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
- Core `tidyverse` packages are loaded automatically when you run `library(tidyverse)`.
- Others, such as `haven`, have to be loaded individually.
`read_csv()`
`read_dta()`
# A tibble: 4,870 × 65
id ad education ofjobs yearsexp honors volunteer military empholes
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 b 1 4 2 6 0 0 0 1
2 b 1 3 3 6 0 1 1 0
3 b 1 4 1 6 0 0 0 0
4 b 1 3 4 6 0 1 0 1
5 b 1 3 3 22 0 0 0 0
6 b 1 4 2 6 1 0 0 0
7 b 1 4 2 5 0 1 0 0
8 b 1 3 4 21 0 1 0 1
9 b 1 4 3 3 0 0 0 0
10 b 1 4 2 6 0 1 0 0
# ℹ 4,860 more rows
# ℹ 56 more variables: occupspecific <dbl>, occupbroad <dbl>,
# workinschool <dbl>, email <dbl>, computerskills <dbl>, specialskills <dbl>,
# firstname <chr>, sex <chr>, race <chr>, h <dbl>, l <dbl>, call <dbl>,
# city <chr>, kind <chr>, adid <dbl>, fracblack <dbl>, fracwhite <dbl>,
# lmedhhinc <dbl>, fracdropout <dbl>, fraccolp <dbl>, linc <dbl>, col <dbl>,
# expminreq <chr>, schoolreq <chr>, eoe <dbl>, parent_sales <dbl>, …
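The import step itself can be sketched without any downloaded file. The example below is only an illustration: the toy data frame, the object names and the temporary file paths are made up, and the only real functions used are `write_csv()` / `read_csv()` from `readr` and `write_dta()` / `read_dta()` from `haven`.

```r
library(readr)
library(haven)  # not attached by library(tidyverse), so load it explicitly

# Hypothetical toy data, written to temporary files so the example is self-contained
toy <- data.frame(id = c('a', 'b', 'c'), score = c(1.5, 2, 3.25))
csv_path <- tempfile(fileext = '.csv')
dta_path <- tempfile(fileext = '.dta')
write_csv(toy, csv_path)
write_dta(toy, dta_path)

# read_csv() guesses column types; show_col_types = FALSE silences the type message
from_csv <- read_csv(csv_path, show_col_types = FALSE)

# read_dta() reads a Stata file and also returns a tibble
from_dta <- read_dta(dta_path)

from_csv
from_dta
```

With a real file, replace the temporary paths with the path to the file on your own machine.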
- `getwd()` returns the current working directory
- `setwd()` changes the current working directory
- On Windows, write file paths with forward slashes or with doubled backslashes:
  - `"C:/Users/username/Documents/data.csv"`
  - `"C:\\Users\\username\\Documents\\data.csv"`
- `read_csv()`. Note that this will require you to specify the path to this file on your local machine.
- The file `final5.dta` from the Angrist data archive contains data from the article “Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement” by Angrist & Lavy. Locate and download this file. Then try to load it with `read_dta()`. You may get an error. If so, see the section “Character encoding” in the associated R help file and follow the instructions given there.
`gradebook`
library(tidyverse)
set.seed(92815)
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
  name = c('Alice', 'Bob', 'Charlotte', 'Dante',
           'Ethelburga', 'Felix'),
  quiz1 = round(rnorm(6, 65, 15)),
  quiz2 = round(rnorm(6, 88, 5)),
  quiz3 = round(rnorm(6, 75, 10)),
  midterm1 = round(rnorm(6, 75, 10)),
  midterm2 = round(rnorm(6, 80, 8)),
  final = round(rnorm(6, 78, 11)))
gradebook
# A tibble: 6 × 8
student_id name quiz1 quiz2 quiz3 midterm1 midterm2 final
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 192297 Alice 64 96 68 81 90 99
2 291857 Bob 58 91 91 75 75 79
3 500286 Charlotte 70 94 71 81 70 74
4 449192 Dante 57 85 84 83 94 83
5 372152 Ethelburga 74 91 70 63 73 96
6 627561 Felix 77 86 68 78 83 75
`tidyselect` in `dplyr`
- `select()` by writing column names in full
- `tidyselect` provides a helpful collection of functions and operators to make selection easier, faster and more flexible.
`tidyselect` with `gradebook`
Selecting the quiz columns:
# A tibble: 6 × 3
quiz1 quiz2 quiz3
<dbl> <dbl> <dbl>
1 64 96 68
2 58 91 91
3 70 94 71
4 57 85 84
5 74 91 70
6 77 86 68
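The call that produced this output is not shown. As a sketch, here are two ways to obtain the same three columns. It assumes `gradebook` holds the values printed earlier, which are entered by hand here so the block runs on its own; the names `by_name` and `by_range` are illustrative.

```r
library(dplyr)

# Values copied from the printed gradebook tibble
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 'Ethelburga', 'Felix'),
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68),
  midterm1 = c(81, 75, 81, 83, 63, 78),
  midterm2 = c(90, 75, 70, 94, 73, 83),
  final = c(99, 79, 74, 83, 96, 75))

# Writing every column name out in full
by_name <- gradebook |> select(quiz1, quiz2, quiz3)

# The `:` operator selects a contiguous range of columns
by_range <- gradebook |> select(quiz1:quiz3)

by_range
```

Both calls return the same tibble. The range form saves typing but depends on the column order.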
- `starts_with()`: select columns based on a common prefix
- `ends_with()`: select columns based on a common suffix
- `contains()`: select columns that contain a particular string
- `num_range()`: select based on both a prefix and a numeric range
# A tibble: 6 × 2
quiz2 quiz3
<dbl> <dbl>
1 96 68
2 91 91
3 94 71
4 85 84
5 91 70
6 86 68
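The two-column output above is consistent with a call such as `num_range('quiz', 2:3)`, although the original call is not shown. The sketch below tries each of the four helpers on a hand-entered copy of `gradebook`. The particular prefixes, suffixes and strings are illustrative choices, not taken from the slides.

```r
library(dplyr)

# Values copied from the printed gradebook tibble
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 'Ethelburga', 'Felix'),
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68),
  midterm1 = c(81, 75, 81, 83, 63, 78),
  midterm2 = c(90, 75, 70, 94, 73, 83),
  final = c(99, 79, 74, 83, 96, 75))

# Common prefix: quiz1, quiz2, quiz3
prefix_cols <- gradebook |> select(starts_with('quiz'))

# Common suffix: quiz1, midterm1
suffix_cols <- gradebook |> select(ends_with('1'))

# Substring anywhere in the name: midterm1, midterm2
substring_cols <- gradebook |> select(contains('mid'))

# Prefix plus a numeric range: quiz2, quiz3
range_cols <- gradebook |> select(num_range('quiz', 2:3))

names(range_cols)
```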
Note
If you know regular expressions, check out the selection helper `matches()`.
- `where()`: accepts a function as input, applies it to every column of the tibble, and selects the columns for which the function returns `TRUE`
# A tibble: 6 × 7
student_id quiz1 quiz2 quiz3 midterm1 midterm2 final
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 192297 64 96 68 81 90 99
2 291857 58 91 91 75 75 79
3 500286 70 94 71 81 70 74
4 449192 57 85 84 83 94 83
5 372152 74 91 70 63 73 96
6 627561 77 86 68 78 83 75
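This output keeps every column except `name`, which is what `where(is.numeric)` would return; the original call is not shown, so treat that as an assumption. A minimal sketch on a hand-entered copy of `gradebook`, with illustrative object names:

```r
library(dplyr)

# Values copied from the printed gradebook tibble
gradebook <- tibble(
  student_id = c(192297, 291857, 500286, 449192, 372152, 627561),
  name = c('Alice', 'Bob', 'Charlotte', 'Dante', 'Ethelburga', 'Felix'),
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68),
  midterm1 = c(81, 75, 81, 83, 63, 78),
  midterm2 = c(90, 75, 70, 94, 73, 83),
  final = c(99, 79, 74, 83, 96, 75))

# is.numeric() is called once per column; columns where it returns TRUE are kept
numeric_cols <- gradebook |> select(where(is.numeric))

# The same pattern with a different predicate keeps only the character column
character_cols <- gradebook |> select(where(is.character))

names(numeric_cols)
names(character_cols)
```

Note that the function is passed without parentheses: `where(is.numeric)`, not `where(is.numeric())`.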
- Use `ends_with()` to select the columns `quiz2` and `midterm2` from `gradebook` with a minimum of typing.
- Use `contains()` to select the columns whose names contain the abbreviation for “Empirical Research Methods.”
- The `dplyr` package includes a built-in dataset called `starwars`. Use the `glimpse()` function to get a quick overview of this dataset, and then read the associated help file before completing the following:
  - Select the columns of `starwars` that contain character data.
It’s easy to compute the average on each quiz:
# A tibble: 1 × 3
quiz1_avg quiz2_avg quiz3_avg
<dbl> <dbl> <dbl>
1 66.7 90.5 75.3
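The code behind this table is not shown. A plausible version, typed out one column at a time, is sketched below. It assumes the quiz scores printed in `gradebook` earlier; only the three quiz columns are entered here to keep the block short.

```r
library(dplyr)

# Quiz scores copied from the printed gradebook tibble
gradebook <- tibble(
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68))

# One named argument per quiz: fine for three columns, tedious for many
quiz_avgs <- gradebook |>
  summarize(quiz1_avg = mean(quiz1),
            quiz2_avg = mean(quiz2),
            quiz3_avg = mean(quiz3))

quiz_avgs
```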
But would you really want to type this out eleven times in a course with that many quizzes?!
across()
# A tibble: 1 × 3
quiz1_avg quiz2_avg quiz3_avg
<dbl> <dbl> <dbl>
1 66.7 90.5 75.3
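The same table can be produced with a single `across()` call. This is a sketch under the same assumption as before (quiz scores copied from the printed `gradebook`), not necessarily the exact code from the slides.

```r
library(dplyr)

# Quiz scores copied from the printed gradebook tibble
gradebook <- tibble(
  quiz1 = c(64, 58, 70, 57, 74, 77),
  quiz2 = c(96, 91, 94, 85, 91, 86),
  quiz3 = c(68, 91, 71, 84, 70, 68))

# First argument: which columns; second: the function to apply;
# .names: a glue-style template in which {.col} is the original column name
quiz_avgs <- gradebook |>
  summarize(across(starts_with('quiz'), mean, .names = '{.col}_avg'))

quiz_avgs
```

The call stays the same length however many quiz columns there are.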
- `.cols` specifies columns to work with
  - `c('col1name', 'col2name')`
  - `tidyselect`
- `.fns` specifies function(s) to apply
- `.names` sets rule for naming transformed columns, using syntax from `glue` package
zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
gradebook |>
  mutate(across(starts_with('quiz'), zscore,
                .names = '{.col}_zscore')) |>
  select(ends_with('zscore'))
# A tibble: 6 × 3
quiz1_zscore quiz2_zscore quiz3_zscore
<dbl> <dbl> <dbl>
1 -0.320 1.27 -0.752
2 -1.04 0.116 1.61
3 0.400 0.809 -0.444
4 -1.16 -1.27 0.889
5 0.880 0.116 -0.547
6 1.24 -1.04 -0.752
- `gradebook`, where the columns are named according to `[COLUMN NAME]_sd`.
- Read the help file for `n_distinct()` in `dplyr`. Use this function to count up the number of distinct values in each column of `starwars` that contains character data. Name your results according to `n_[COLUMN NAME]s`.
- Read the help file for the `dplyr` function `n()`. Combine it with `across()` and other `dplyr` functions you have learned to display the following table. Each row should correspond to a `homeworld` that occurs at least twice in the `starwars` tibble. There should be three columns, counting up the number of distinct values of `sex`, `species`, and `eye_color`. What happens to the observations for which `homeworld` is missing?
- `starwars`, dropping any missing observations. Why do we obtain the result that we do for members of the “Kaminoan” species?
- `starwars`, dropping missing observations. Attach meaningful names to your results.

Name | Description |
---|---|
`race` | Student’s race |
`classtype` | Type of kindergarten class |
`g4math` | Math test score in 4th grade |
`g4reading` | Reading test score in 4th grade |
`yearssmall` | Number of years in small classes |
`hsgrad` | High school graduation (did graduate = 1) |
# A tibble: 6,325 × 6
race classtype yearssmall hsgrad g4math g4reading
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 0 NA NA NA
2 2 3 0 NA 706 661
3 1 3 0 1 711 750
4 2 1 4 NA 672 659
5 1 2 0 NA NA NA
6 1 3 0 NA NA NA
7 1 1 4 NA 668 657
8 1 3 0 NA NA NA
9 1 1 4 1 709 725
10 1 2 0 1 698 692
# ℹ 6,315 more rows
Name | Description |
---|---|
`race` | White = 1, Black = 2, Asian = 3, Hispanic = 4, Native American = 5, Others = 6 |
`classtype` | Small = 1, Regular = 2, Regular with Aid = 3 |
`hsgrad` | Did graduate = 1, Did not graduate = 0 |
Use `case_match()` from `dplyr` to recode values
star <- star |>
  mutate(classtype = case_match(classtype,
                                1 ~ 'small',
                                2 ~ 'regular',
                                3 ~ 'regular+aid'))
star
# A tibble: 6,325 × 6
race classtype yearssmall hsgrad g4math g4reading
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 regular+aid 0 NA NA NA
2 2 regular+aid 0 NA 706 661
3 1 regular+aid 0 1 711 750
4 2 small 4 NA 672 659
5 1 regular 0 NA NA NA
6 1 regular+aid 0 NA NA NA
7 1 small 4 NA 668 657
8 1 regular+aid 0 NA NA NA
9 1 small 4 1 709 725
10 1 regular 0 1 698 692
# ℹ 6,315 more rows
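One detail worth seeing on a small example is what `case_match()` does with values that match none of the listed codes. The sketch below uses a made-up `survey` tibble with hypothetical codes and labels; it assumes `dplyr` 1.1.0 or later, where `case_match()` was introduced.

```r
library(dplyr)

# Hypothetical coded responses, including a missing value and an unexpected code
survey <- tibble(response = c(1, 3, 2, NA, 9))

survey <- survey |>
  mutate(
    # Unmatched values (here NA and 9) become NA when no default is given
    label = case_match(response,
                       1 ~ 'low',
                       2 ~ 'medium',
                       3 ~ 'high'),
    # NA can be matched explicitly, and .default catches everything else
    label_full = case_match(response,
                            1 ~ 'low',
                            2 ~ 'medium',
                            3 ~ 'high',
                            NA ~ 'missing',
                            .default = 'unknown code'))

survey
```

This is why missing values in a recoded column stay `NA` unless you handle them yourself.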
Recode the `race` and `hsgrad` variables from `star` as indicated above.
Ranked from worst to best:
- Use `knitr::kable`, `gt` and `modelsummary` to generate and format tables, and include them in any document format.
`datasummary` and `knitr::kable`
- `modelsummary`
- `stargazer`. You can use this if you prefer.
- `gt`.
`datasummary_skim()`
- `histogram = FALSE`
- `datasummary_skim()` dropped all the categorical variables.
- `star` it doesn’t know what to do.

| | | N | % |
|---|---|---|---|
| race | Asian | 14 | 0.2 |
| | Black | 2058 | 32.5 |
| | Hispanic | 5 | 0.1 |
| | Native American | 2 | 0.0 |
| | Other | 9 | 0.1 |
| | White | 4234 | 66.9 |
| | NA | 3 | 0.0 |
| classtype | regular | 2194 | 34.7 |
| | regular+aid | 2231 | 35.3 |
| | small | 1900 | 30.0 |
| hsgrad | graduate | 2539 | 40.1 |
| | non-graduate | 508 | 8.0 |
| | NA | 3278 | 51.8 |
datasummary_balance()
datasummary_balance( ~ [GROUPING_VAR], data)
datasummary_balance()
| | regular (N=2194) | | regular+aid (N=2231) | | small (N=1900) | |
|---|---|---|---|---|---|---|
| | Mean | Std. Dev. | Mean | Std. Dev. | Mean | Std. Dev. |
| yearssmall | 0.2 | 0.7 | 0.2 | 0.7 | 2.7 | 1.3 |
| g4math | 709.5 | 41.0 | 707.6 | 44.7 | 709.2 | 43.6 |
| g4reading | 719.9 | 53.2 | 720.7 | 52.4 | 723.4 | 51.5 |
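To try the `datasummary_balance(~ [GROUPING_VAR], data)` pattern without the `star` data, here is a rough sketch on a made-up data frame. The data and object names are hypothetical. Requesting `output = 'data.frame'` is just one of the output formats `modelsummary` offers, chosen here so the result can be inspected as an ordinary R object. Behaviour may differ somewhat across `modelsummary` versions.

```r
library(modelsummary)

# Hypothetical toy data: three groups and two numeric variables
toy <- data.frame(
  group = rep(c('a', 'b', 'c'), each = 4),
  x = c(1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6),
  y = c(10, 12, 11, 13, 20, 22, 21, 23, 30, 32, 31, 33))

# One-sided formula: the variable after ~ defines the groups being compared.
# Every other variable in the data is summarized within each group.
balance <- datasummary_balance(~ group, data = toy, output = 'data.frame')

balance
```

Dropping the `output` argument renders a formatted table instead, which is the form you would include in a document.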
knitr::kable
my_table <- star |>
  group_by(classtype) |>
  summarize(across(starts_with('g4'), \(x) mean(x, na.rm = TRUE),
                   .names = '{.col}_avg'))
my_table
# A tibble: 3 × 3
classtype g4math_avg g4reading_avg
<chr> <dbl> <dbl>
1 regular 710. 720.
2 regular+aid 708. 721.
3 small 709. 723.
my_table |>
  knitr::kable(digits = 1, caption = 'Average grade 4 test scores',
               col.names = c('Class Type', 'Math', 'Reading'))
Class Type | Math | Reading |
---|---|---|
regular | 709.5 | 719.9 |
regular+aid | 707.6 | 720.7 |
small | 709.2 | 723.4 |