Getting Started with dplyr

Francis J. DiTraglia

University of Oxford

Introduction

dplyr - “A grammar of data manipulation”

  • Makes it easy to carry out common data wrangling tasks that can be tedious in base R.
  • Simple, intuitive syntax; pretty darned fast!
  • Main competitor is data.table. Tradeoff is (roughly) speed versus ease of use. But see also dtplyr

dplyr References

A Little Example

employees <- data.frame(
  name = c("Alice", "Bob", "Cathy", "David", "Eva", 
           "Frank", "Grace", "Hank", "Ivy", "Jack"),
  age = c(25, 31, 28, 40, 35, 23, 30, 45, 33, 29),
  department = c("HR", "IT", "Finance", "IT", "HR", 
                 "Finance", "IT", "HR", "Finance", "IT"),
  salary = c(50000, 60000, 55000, 70000, 53000, 
             51000, 62000, 71000, 57000, 59000)
)

employees
    name age department salary
1  Alice  25         HR  50000
2    Bob  31         IT  60000
3  Cathy  28    Finance  55000
4  David  40         IT  70000
5    Eva  35         HR  53000
6  Frank  23    Finance  51000
7  Grace  30         IT  62000
8   Hank  45         HR  71000
9    Ivy  33    Finance  57000
10  Jack  29         IT  59000

Which is easier to understand?

over_30 <- employees[employees$age > 30, ]
over_30_sorted <- over_30[order(over_30$salary, decreasing = TRUE), ]
over_30_sorted[, c("name", "department", "salary")]
   name department salary
8  Hank         HR  71000
4 David         IT  70000
2   Bob         IT  60000
9   Ivy    Finance  57000
5   Eva         HR  53000
library(dplyr)

employees |> 
  filter(age > 30) |> 
  arrange(desc(salary)) |> 
  select(name, department, salary)
   name department salary
1  Hank         HR  71000
2 David         IT  70000
3   Bob         IT  60000
4   Ivy    Finance  57000
5   Eva         HR  53000

🔧 The “Pipe” |>

  • Much dplyr’s elegance comes from its use of “the pipe”
  • Included in Base R since version 4.1
  • If you don’t like typing it out:
    • Tools > Global Options > Code > Editing
    • Select “use native pipe”
    • Control-M to insert |> (that’s a capital M!)
  • Start by using the pipe, then circle back to explain it.

R Packages

  • R packages are add ons that extend the functionality of R.
  • dplyr is a package: not part of base R
  • Install: install.packages('package_name')
  • Load: library(package_name)
  • Install only once (the first time), but load every time you want to use (after restarting R).

💪 Exercise A - (3 min)

  1. Install the dplyr and gapminder packages.
  2. Load the dplyr and gapminder packages.
  3. Read the help file for gapminder and answer the following:
    1. How many rows and columns does gapminder contain?
    2. What information is in each row and column?
    3. What is the source of the data?

What is a tibble?

  • dplyr combines the pipe |> with verbs that act on tibbles.
  • gapminder is a tibble.
  • A tibble is a data frame with some nifty extras.
  • Why “tibble”? Try pronouncing tbl out loud…
  • For now: don’t worry about the details. Let’s experiment!

See how nicely it displays!

library(gapminder)
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

dplyr “Verbs”

Filter Rows: filter()

keeps only rows matching a logical condition

gapminder |> 
  filter(year == 2007)
# A tibble: 142 × 6
   country     continent  year lifeExp       pop gdpPercap
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
 1 Afghanistan Asia       2007    43.8  31889923      975.
 2 Albania     Europe     2007    76.4   3600523     5937.
 3 Algeria     Africa     2007    72.3  33333216     6223.
 4 Angola      Africa     2007    42.7  12420476     4797.
 5 Argentina   Americas   2007    75.3  40301927    12779.
 6 Australia   Oceania    2007    81.2  20434176    34435.
 7 Austria     Europe     2007    79.8   8199783    36126.
 8 Bahrain     Asia       2007    75.6    708573    29796.
 9 Bangladesh  Asia       2007    64.1 150448339     1391.
10 Belgium     Europe     2007    79.4  10392226    33693.
# ℹ 132 more rows

We haven’t deleted anything!

gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

To store rather than display

gapminder_2007 <- gapminder |> 
  filter(year == 2007)

gapminder_2007
# A tibble: 142 × 6
   country     continent  year lifeExp       pop gdpPercap
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
 1 Afghanistan Asia       2007    43.8  31889923      975.
 2 Albania     Europe     2007    76.4   3600523     5937.
 3 Algeria     Africa     2007    72.3  33333216     6223.
 4 Angola      Africa     2007    42.7  12420476     4797.
 5 Argentina   Americas   2007    75.3  40301927    12779.
 6 Australia   Oceania    2007    81.2  20434176    34435.
 7 Austria     Europe     2007    79.8   8199783    36126.
 8 Bahrain     Asia       2007    75.6    708573    29796.
 9 Bangladesh  Asia       2007    64.1 150448339     1391.
10 Belgium     Europe     2007    79.4  10392226    33693.
# ℹ 132 more rows

Filtering on two or more variables

gapminder |> 
  filter(year == 2007, country == 'United States')
# A tibble: 1 × 6
  country       continent  year lifeExp       pop gdpPercap
  <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
1 United States Americas   2007    78.2 301139947    42952.

💪 Exercise B - (5 minutes)

  1. What is the difference between x = 3 and x == 3 in R?
  2. On the previous slide, I put quotes around United States but not around year. Why?
  3. Use filter to choose the subset of gapminder for 2002. What happens if you replace 2002 with 2005?
  4. Store data for all Asian countries in a tibble called gapminder_asia, then display this tibble.
  5. Which country had the higher life expectancy in 1977: Ireland or Brazil? Which had the higher GDP per capita?

Sort Data: arrange()

order rows (ascending) by the values of one or more columns

gapminder |> 
  arrange(gdpPercap)
# A tibble: 1,704 × 6
   country          continent  year lifeExp      pop gdpPercap
   <fct>            <fct>     <int>   <dbl>    <int>     <dbl>
 1 Congo, Dem. Rep. Africa     2002    45.0 55379852      241.
 2 Congo, Dem. Rep. Africa     2007    46.5 64606759      278.
 3 Lesotho          Africa     1952    42.1   748747      299.
 4 Guinea-Bissau    Africa     1952    32.5   580653      300.
 5 Congo, Dem. Rep. Africa     1997    42.6 47798986      312.
 6 Eritrea          Africa     1952    35.9  1438760      329.
 7 Myanmar          Asia       1952    36.3 20092996      331 
 8 Lesotho          Africa     1957    45.0   813338      336.
 9 Burundi          Africa     1952    39.0  2445618      339.
10 Eritrea          Africa     1957    38.0  1542611      344.
# ℹ 1,694 more rows

Descending order with desc()

gapminder |> 
  arrange(desc(gdpPercap))
# A tibble: 1,704 × 6
   country   continent  year lifeExp     pop gdpPercap
   <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
 1 Kuwait    Asia       1957    58.0  212846   113523.
 2 Kuwait    Asia       1972    67.7  841934   109348.
 3 Kuwait    Asia       1952    55.6  160000   108382.
 4 Kuwait    Asia       1962    60.5  358266    95458.
 5 Kuwait    Asia       1967    64.6  575003    80895.
 6 Kuwait    Asia       1977    69.3 1140357    59265.
 7 Norway    Europe     2007    80.2 4627926    49357.
 8 Kuwait    Asia       2007    77.6 2505559    47307.
 9 Singapore Asia       2007    80.0 4553009    47143.
10 Norway    Europe     2002    79.0 4535591    44684.
# ℹ 1,694 more rows

💪 Exercise C - (2 min)

  1. What is the lowest life expectancy in gapminder? Which country and year does it correspond to?
  2. What is the highest life expectancy in gapminder? Which country and year does it correspond to?

Choose Columns: select()

keep (or drop) columns based on their names and types

gapminder |> 
  select(pop, country, year)
# A tibble: 1,704 × 3
        pop country      year
      <int> <fct>       <int>
 1  8425333 Afghanistan  1952
 2  9240934 Afghanistan  1957
 3 10267083 Afghanistan  1962
 4 11537966 Afghanistan  1967
 5 13079460 Afghanistan  1972
 6 14880372 Afghanistan  1977
 7 12881816 Afghanistan  1982
 8 13867957 Afghanistan  1987
 9 16317921 Afghanistan  1992
10 22227415 Afghanistan  1997
# ℹ 1,694 more rows

Minus sign to drop columns

gapminder |> 
  select(-pop, -year)
# A tibble: 1,704 × 4
   country     continent lifeExp gdpPercap
   <fct>       <fct>       <dbl>     <dbl>
 1 Afghanistan Asia         28.8      779.
 2 Afghanistan Asia         30.3      821.
 3 Afghanistan Asia         32.0      853.
 4 Afghanistan Asia         34.0      836.
 5 Afghanistan Asia         36.1      740.
 6 Afghanistan Asia         38.4      786.
 7 Afghanistan Asia         39.9      978.
 8 Afghanistan Asia         40.8      852.
 9 Afghanistan Asia         41.7      649.
10 Afghanistan Asia         41.8      635.
# ℹ 1,694 more rows

💪 Exercise D - (2 min)

  1. Select only the columns year, lifeExp, and country in gapminder.
  2. Select all the columns except year, lifeExp and country in gapminder.

The summarize() Verb

Compute summary statistics (collapse to a single row)

# Summary statistics for 2007
gapminder_2007 |> 
# syntax: result_name = f(column_name)
  summarize(mean_lifeExp = mean(lifeExp),
            sd_lifeExp = sd(lifeExp),
            mean_pop = mean(pop))
# A tibble: 1 × 3
  mean_lifeExp sd_lifeExp  mean_pop
         <dbl>      <dbl>     <dbl>
1         67.0       12.1 44021220.

Uglier without names; still works

gapminder_2007 |> 
  summarize(mean(lifeExp),
            sd(lifeExp),
            mean(pop))
# A tibble: 1 × 3
  `mean(lifeExp)` `sd(lifeExp)` `mean(pop)`
            <dbl>         <dbl>       <dbl>
1            67.0          12.1   44021220.

💪 Exercise E - (2 min)

  1. Compute the median life expectancy in 1977.
  2. Repeat 1 but restrict the calculation to Asian countries.

Create groups with group_by()

  • What summarize() really does: summarize groups
gapminder |> 
  group_by(year) |> 
  summarize(meanPop = mean(pop) / 1e6, meanLifeExp = mean(lifeExp))
# A tibble: 12 × 3
    year meanPop meanLifeExp
   <int>   <dbl>       <dbl>
 1  1952    17.0        49.1
 2  1957    18.8        51.5
 3  1962    20.4        53.6
 4  1967    22.7        55.7
 5  1972    25.2        57.6
 6  1977    27.7        59.6
 7  1982    30.2        61.5
 8  1987    33.0        63.2
 9  1992    36.0        64.2
10  1997    38.8        65.0
11  2002    41.5        65.7
12  2007    44.0        67.0

Grouping by multiple variables

gapminder |> 
  group_by(year, continent) |> 
  summarize(meanPop = mean(pop) / 1e6, meanLifeExp = mean(lifeExp))
# A tibble: 60 × 4
# Groups:   year [12]
    year continent meanPop meanLifeExp
   <int> <fct>       <dbl>       <dbl>
 1  1952 Africa       4.57        39.1
 2  1952 Americas    13.8         53.3
 3  1952 Asia        42.3         46.3
 4  1952 Europe      13.9         64.4
 5  1952 Oceania      5.34        69.3
 6  1957 Africa       5.09        41.3
 7  1957 Americas    15.5         56.0
 8  1957 Asia        47.4         49.3
 9  1957 Europe      14.6         66.7
10  1957 Oceania      5.97        70.3
# ℹ 50 more rows

💪 Exercise F - (2 min)

  1. Calculate median GDP/capita in each continent in 1977.
  2. Why doesn’t this work as expected? How can you fix it?
gapminder |>
  summarize(meanLifeExp = mean(lifeExp)) |>
  group_by(year)

Modify Columns with mutate()

# Convert population to *millions*
gapminder |> 
  mutate(pop = pop / 1e6) 
# A tibble: 1,704 × 6
   country     continent  year lifeExp   pop gdpPercap
   <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8.43      779.
 2 Afghanistan Asia       1957    30.3  9.24      821.
 3 Afghanistan Asia       1962    32.0 10.3       853.
 4 Afghanistan Asia       1967    34.0 11.5       836.
 5 Afghanistan Asia       1972    36.1 13.1       740.
 6 Afghanistan Asia       1977    38.4 14.9       786.
 7 Afghanistan Asia       1982    39.9 12.9       978.
 8 Afghanistan Asia       1987    40.8 13.9       852.
 9 Afghanistan Asia       1992    41.7 16.3       649.
10 Afghanistan Asia       1997    41.8 22.2       635.
# ℹ 1,694 more rows

Create Columns with mutate()

# Calculate GDP from GDP / capita and population
gapminder |> 
  mutate(GDP = pop * gdpPercap)
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap          GDP
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
 2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
 3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
 4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
 5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
 6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
 7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
 8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
 9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
# ℹ 1,694 more rows

💪 Exercise G - (2min)

  1. Why did I use = rather than == in the mutate() examples from the preceding two slides?
  2. Convert life expectancy from years to months.

Understanding the Pipe

These are exactly equivalent:

# Without the pipe 
filter(gapminder, year == 2007, country == 'United States')
# A tibble: 1 × 6
  country       continent  year lifeExp       pop gdpPercap
  <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
1 United States Americas   2007    78.2 301139947    42952.
# With the pipe
gapminder |> 
  filter(year == 2007, country == 'United States')
# A tibble: 1 × 6
  country       continent  year lifeExp       pop gdpPercap
  <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
1 United States Americas   2007    78.2 301139947    42952.

|> supplies 1st function argument

c(1, 5, 2, 7, 2) |> 
  mean() # empty parentheses: no other arguments
[1] 3.4
c(1, 5, 2, 7, 2, NA) |> 
  mean(na.rm = TRUE) # other function arguments
[1] 3.4

_ supplies other (named) argument

TRUE |> 
  mean(c(1, 5, 2, 7, 2, NA), na.rm = _)
[1] 3.4

💪 Exercise H - (2 min)

  1. Use |> to calculate the sample variance of c(4, 1, 5, NA, 3), excluding any missing values.
  2. Repeat the preceding using both |> and _.
  3. Sort gapminder in descending order by lifeExp without using |> or _.

Chaining Commands

The real power of |> and dplyr comes from chaining

gapminder |> 
  filter(year == 1982) |> 
  group_by(continent) |> 
  summarize(median(lifeExp))
# A tibble: 5 × 2
  continent `median(lifeExp)`
  <fct>                 <dbl>
1 Africa                 50.8
2 Americas               67.4
3 Asia                   63.7
4 Europe                 73.5
5 Oceania                74.3

💪 Exercise I - (5 min)

Write a single pipeline that calculates the mean and standard deviation of GDP/capita by continent and year for all years after 1997, and sorts the results in ascending order by the standard deviation.

To pipe or not to pipe?

# This is crying out for a pipeline! 
dat <- h(dat, arg1)
dat <- g(dat, arg2)
dat <- f(dat, arg3)

# Much nicer
dat |> 
  h(arg1) |> 
  g(arg2) |> 
  f(arg3) 

# This is how it is actually evaluated
f(g(h(dat, arg1), arg2), arg3)

To pipe or not to pipe?

  • Avoid really long pipelines: say max of 10 steps.
  • Group conceptually related steps together.
  • Pipelines are supposed to make code easier to understand.

Break these rules sooner than say anything outright barbarous. -George Orwell

|> versus %>%

  • Older dplyr code, including core ERM 2022, has %>%
  • %>% is from a package called magrittr
  • |> is “a little faster” and “a little simpler and more robust”
  • Read this if you want to understand the subtleties
  • I will always use |> and so should you.

📚 More to learn!

  • We’ll learn more dplyr throughout the term.
  • But as mentioned in our first lecture, I can’t possibly cover everything you need to know.
  • You should also do some more reading on your own now.
  • See the references at the start of this slideshow.