Introduction to dplyr - Solutions

Exercise A - (3 min)

  1. Install the dplyr and gapminder packages.
  2. Load the dplyr and gapminder packages.
  3. Read the help file for gapminder and answer the following:
    1. How many rows and columns does gapminder contain?
    2. What information is in each row and column?
    3. What is the source of the data?

Solution

  1. Run the following:
install.packages('dplyr')
install.packages('gapminder')
  1. Run the following:
library(dplyr)
library(gapminder)
  1. Consult ?gapminder.

Exercise B - (5 minutes)

  1. What is the difference between x = 3 and x == 3 in R?
  2. On the previous slide, I put quotes around United States but not around year. Why?
  3. Use filter to choose the subset of gapminder for 2002. What happens if you replace 2002 with 2005?
  4. Store data for all Asian countries in a tibble called gapminder_asia, then display this tibble.
  5. Which country had the higher life expectancy in 1977: Ireland or Brazil? Which had the higher GDP per capita?

Solution

  1. The first assigns the value 3 to x; the second tests whether or not x is equal to 3 and returns TRUE or FALSE
  2. Because year contains numeric data while country contains character data.
  3. If you go back to the help file for gapminder you’ll see that it only contains data for every fifth year. The year 2005 isn’t in our so dplyr displayes an empty tibble in the second case:
gapminder |> 
  filter(year == 2002)
# A tibble: 142 × 6
   country     continent  year lifeExp       pop gdpPercap
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
 1 Afghanistan Asia       2002    42.1  25268405      727.
 2 Albania     Europe     2002    75.7   3508512     4604.
 3 Algeria     Africa     2002    71.0  31287142     5288.
 4 Angola      Africa     2002    41.0  10866106     2773.
 5 Argentina   Americas   2002    74.3  38331121     8798.
 6 Australia   Oceania    2002    80.4  19546792    30688.
 7 Austria     Europe     2002    79.0   8148312    32418.
 8 Bahrain     Asia       2002    74.8    656397    23404.
 9 Bangladesh  Asia       2002    62.0 135656790     1136.
10 Belgium     Europe     2002    78.3  10311970    30486.
# ℹ 132 more rows
gapminder |> 
  filter(year == 2005)
# A tibble: 0 × 6
# ℹ 6 variables: country <fct>, continent <fct>, year <int>, lifeExp <dbl>,
#   pop <int>, gdpPercap <dbl>
  1. Run the following:
gapminder_asia <- gapminder |>  
  filter(continent == 'Asia')
gapminder_asia
# A tibble: 396 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 386 more rows
  1. Ireland had the higher value of both:
gapminder |>  
  filter(year == 1977, country %in% c('Ireland', 'Brazil'))
# A tibble: 2 × 6
  country continent  year lifeExp       pop gdpPercap
  <fct>   <fct>     <int>   <dbl>     <int>     <dbl>
1 Brazil  Americas   1977    61.5 114313951     6660.
2 Ireland Europe     1977    72.0   3271900    11151.

Exercise C - (2 min)

  1. What is the lowest life expectancy in gapminder? Which country and year does it correspond to?
  2. What is the highest life expectancy in gapminder? Which country and year does it correspond to?

Solution

  1. The lowest life expectancy was in Rwanda in 1992:
gapminder |>  
  arrange(lifeExp)
# A tibble: 1,704 × 6
   country      continent  year lifeExp     pop gdpPercap
   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
 1 Rwanda       Africa     1992    23.6 7290203      737.
 2 Afghanistan  Asia       1952    28.8 8425333      779.
 3 Gambia       Africa     1952    30    284320      485.
 4 Angola       Africa     1952    30.0 4232095     3521.
 5 Sierra Leone Africa     1952    30.3 2143249      880.
 6 Afghanistan  Asia       1957    30.3 9240934      821.
 7 Cambodia     Asia       1977    31.2 6978607      525.
 8 Mozambique   Africa     1952    31.3 6446316      469.
 9 Sierra Leone Africa     1957    31.6 2295678     1004.
10 Burkina Faso Africa     1952    32.0 4469979      543.
# ℹ 1,694 more rows
  1. The highest life expectancy was in Japan in 2007:
gapminder |>  
  arrange(desc(lifeExp))
# A tibble: 1,704 × 6
   country          continent  year lifeExp       pop gdpPercap
   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
 1 Japan            Asia       2007    82.6 127467972    31656.
 2 Hong Kong, China Asia       2007    82.2   6980412    39725.
 3 Japan            Asia       2002    82   127065841    28605.
 4 Iceland          Europe     2007    81.8    301931    36181.
 5 Switzerland      Europe     2007    81.7   7554661    37506.
 6 Hong Kong, China Asia       2002    81.5   6762476    30209.
 7 Australia        Oceania    2007    81.2  20434176    34435.
 8 Spain            Europe     2007    80.9  40448191    28821.
 9 Sweden           Europe     2007    80.9   9031088    33860.
10 Israel           Asia       2007    80.7   6426679    25523.
# ℹ 1,694 more rows

Exercise D - (2 min)

  1. Select only the columns year, lifeExp, and country in gapminder.
  2. Select all the columns except year, lifeExp and country in gapminder.

Solution

# Part 1
gapminder |>  
  select(year, lifeExp, country)
# A tibble: 1,704 × 3
    year lifeExp country    
   <int>   <dbl> <fct>      
 1  1952    28.8 Afghanistan
 2  1957    30.3 Afghanistan
 3  1962    32.0 Afghanistan
 4  1967    34.0 Afghanistan
 5  1972    36.1 Afghanistan
 6  1977    38.4 Afghanistan
 7  1982    39.9 Afghanistan
 8  1987    40.8 Afghanistan
 9  1992    41.7 Afghanistan
10  1997    41.8 Afghanistan
# ℹ 1,694 more rows
# Part 2
gapminder |>  
  select(-year, -lifeExp, -country)
# A tibble: 1,704 × 3
   continent      pop gdpPercap
   <fct>        <int>     <dbl>
 1 Asia       8425333      779.
 2 Asia       9240934      821.
 3 Asia      10267083      853.
 4 Asia      11537966      836.
 5 Asia      13079460      740.
 6 Asia      14880372      786.
 7 Asia      12881816      978.
 8 Asia      13867957      852.
 9 Asia      16317921      649.
10 Asia      22227415      635.
# ℹ 1,694 more rows

Exercise E - (2 min)

  1. Compute the median life expectancy in 1977.
  2. Repeat 1 but restrict the calculation to Asian countries.

Solution

# Part 1
gapminder |> 
  filter(year == 1977) |> 
  summarize(median(lifeExp))
# A tibble: 1 × 1
  `median(lifeExp)`
              <dbl>
1              59.7
# Part 2
gapminder |> 
  filter(year == 1977, continent == 'Asia') |> 
  summarize(median(lifeExp))
# A tibble: 1 × 1
  `median(lifeExp)`
              <dbl>
1              60.8

Exercise F - (2 min)

  1. Calculate median GDP/capita in each continent in 1977.
  2. Why doesn’t this work as expected? How can you fix it?
gapminder |>
  summarize(meanLifeExp = mean(lifeExp)) |>
  group_by(year)

Solution

  1. Run the following:
gapminder |> 
  group_by(continent) |> 
  summarize(median(lifeExp))
# A tibble: 5 × 2
  continent `median(lifeExp)`
  <fct>                 <dbl>
1 Africa                 47.8
2 Americas               67.0
3 Asia                   61.8
4 Europe                 72.2
5 Oceania                73.7
  1. Here the problem is that group_by() comes after summarize(), but once we’ve summarized by computing the mean life expectancy, we’ve already “collapsed” all the years. The code works as expected if we reverse the order:
gapminder |>
  group_by(year) |> 
  summarize(meanLifeExp = mean(lifeExp)) 
# A tibble: 12 × 2
    year meanLifeExp
   <int>       <dbl>
 1  1952        49.1
 2  1957        51.5
 3  1962        53.6
 4  1967        55.7
 5  1972        57.6
 6  1977        59.6
 7  1982        61.5
 8  1987        63.2
 9  1992        64.2
10  1997        65.0
11  2002        65.7
12  2007        67.0

Exercise G - (2min)

  1. Why did I use = rather than == in the mutate() examples from the preceding two slides?
  2. Convert life expectancy from years to months.

Solution

  1. This is because we are carrying out an assignment operation. In contrast, == tests for equality, returning TRUE or `FALSE.
  2. Run the following:
gapminder |> 
  mutate(lifeExpMonths = 12 * lifeExp)
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap lifeExpMonths
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.          346.
 2 Afghanistan Asia       1957    30.3  9240934      821.          364.
 3 Afghanistan Asia       1962    32.0 10267083      853.          384.
 4 Afghanistan Asia       1967    34.0 11537966      836.          408.
 5 Afghanistan Asia       1972    36.1 13079460      740.          433.
 6 Afghanistan Asia       1977    38.4 14880372      786.          461.
 7 Afghanistan Asia       1982    39.9 12881816      978.          478.
 8 Afghanistan Asia       1987    40.8 13867957      852.          490.
 9 Afghanistan Asia       1992    41.7 16317921      649.          500.
10 Afghanistan Asia       1997    41.8 22227415      635.          501.
# ℹ 1,694 more rows

Exercise H - (2 min)

  1. Use |> to calculate the sample variance of c(4, 1, 5, NA, 3), excluding any missing values.
  2. Repeat the preceding using both |> and _.
  3. Sort gapminder in descending order by lifeExp without using |> or _.

Solution

# Part 1
c(4, 1, 5, NA, 3) |> 
  var(na.rm = TRUE)
[1] 2.916667
# Part 2
TRUE |> 
  var(c(4, 1, 5, NA, 3), na.rm = _)
[1] 2.916667
# Part 3
arrange(gapminder, desc(lifeExp))
# A tibble: 1,704 × 6
   country          continent  year lifeExp       pop gdpPercap
   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
 1 Japan            Asia       2007    82.6 127467972    31656.
 2 Hong Kong, China Asia       2007    82.2   6980412    39725.
 3 Japan            Asia       2002    82   127065841    28605.
 4 Iceland          Europe     2007    81.8    301931    36181.
 5 Switzerland      Europe     2007    81.7   7554661    37506.
 6 Hong Kong, China Asia       2002    81.5   6762476    30209.
 7 Australia        Oceania    2007    81.2  20434176    34435.
 8 Spain            Europe     2007    80.9  40448191    28821.
 9 Sweden           Europe     2007    80.9   9031088    33860.
10 Israel           Asia       2007    80.7   6426679    25523.
# ℹ 1,694 more rows

Exercise I - (5 min)

Write a single pipeline that calculates the mean and standard deviation of GDP/capita by continent and year for all years after 1997, and sorts the results in ascending order by the standard deviation.

Solution

gapminder |> 
  filter(year > 1997) |> 
  group_by(continent, year) |> 
  summarize(mean_GDPc = mean(gdpPercap), sd_GDPc = sd(gdpPercap)) |> 
  arrange(sd_GDPc)
# A tibble: 10 × 4
# Groups:   continent [5]
   continent  year mean_GDPc sd_GDPc
   <fct>     <int>     <dbl>   <dbl>
 1 Africa     2002     2599.   2973.
 2 Africa     2007     3089.   3618.
 3 Oceania    2002    26939.   5302.
 4 Oceania    2007    29810.   6541.
 5 Americas   2002     9288.   8896.
 6 Americas   2007    11003.   9713.
 7 Asia       2002    10174.  11151.
 8 Europe     2002    21712.  11197.
 9 Europe     2007    25054.  11800.
10 Asia       2007    12473.  14155.