Getting Started with ggplot2

Francis J. DiTraglia

University of Oxford

ggplot2 - A Grammar of Graphics

  • Base R has simple plotting functions that allow fine control of your results but are tedious and cumbersome to use.
  • ggplot2 makes it easy to create beautiful graphics with minimal effort.
  • Why is it called ggplot2?
  • ggplot2 and dplyr are part of the tidyverse
    • install.packages('tidyverse')
    • library(tidyverse)

ggplot2 references

Don’t panic!

  • ggplot2 syntax takes a bit of getting used to
  • You are not simply telling R where to put points
  • You are explaining the structure of your graphics
  • A bit of a learning curve, but it’s worthwhile
  • “Pipeline-like” syntax but with + rather than |>
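
To make the last bullet concrete, here is a minimal sketch. It uses the built-in mtcars data, which is not part of these slides, only so that the snippet runs without extra packages. The names base_plot and with_points are made up for illustration.

```r
library(ggplot2)

# |> passes the data frame on its left into ggplot() as the first argument
base_plot <- mtcars |>
  ggplot(aes(x = wt, y = mpg))

# + adds a layer to the existing plot object
with_points <- base_plot + geom_point()

length(base_plot$layers)    # no layers yet
length(with_points$layers)  # one layer: the points
```

The pipe hands data to a function; the plus sign adds a component to a plot. That is why the two cannot be swapped.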

Recall the gapminder dataset

library(tidyverse) # includes dplyr
library(gapminder)
gapminder_2007 <- gapminder |> 
  filter(year == 2007)

A Basic Scatterplot

ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point()

Combining + with |>

gapminder_2007 |> 
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point()

💪 Exercise A - (2 min)

  1. Using the preceding slide as a template, make a scatterplot with pop on the x-axis and lifeExp on the y-axis, based on gapminder_2007.
  2. Repeat the preceding but with gdpPercap on the y-axis.

Plotting on the Log Scale

gapminder_2007 |> 
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  scale_x_log10()

Building up a plot step-by-step

myplot <- gapminder_2007 |> 
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

myplot +
  scale_x_log10()

Titles and Axis Labels

myplot +
  labs(title = 'This is the title',
       subtitle = 'This is the subtitle',
       caption = 'This is the caption') + 
  xlab('GDP / capita ($US, inflation-adjusted)') +
  ylab('Life Expectancy (years)')

💪 Exercise B - (5 min)

Label your axes and give each plot a title!

  1. Make a scatterplot with the log base 10 of pop on the x-axis and lifeExp on the y-axis using gapminder_2007.
  2. Figure out how to make a plot with the y-axis on the log scale. Then repeat my plot from the previous slide with gdpPercap in levels and lifeExp in logs.
  3. Repeat 2 but with both axes on the log scale.

ggplot2 Syntax Basics

  • Combine data with aesthetic mapping and geom
  • mapping = aes(x = gdpPercap, y = lifeExp)
    • aes is short for aesthetic
    • maps gdpPercap to the x-coordinate
    • maps lifeExp to the y-coordinate
  • geom_point() is a geometric object, geom for short
    • Uses the mapping to create a scatterplot
  • We will learn more aesthetic mappings and more geoms!
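
A related point that often trips people up is the difference between mapping an aesthetic to a variable and setting it to a fixed value. The sketch below is an illustration on the built-in mtcars data, not an example from these slides; the names mapped and fixed are made up.

```r
library(ggplot2)

# Mapping: color depends on a variable, so it goes inside aes()
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Setting: every point gets the same fixed color, so it goes outside aes()
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = 'blue')

names(mapped$mapping)  # x, y and the color aesthetic (stored as "colour")
names(fixed$mapping)   # only x and y
```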

The color Aesthetic

gapminder_2007 |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, 
             color = continent)) + 
  geom_point() +
  scale_x_log10()

The size Aesthetic

gapminder_2007 |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, 
             color = continent,
             size = pop)) + 
  geom_point() +
  scale_x_log10()

💪 Exercise C - (2 min)

  1. Would it make sense to set size = continent? What about setting color = pop?

  2. Using gapminder data from 1952, plot life expectancy on the y-axis and log population on the x-axis. Color the points by continent.

Faceting: Plots for Multiple Subsets

gapminder |> 
  filter(year %in% c(1952, 1972, 1992)) |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, 
             color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)

💪 Exercise D - (3 min)

  1. Make a scatterplot of gapminder data from 1997. Facet by continent and put GDP/capita on the log scale on the x-axis and life expectancy on the y-axis. Indicate population by the size of each point.
  2. What do you think would happen if we had tried to facet by pop rather than year? Why?

Plotting Summarized Data

gapminder |> 
  mutate(popMil = pop / 1000000) |> 
  group_by(year, continent) |>  
  summarize(totalpopMil = sum(popMil)) |> 
  ggplot(aes(x = year, y = totalpopMil, color = continent)) +
  geom_point() + 
  ylab('Total Population (Millions)')
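
Before plotting, it can help to look at the summarized data itself. This is a minimal sketch that assumes the gapminder package is installed. The name pop_by_continent and the .groups = 'drop' argument are additions for illustration, not part of the slide.

```r
library(dplyr)
library(gapminder)

pop_by_continent <- gapminder |>
  mutate(popMil = pop / 1000000) |>
  group_by(year, continent) |>
  # .groups = 'drop' returns an ungrouped tibble and silences the grouping message
  summarize(totalpopMil = sum(popMil), .groups = 'drop')

# One row per year-continent pair: this is what ggplot() receives
head(pop_by_continent)
nrow(pop_by_continent)
```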

Line Plots with geom_line()

gapminder |> 
  mutate(popMil = pop / 1000000) |> 
  group_by(year, continent) |>  
  summarize(totalpopMil = sum(popMil)) |> 
  ggplot(aes(x = year, y = totalpopMil, color = continent)) +
  geom_line() + # <---- This is the only thing that changed!
  ylab('Total Population (Millions)')

💪 Exercise E - (3 min)

  1. Try appending expand_limits(y = 0) to the previous plot. What happens? Why and when might this be helpful?
  2. Make a scatterplot with average GDP/capita across all countries contained in gapminder on the y-axis and year on the x-axis.
  3. Repeat the preceding, broken down by continent, using color to distinguish the points. Put mean GDP/capita on the log scale.
  4. Modify the last plot to include both points and lines.

Histogram: geom_histogram()

  • Count / fraction of observations in equally spaced bins
  • Notice that we only need to specify x. Why?
gapminder_2007 |> 
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 5) 
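
One way to see why only x is needed: geom_histogram() computes the bar heights itself by counting observations in each bin. The sketch below uses layer_data() to pull out the computed values. The names hist_plot and bins are made up for illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_2007 <- gapminder |>
  filter(year == 2007)

hist_plot <- gapminder_2007 |>
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)

# layer_data() returns the data frame that ggplot2 computed for the layer
bins <- layer_data(hist_plot)
bins[, c('xmin', 'xmax', 'count')]

# Every country falls in exactly one bin
sum(bins$count) == nrow(gapminder_2007)
```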

Changing the binwidth

gapminder_2007 |> 
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 1) 

💪 Exercise F - (3 min)

  1. What happens if you don’t specify a binwidth? Try it and find out!
  2. Make a histogram of GDP/capita across countries in 1977. Play around with different binwidths until you find one that gives a good summary of the data.
  3. Repeat the preceding but with GDP/capita on the log scale. Compare and contrast.

Boxplot: geom_boxplot()

  • Whiskers: most extreme observations within \(1.5 \times \text{IQR}\) of the box
  • Box: middle 50% of data
  • Lines: 25th percentile, median, 75th percentile
  • Dots: observations more than \(1.5 \times \text{IQR}\) from the box
gapminder_2007 |> 
  ggplot(aes(y = lifeExp)) +
  geom_boxplot()
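
The pieces of a boxplot can be computed by hand. The sketch below is an illustration, assuming the usual 1.5 × IQR rule and R's default quantile method. All variable names are made up for this example.

```r
library(dplyr)
library(gapminder)

life_2007 <- gapminder |>
  filter(year == 2007) |>
  pull(lifeExp)

# The three lines of the box: 25th percentile, median, 75th percentile
box <- quantile(life_2007, probs = c(0.25, 0.5, 0.75))
iqr <- unname(box[3] - box[1])

# Fences: observations beyond these are drawn as dots
lower_fence <- unname(box[1]) - 1.5 * iqr
upper_fence <- unname(box[3]) + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside <- life_2007 >= lower_fence & life_2007 <= upper_fence
whiskers <- range(life_2007[inside])
outliers <- life_2007[!inside]

box
whiskers
outliers
```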

Multiple Boxplots

  • x for the groups; y for the variable that is summarized
gapminder_2007 |> 
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

💪 Exercise G - (2 min)

Use faceting to construct a collection of boxplots, each of which compares log GDP/capita across continents in a given year.

Bar Plots with geom_col()

Here the x aesthetic is categorical: one bar per category

by_continent <- gapminder |>  
  filter(year == 2007) |> 
  group_by(continent) |> 
  summarize(meanLifeExp = mean(lifeExp))

by_continent |> 
  ggplot(aes(x = continent, y = meanLifeExp)) +
  geom_col()

Rotate a plot with coord_flip()

by_continent |> 
  ggplot(aes(x = continent, y = meanLifeExp)) +
  geom_col() +
  coord_flip()

A different theme

by_continent |> 
  ggplot(aes(x = continent, y = meanLifeExp)) +
  geom_col() +
  coord_flip() +
  theme_bw() # <------ see ?theme() for much more!

💪 Exercise H - (3 min)

  1. Go back and turn your boxplots from the last exercise sideways to make it easier to read the continent labels.

  2. Make a collection of bar plots faceted by year that compare mean GDP per capita across continents in a given year. Orient your plots so it’s easy to read the continent labels.

Cleveland Dot Charts

Bar charts are inferior to Cleveland dot charts

by_continent |> 
  ggplot(aes(x = meanLifeExp, y = continent)) + 
  theme_bw() +
  geom_point()  # <---- we've used this before!

Sort by meanLifeExp

See https://forcats.tidyverse.org/ for fct_reorder()

by_continent |> 
  mutate(continent = fct_reorder(continent, meanLifeExp)) |> 
  ggplot(aes(x = meanLifeExp, y = continent)) + 
  theme_bw() +
  geom_point()  
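
To see what fct_reorder() does on its own, here is a toy sketch. The factor region and the vector score are hypothetical and have nothing to do with gapminder. ggplot2 draws a discrete axis in the order of the factor levels, so changing the levels changes the order of the dots.

```r
library(forcats)

# Hypothetical toy data, not from gapminder
region <- factor(c('a', 'b', 'c'))
score <- c(3, 1, 2)

levels(region)                      # alphabetical order
levels(fct_reorder(region, score))  # ordered by score, smallest first
```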

A more complicated dot chart

gapminder |> 
  filter(year %in% c(1987, 2007)) |> 
  mutate(year = factor(year)) |> 
  group_by(continent, year) |> 
  summarize(meanLifeExp = mean(lifeExp)) |> 
  ggplot(aes(x = meanLifeExp, y = continent)) +
  geom_line(aes(group = continent)) +
  geom_point(aes(color = year)) +
  xlab('Average Life Expectancy') +
  ylab('Continent') +
  theme_bw()

💪 Exercise I - (3 min)

Make a dot chart of GDP per capita in all European countries in the year 2007. Sort the dots so that the country with the highest GDP per capita appears at the top and the country with the lowest appears at the bottom.

Exporting with ggsave()

  • ggsave(filename, width = , height = )
  • Defaults to saving the last plot you made
  • Can also specify plot = SOMETHING
  • filename is the path and file name
  • Supports .jpg, .png, .pdf and many others
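
A minimal sketch of a ggsave() call. The plot uses the built-in mtcars data so that the snippet is self-contained, and the output path in a temporary directory is hypothetical. Replace it with wherever you want the file to go.

```r
library(ggplot2)

# Built-in mtcars data, used only so that the snippet is self-contained
my_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical output path in a temporary directory
out_file <- file.path(tempdir(), 'my_plot.pdf')

# width and height are in inches by default; the file type comes from the extension
ggsave(out_file, plot = my_plot, width = 6, height = 4)

file.exists(out_file)
```

Passing plot = explicitly avoids relying on whichever plot happened to be drawn last.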

📚 More to learn!

  • We’ll learn more ggplot2 later this term.
  • As always, I can’t cover everything in class.
  • You will need to do some reading on your own.
  • See the references at the start of this slideshow.