Exercise - DAGs and Bad Controls

Author

Johanna Barop and Frank DiTraglia

Published

May 14, 2026

Overview

As we saw in the lecture on “Bad Controls”, Directed Acyclic Graphs (DAGs) are powerful tools for visualizing and analyzing causal relationships. They help us determine which variables we should control for when estimating causal effects from observational data.

In this exercise, we’ll work with DAGs to:

Build and visualize causal models
Identify different types of paths between variables
Determine appropriate adjustment sets
Understand when controlling for a variable can help or harm causal inference

DAGs in `R`

Required Packages

We’ll use two main packages:

ggdag: For creating and visualizing DAGs
dagitty: For analyzing paths and determining adjustment sets

# Load required packages
library(ggdag)
library(dagitty)
library(tidyverse)

Building Your First DAG

Let’s start by building a simple DAG with four variables:

\(D\): Treatment variable
\(Y\): Outcome variable
\(X\): Confounder of \(D\) and \(Y\), and mediator of the effect of \(Z\) on \(Y\)
\(Z\): A common cause of \(X\) and \(Y\)

The dagify() function allows us to specify the relationships between variables using formulas. Each formula represents the direct causes of a variable.

# Build a DAG
my_dag <- dagify(
  Y ~ X + D + Z, # Y is caused by X, D, and Z
  X ~ Z, # X is caused by Z
  D ~ X # D is caused by X
)

# Print the DAG structure
my_dag

dag {
D
X
Y
Z
D -> Y
X -> D
X -> Y
Z -> X
Z -> Y
}

Visualizing DAGs

The ggdag() function visualizes our DAG using ggplot2.

# Basic plot
my_dag |> ggdag()

This basic plot is functional but not very elegant. Let’s improve it!

# Improved formatting
my_dag |>
  ggdag(node_size = 15, text_size = 8) + # adjust size
  theme_dag() # remove axes etc.

The layout of our DAG is determined automatically. However, we can specify the exact coordinates for each node to create a more intuitive layout.

# Create DAG with custom coordinates
my_dag <- dagify(Y ~ X + D + Z,
  X ~ Z,
  D ~ X,
  # Specify Cartesian coordinates of each node
  coords = list(
    x = c(Z = 2, X = 1, D = 1, Y = 2),
    y = c(Z = 2, X = 2, D = 1, Y = 1)
  )
)

# Plot with custom layout
my_dag |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()

Analyzing Paths in DAGs

A key step in causal analysis is identifying paths between variables. The paths() function helps us list all paths between two variables.

# List all paths between D and Y
paths(my_dag, from = "D", to = "Y")

$paths
[1] "D -> Y"           "D <- X -> Y"      "D <- X <- Z -> Y"

$open
[1] TRUE TRUE TRUE

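The $open element reports, for each path, whether it is open given the variables we condition on (none, by default). Not all paths are causal paths: here the second and third are backdoor paths that run through \(X\). To list only directed (causal) paths from \(D\) to \(Y\):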

# List only directed (causal) paths
paths(my_dag, from = "D", to = "Y", directed = TRUE)

$paths
[1] "D -> Y"

$open
[1] TRUE
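
The $open element becomes more informative once we condition on something. As a minimal sketch, the snippet below rebuilds the same graph with dagitty's string syntax so that it runs on its own (the names `g`, `p_given_x`, and `open_paths` are only illustrative) and passes a conditioning set through the Z argument of paths(). Conditioning on \(X\) should close both backdoor paths and leave only the causal path open.

```r
library(dagitty)

# Same graph as my_dag, written in dagitty's own syntax
g <- dagitty("dag {
  D -> Y
  X -> D
  X -> Y
  Z -> X
  Z -> Y
}")

# Paths between D and Y, evaluated given that we condition on X
p_given_x <- paths(g, from = "D", to = "Y", Z = "X")
p_given_x

# Keep only the paths that remain open after conditioning on X
open_paths <- p_given_x$paths[p_given_x$open]
open_paths
```

Comparing this with the unconditional output above shows which paths a given control variable blocks.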

Finding Adjustment Sets

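To estimate a causal effect by covariate adjustment, we need to block all backdoor paths between our treatment and outcome without conditioning on any descendant of the treatment. The adjustmentSets() function lists sets of variables that achieve this. By default it returns the minimal sufficient adjustment sets for the total effect.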

# What to adjust for to learn the D -> Y effect?
my_dag |>
  adjustmentSets(exposure = "D", outcome = "Y")

{ X }

# The X -> Y effect?
my_dag |>
  adjustmentSets(exposure = "X", outcome = "Y")

{ Z }

# The Z -> Y effect? (an empty set means no adjustment is needed)
my_dag |>
  adjustmentSets(exposure = "Z", outcome = "Y")

{}
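
To see why the adjustment set matters in practice, here is a small simulation sketch in base R. It assumes a linear version of my_dag with made-up coefficients (the true effect of \(D\) on \(Y\) is set to 1). Neither the functional form nor the numbers come from the exercise. It compares a regression without controls to one that controls for \(X\).

```r
set.seed(1)
n <- 10000

# Simulate from a linear version of my_dag (all coefficients are made up)
Z <- rnorm(n)
X <- Z + rnorm(n)              # Z -> X
D <- X + rnorm(n)              # X -> D
Y <- 1 * D + X + Z + rnorm(n)  # true effect of D on Y is 1

# No controls: both backdoor paths through X are open
coef_unadjusted <- coef(lm(Y ~ D))[["D"]]

# Control for X, the adjustment set reported by adjustmentSets()
coef_adjusted <- coef(lm(Y ~ D + X))[["D"]]

c(unadjusted = coef_unadjusted, adjusted = coef_adjusted)
```

Under these assumptions the unadjusted slope should land near 2, since \(\text{Cov}(D, Y) / \text{Var}(D) = 6 / 3\), while the adjusted slope should be close to the true value of 1.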

Now it’s your turn!

Exercises

Exercise 1: Effect of Exercise on Cancer

Consider the following DAG about exercise and cancer, with edges \(V \to D\), \(V \to X\), \(U \to X\), \(U \to Y\), and \(D \to Y\):

Where:

\(D\): Physical activity
\(Y\): Cervical cancer (Yes/No)
\(X\): Positive pap smear test result (Yes/No)
\(U\): Pre-cancer lesion (Yes/No) - unobserved
\(V\): Health-consciousness - unobserved

Story: Health-conscious people tend to be more physically active and visit doctors more frequently, increasing the chance of a pap smear test.

Questions

Plot the DAG.
List all paths between \(D\) and \(Y\). Hint: You can use the paths() function from dagitty for this!
Are there any backdoor paths between \(D\) and \(Y\)?
Should we adjust for \(X\) when estimating the effect of physical activity (\(D\)) on cervical cancer (\(Y\))? Why or why not?

Solutions

Show the solution for Question 1

exercise_dag <- dagify(
  Y ~ D + U,
  D ~ V,
  X ~ U + V,
  coords = list(
    x = c(U = 1, X = 1, V = 1, D = 2, Y = 3),
    y = c(U = 3, X = 2, V = 1, D = 2, Y = 2)
  )
)

exercise_dag |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()

Show the solution for Question 2

# 2. List all paths between D and Y
paths(exercise_dag, from = "D", to = "Y")

Show the solution for Question 3

# Yes: D <- V -> X <- U -> Y is a backdoor path because it starts with an arrow
# pointing into D. It is the only backdoor path in this DAG.

Show the solution for Question 4

exercise_dag |>
  adjustmentSets(exposure = "D", outcome = "Y")

# Analysis:
# adjustmentSets() returns the empty set {}: no adjustment is needed.
# X is a collider on the backdoor path D <- V -> X <- U -> Y, so this path is
# already blocked without any controls.
# Conditioning on X would open the path and bias the estimate.
# Therefore, we should NOT control for X when estimating the effect of physical
# activity on cancer. X is a "bad control".
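
The collider argument can also be checked numerically. The sketch below is a linear stand-in for exercise_dag: all variables are continuous and every coefficient is made up (the true effect of \(D\) on \(Y\) is set to 0.5), so it illustrates the mechanism and is not a model of the actual cancer data.

```r
set.seed(1)
n <- 10000

# Linear stand-in for exercise_dag (continuous variables, made-up coefficients)
V <- rnorm(n)                # health-consciousness
U <- rnorm(n)                # pre-cancer lesion
D <- V + rnorm(n)            # V -> D
X <- U + V + rnorm(n)        # U -> X <- V, so X is a collider
Y <- 0.5 * D + U + rnorm(n)  # true effect of D on Y is 0.5

# Without X the backdoor path stays blocked
coef_no_control <- coef(lm(Y ~ D))[["D"]]

# Conditioning on the collider X opens D <- V -> X <- U -> Y
coef_bad_control <- coef(lm(Y ~ D + X))[["D"]]

c(no_control = coef_no_control, bad_control = coef_bad_control)
```

Under these assumptions the regression without controls should recover a slope near 0.5, while adding \(X\) should pull the estimate away from the truth (toward roughly 0.3 in this setup).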

Exercise 2: Building Your Own DAG

Consider a study on the effect of education (\(E\)) on income (\(I\)). Here are some variables that might be involved:

\(E\): Education level
\(I\): Income
\(A\): Ability (unobserved)
\(S\): Socioeconomic status of parents
\(G\): Gender
\(L\): Location (urban/rural)

Questions

Build a DAG representing your beliefs about the causal relationships between these variables.
Visualize your DAG.
Determine what variables you should control for to estimate the causal effect of education on income.
How would your adjustment strategy change if some variables were unobserved?

Solutions

Show the solution

# One possible DAG for the education-income example
# Your solution will be different if you specify a different DAG here!
education_dag <- dagify(
  I ~ E + A + G + L, # Income caused by education, ability, gender, and location
  E ~ A + S + G + L, # Education caused by ability, SES, gender, and location
  coords = list(
    x = c(E = 2, I = 3, A = 1, S = 1, G = 2, L = 3),
    y = c(E = 2, I = 2, A = 3, S = 1, G = 3, L = 1)
  )
)

# Visualize the DAG
education_dag |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()

# Find adjustment sets
education_dag |>
  adjustmentSets(exposure = "E", outcome = "I")

# If A is unobserved
education_dag_unobserved <- dagify(
  I ~ E + A + G + L,
  E ~ A + S + G + L,
  latent = "A", # Specify A as unobserved
  coords = list(
    x = c(E = 2, I = 3, A = 1, S = 1, G = 2, L = 3),
    y = c(E = 2, I = 2, A = 3, S = 1, G = 3, L = 1)
  )
)

# Find adjustment sets with unobserved A
education_dag_unobserved |>
  adjustmentSets(exposure = "E", outcome = "I")

# In this case adjustmentSets() finds no valid set: A is unobserved and lies on
# the backdoor path E <- A -> I, so we cannot identify the causal effect of
# education on income by covariate adjustment.
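
As a quick check of the two results above, the following sketch rebuilds the same edges with dagitty's string syntax so that it runs on its own, then compares the adjustment sets with and without a latent \(A\). The object names are illustrative, and the expectations in the comments follow from this particular DAG only.

```r
library(dagitty)

# Same edges as education_dag, written in dagitty's own syntax
g_edu <- dagitty("dag {
  A -> E
  A -> I
  E -> I
  G -> E
  G -> I
  L -> E
  L -> I
  S -> E
}")

# All variables observed: one minimal adjustment set is expected
sets_observed <- adjustmentSets(g_edu, exposure = "E", outcome = "I")
sets_observed

# Mark A as latent, mirroring latent = "A" in dagify()
g_edu_latent <- g_edu
latents(g_edu_latent) <- "A"

# A can no longer be adjusted for, so no valid set should remain
sets_latent <- adjustmentSets(g_edu_latent, exposure = "E", outcome = "I")
length(sets_latent)
```

Note that \(S\) never appears in an adjustment set here: it affects income only through education, so it opens no backdoor path.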
