Exercise - DAGs and Bad Controls

Author

Core Empirical Research Methods

Published

May 12, 2025

Overview

As we saw in the lecture on “Bad Controls”, Directed Acyclic Graphs (DAGs) are powerful tools for visualizing and analyzing causal relationships. They help us determine which variables we should control for when estimating causal effects from observational data.

In this exercise, we’ll work with DAGs to:

  1. Build and visualize causal models
  2. Identify different types of paths between variables
  3. Determine appropriate adjustment sets
  4. Understand when controlling for a variable can help or harm causal inference

Required Packages

We’ll use two main packages:

  • ggdag: For creating and visualizing DAGs
  • dagitty: For analyzing paths and determining adjustment sets

# Load required packages
library(ggdag)
library(dagitty)
library(tidyverse) # For data manipulation and visualization

Question 1: Warm-up

  1. Find all paths between X and Y.
  2. Find all backdoor paths between X and Y.
  3. According to the back door criterion, which variables should we adjust for to estimate the causal effect of X on Y?

Building Your First DAG

Let’s start by building a simple DAG with four variables:

  • D: Treatment variable
  • Y: Outcome variable
  • X: Confounder of D and Y (it causes both the treatment and the outcome)
  • Z: Another variable in our causal system

The dagify() function allows us to specify the relationships between variables using formulas. Each formula represents the direct causes of a variable.

# Build a DAG
myDAG <- dagify(Y ~ X + D + Z,  # Y is caused by X, D, and Z
                X ~ Z,           # X is caused by Z
                D ~ X)           # D is caused by X

# Print the DAG structure
myDAG
dag {
D
X
Y
Z
D -> Y
X -> D
X -> Y
Z -> X
Z -> Y
}
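
Because each formula lists a variable's direct causes, the resulting object can be queried for those relationships. The sketch below uses dagitty's parents(), children(), and ancestors() helpers on the same DAG; it rebuilds the DAG so that it runs on its own.

```r
library(ggdag)
library(dagitty)

# Same DAG as above, rebuilt so this snippet is self-contained
myDAG <- dagify(Y ~ X + D + Z,
                X ~ Z,
                D ~ X)

# Direct causes of Y: the right-hand side of the formula Y ~ X + D + Z
y_parents <- parents(myDAG, "Y")

# Direct effects of X: every variable whose formula includes X
x_children <- children(myDAG, "X")

# ancestors() follows arrows backwards from D
# (by default dagitty also counts D itself as its own ancestor)
d_ancestors <- ancestors(myDAG, "D")

print(y_parents)
print(x_children)
print(d_ancestors)
```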

Visualizing DAGs

The ggdag() function visualizes our DAG using ggplot2.

# Basic plot
myDAG |> ggdag() 

This basic plot is functional but not very elegant. Let’s improve it!

# Improved formatting
myDAG |> 
  ggdag(node_size = 15, text_size = 8) + # adjust size
  theme_dag()  # remove axes etc.

The layout of our DAG is determined automatically. However, we can specify the exact coordinates for each node to create a more intuitive layout.

# Create DAG with custom coordinates
myDAG <- dagify(Y ~ X + D + Z,
                X ~ Z,  
                D ~ X,
  # Specify Cartesian coordinates of each node 
  coords = list(x = c(Z = 2, X = 1, D = 1, Y = 2),
                y = c(Z = 2, X = 2, D = 1, Y = 1))
)

# Plot with custom layout
myDAG |> 
  ggdag(node_size = 15, text_size = 8) + 
  theme_dag()

Analyzing Paths in DAGs

A key step in causal analysis is identifying paths between variables. The paths() function helps us list all paths between two variables.

# List all paths between D and Y
paths(myDAG, from = 'D', to = 'Y')
$paths
[1] "D -> Y"           "D <- X -> Y"      "D <- X <- Z -> Y"

$open
[1] TRUE TRUE TRUE

Not all paths are causal paths. To list only directed (causal) paths from D to Y:

# List only directed (causal) paths
paths(myDAG, from = 'D', to = 'Y', directed = TRUE)
$paths
[1] "D -> Y"

$open
[1] TRUE
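
The $open element reports whether each path is open given the variables we condition on (none, by default). As a sketch of how conditioning changes this, paths() accepts a conditioning set through its Z argument; the snippet rebuilds the DAG so that it can be run on its own.

```r
library(ggdag)
library(dagitty)

myDAG <- dagify(Y ~ X + D + Z,
                X ~ Z,
                D ~ X)

# Condition on X and check which paths between D and Y remain open
res <- paths(myDAG, from = "D", to = "Y", Z = "X")

# Pair each path with its open/closed status for easier reading
path_status <- setNames(res$open, res$paths)
print(path_status)
```

Both backdoor paths pass through X as a non-collider, so conditioning on X closes them and leaves only the causal path D -> Y open.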

Finding Adjustment Sets

To estimate a causal effect, we need to block all backdoor paths between treatment and outcome without opening any new ones. The adjustmentSets() function lists the minimal sets of variables that achieve this. An empty set, {}, means that no adjustment is needed.

# What to adjust for to learn the D -> Y effect?
myDAG |> 
   adjustmentSets(exposure = 'D', outcome = 'Y')
{ X }
# The X -> Y effect?
myDAG |> 
   adjustmentSets(exposure = 'X', outcome = 'Y')
{ Z }
# The Z -> Y effect?
myDAG |> 
   adjustmentSets(exposure = 'Z', outcome = 'Y')
 {}
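
To see why the adjustment set matters, here is a small simulation sketch of the D -> Y case. It assumes a linear version of the DAG with standard normal noise and a true effect of D on Y equal to 1; these functional forms and coefficients are illustrative assumptions, not part of the exercise.

```r
set.seed(123)
n <- 100000

# Simulate data following the DAG: Z -> X -> D, with X, D, and Z all causing Y
Z <- rnorm(n)
X <- Z + rnorm(n)
D <- X + rnorm(n)
Y <- 1 * D + X + Z + rnorm(n)  # assumed true effect of D on Y is 1

# Unadjusted regression leaves both backdoor paths through X open
naive <- coef(lm(Y ~ D))[["D"]]

# Adjusting for X, the set returned by adjustmentSets(), blocks them
adjusted <- coef(lm(Y ~ D + X))[["D"]]

print(round(c(naive = naive, adjusted = adjusted), 2))
```

Under these assumptions the unadjusted coefficient is close to 2, while the coefficient adjusted for X is close to the true value of 1.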

Question 2: Effect of Exercise on Cancer

Consider the following DAG about exercise and cancer:

Where:

  • D: Physical activity
  • Y: Cervical cancer (Yes/No)
  • X: Positive pap smear test result (Yes/No)
  • U: Pre-cancer lesion (Yes/No) - unobserved
  • V: Health-consciousness - unobserved

Story: Health-conscious people tend to be more physically active and visit doctors more frequently, increasing the chance of a pap smear test.

Questions:

  1. Plot the DAG.
  2. List all paths between D and Y.
  3. Are there any backdoor paths between D and Y?
  4. Should we adjust for X when estimating the effect of physical activity (D) on cervical cancer (Y)? Why or why not?

Question 3: Building Your Own DAG

Consider a study on the effect of education (E) on income (I). Here are some variables that might be involved:

  • E: Education level
  • I: Income
  • A: Ability (unobserved)
  • S: Socioeconomic status of parents
  • G: Gender
  • L: Location (urban/rural)

Questions:

  1. Build a DAG representing your beliefs about the causal relationships between these variables.
  2. Visualize your DAG.
  3. Determine what variables you should control for to estimate the causal effect of education on income.
  4. How would your adjustment strategy change if some variables were unobserved?

Solutions

Question 1: Warm-up

See Chapter 4 of The Book of Why by Pearl and Mackenzie. Alternatively, use what you learn in the rest of this exercise to check your answer with dagitty!

Question 2: Effect of Exercise on Cancer

# 1. Create and plot the DAG 
exerciseDAG <- dagify(
  Y ~ D + U,
  D ~ V, 
  X ~ U + V,
  coords = list(x = c(U = 1, X = 1, V = 1, D = 2, Y = 3), 
                y = c(U = 3, X = 2, V = 1, D = 2, Y = 2))
)

exerciseDAG |> 
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()
# 2. List all paths between D and Y
paths(exerciseDAG, from = 'D', to = 'Y')
$paths
[1] "D -> Y"                "D <- V -> X <- U -> Y"

$open
[1]  TRUE FALSE
# 3. Identify backdoor paths
# The path D <- V -> X <- U -> Y is a backdoor path: it starts with an arrow pointing into D

# 4. Should we adjust for X?
exerciseDAG |> 
  adjustmentSets(exposure = 'D', outcome = 'Y')
 {}
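
As a sketch of how to check this answer directly, dagitty's isAdjustmentSet() tests a candidate set, and the Z argument of paths() shows the status of each path once we condition on X. The snippet rebuilds the DAG (without coordinates) so that it runs on its own.

```r
library(ggdag)
library(dagitty)

exerciseDAG <- dagify(Y ~ D + U,
                      D ~ V,
                      X ~ U + V)

# Is {X} a valid adjustment set for the effect of D on Y?
x_is_valid <- isAdjustmentSet(exerciseDAG, "X", exposure = "D", outcome = "Y")
print(x_is_valid)

# Which paths between D and Y are open once we condition on X?
res <- paths(exerciseDAG, from = "D", to = "Y", Z = "X")
path_status <- setNames(res$open, res$paths)
print(path_status)
```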
# Analysis:
# X is a collider on the backdoor path D <- V -> X <- U -> Y
# This path is already blocked because of the collider X
# If we condition on X, we would open this path, creating bias
# Therefore, we should NOT control for X when estimating the effect of physical activity on cancer
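
A small simulation sketch makes the bias concrete. It assumes a linear, continuous version of the DAG with standard normal noise and a true effect of D on Y equal to 0.5; in the exercise X and Y are binary, so the numbers here are purely illustrative.

```r
set.seed(123)
n <- 100000

# Unobserved causes
V <- rnorm(n)  # health-consciousness
U <- rnorm(n)  # pre-cancer lesion (treated as continuous here)

# Observed variables following the DAG
D <- V + rnorm(n)            # physical activity
X <- U + V + rnorm(n)        # test result: a collider (V -> X <- U)
Y <- 0.5 * D + U + rnorm(n)  # assumed true effect of D on Y is 0.5

# Without X, the backdoor path stays blocked at the collider
unadjusted <- coef(lm(Y ~ D))[["D"]]

# Controlling for X opens D <- V -> X <- U -> Y and biases the estimate
with_collider <- coef(lm(Y ~ D + X))[["D"]]

print(round(c(unadjusted = unadjusted, with_collider = with_collider), 2))
```

Under these assumptions the unadjusted estimate is close to 0.5, whereas adding X as a control pulls it toward 0.3.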

Question 3: Building Your Own DAG

# One possible DAG for the education-income example
educationDAG <- dagify(
  I ~ E + A + G + L,  # Income is caused by education, ability, gender, and location
  E ~ A + S + G + L,  # Education is caused by ability, SES, gender, and location
  coords = list(x = c(E = 2, I = 3, A = 1, S = 1, G = 2, L = 3),
                y = c(E = 2, I = 2, A = 3, S = 1, G = 3, L = 1))
)

# Visualize the DAG
educationDAG |> 
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()

# Find adjustment sets
educationDAG |> 
  adjustmentSets(exposure = 'E', outcome = 'I')
{ A, G, L }
# If A is unobserved
educationDAG_unobserved <- dagify(
  I ~ E + A + G + L,
  E ~ A + S + G + L,
  latent = "A",  # Specify A as unobserved
  coords = list(x = c(E = 2, I = 3, A = 1, S = 1, G = 2, L = 3),
                y = c(E = 2, I = 2, A = 3, S = 1, G = 3, L = 1))
)

# Find adjustment sets with unobserved A
educationDAG_unobserved |> 
  adjustmentSets(exposure = 'E', outcome = 'I')

# adjustmentSets() returns no set here: A is unobserved, so the backdoor path
# E <- A -> I cannot be blocked, and the effect of education on income
# cannot be identified by covariate adjustment.
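
One detail worth checking in code: adjustmentSets() should return an empty list when no valid set exists, which is different from a list containing the empty set {} (no adjustment needed). The sketch below rebuilds both versions of the DAG without coordinates and uses length() to tell the two cases apart; this reading of the return value is an assumption about dagitty's behaviour rather than something stated in the exercise.

```r
library(ggdag)
library(dagitty)

# All variables observed
educationDAG <- dagify(I ~ E + A + G + L,
                       E ~ A + S + G + L)

# Same DAG, with ability (A) marked as unobserved
educationDAG_unobserved <- dagify(I ~ E + A + G + L,
                                  E ~ A + S + G + L,
                                  latent = "A")

sets_observed <- adjustmentSets(educationDAG, exposure = "E", outcome = "I")
sets_latent <- adjustmentSets(educationDAG_unobserved, exposure = "E", outcome = "I")

# Number of valid minimal adjustment sets in each case
print(length(sets_observed))
print(length(sets_latent))

# Confirm which variables dagitty treats as unobserved
print(latents(educationDAG_unobserved))
```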