# Load required packages
library(ggdag)
library(dagitty)
library(tidyverse) # For data manipulation and visualization
Exercise - DAGs and Bad Controls
Overview
As we saw in the lecture on “Bad Controls”, Directed Acyclic Graphs (DAGs) are powerful tools for visualizing and analyzing causal relationships. They help us determine which variables we should control for when estimating causal effects from observational data.
In this exercise, we’ll work with DAGs to:
- Build and visualize causal models
- Identify different types of paths between variables
- Determine appropriate adjustment sets
- Understand when controlling for a variable can help or harm causal inference
Required Packages
We’ll use two main packages:
- ggdag: For creating and visualizing DAGs
- dagitty: For analyzing paths and determining adjustment sets
Question 1: Warm-up
- Find all paths between X and Y.
- Find all backdoor paths between X and Y.
- According to the back door criterion, which variables should we adjust for to estimate the causal effect of X on Y?
Building Your First DAG
Let’s start by building a simple DAG with four variables:
- D: Treatment variable
- Y: Outcome variable
- X: Confounder of D and Y (it also mediates the effect of Z on Y)
- Z: Another variable in our causal system
The dagify() function allows us to specify the relationships between variables using formulas. Each formula puts a variable on the left-hand side and its direct causes on the right.
# Build a DAG
myDAG <- dagify(Y ~ X + D + Z, # Y is caused by X, D, and Z
                X ~ Z,         # X is caused by Z
                D ~ X)         # D is caused by X
# Print the DAG structure
myDAG
dag {
D
X
Y
Z
D -> Y
X -> D
X -> Y
Z -> X
Z -> Y
}
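As a quick sanity check on the formulas, the direct causes and direct effects of any node can be read back from the DAG. The sketch below rebuilds the same DAG and queries it with dagitty's parents() and children(); the name checkDAG is only for illustration.

```r
library(ggdag)
library(dagitty)

# Rebuild the DAG from this section (checkDAG is an illustrative name)
checkDAG <- dagify(Y ~ X + D + Z,
                   X ~ Z,
                   D ~ X)

# Direct causes of Y: the right-hand side of the formula Y ~ X + D + Z
y_parents <- parents(checkDAG, "Y")
print(sort(y_parents))

# Direct effects of X: every variable whose formula mentions X
x_children <- children(checkDAG, "X")
print(sort(x_children))
```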
Visualizing DAGs
The ggdag() function visualizes our DAG using ggplot2.
# Basic plot
myDAG |> ggdag()
This basic plot is functional but not very elegant. Let’s improve it!
# Improved formatting
myDAG |>
  ggdag(node_size = 15, text_size = 8) + # adjust size
  theme_dag() # remove axes etc.
The layout of our DAG is determined automatically. However, we can specify the exact coordinates for each node to create a more intuitive layout.
# Create DAG with custom coordinates
myDAG <- dagify(Y ~ X + D + Z,
                X ~ Z,
                D ~ X,
                # Specify Cartesian coordinates of each node
                coords = list(x = c(Z = 2, X = 1, D = 1, Y = 2),
                              y = c(Z = 2, X = 2, D = 1, Y = 1))
)
# Plot with custom layout
myDAG |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()
Analyzing Paths in DAGs
A key step in causal analysis is identifying paths between variables. The paths() function helps us list all paths between two variables.
# List all paths between D and Y
paths(myDAG, from = 'D', to = 'Y')
$paths
[1] "D -> Y" "D <- X -> Y" "D <- X <- Z -> Y"
$open
[1] TRUE TRUE TRUE
Not all paths are causal paths. To list only directed (causal) paths from D to Y:
# List only directed (causal) paths
paths(myDAG, from = 'D', to = 'Y', directed = TRUE)
$paths
[1] "D -> Y"
$open
[1] TRUE
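Whether a path is open depends on what we condition on. As a minimal sketch, paths() accepts a conditioning set through its Z argument (a function argument, not the node Z in our DAG). Conditioning on X should close both backdoor paths while leaving the causal path open. The name pathDAG is only for illustration.

```r
library(ggdag)
library(dagitty)

# Rebuild the DAG from this section (pathDAG is an illustrative name)
pathDAG <- dagify(Y ~ X + D + Z, X ~ Z, D ~ X)

# Path status with nothing conditioned on
unadjusted <- paths(pathDAG, from = "D", to = "Y")

# Path status after conditioning on X (passed via the Z argument)
adjusted <- paths(pathDAG, from = "D", to = "Y", Z = "X")

# Name each open/closed flag by its path for easier reading
open_before <- setNames(unadjusted$open, unadjusted$paths)
open_after <- setNames(adjusted$open, adjusted$paths)
print(open_before)
print(open_after)
```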
Finding Adjustment Sets
To estimate causal effects, we need to block all backdoor paths between our treatment and outcome. The adjustmentSets() function identifies sets of variables that are sufficient to control for; by default it returns the minimal sets.
# What to adjust for to learn the D -> Y effect?
myDAG |>
  adjustmentSets(exposure = 'D', outcome = 'Y')
{ X }
# The X -> Y effect?
myDAG |>
  adjustmentSets(exposure = 'X', outcome = 'Y')
{ Z }
# The Z -> Y effect?
myDAG |>
  adjustmentSets(exposure = 'Z', outcome = 'Y')
{}
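To check a specific candidate set rather than list the minimal ones, dagitty also provides isAdjustmentSet(). The sketch below tests a few candidates for the D -> Y effect; the name adjDAG is only for illustration.

```r
library(ggdag)
library(dagitty)

# Rebuild the DAG from this section (adjDAG is an illustrative name)
adjDAG <- dagify(Y ~ X + D + Z, X ~ Z, D ~ X)

# X alone blocks both backdoor paths, so it is a valid adjustment set
x_ok <- isAdjustmentSet(adjDAG, "X", exposure = "D", outcome = "Y")

# Adding Z is unnecessary but still valid
xz_ok <- isAdjustmentSet(adjDAG, c("X", "Z"), exposure = "D", outcome = "Y")

# Z alone leaves D <- X -> Y open, so it is not sufficient
z_ok <- isAdjustmentSet(adjDAG, "Z", exposure = "D", outcome = "Y")

print(c(X = x_ok, XZ = xz_ok, Z = z_ok))
```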
Question 2: Effect of Exercise on Cancer
Consider the following DAG about exercise and cancer, with edges V -> D, V -> X, U -> X, U -> Y, and D -> Y:
Where:
- D: Physical activity
- Y: Cervical cancer (Yes/No)
- X: Positive pap smear test result (Yes/No)
- U: Pre-cancer lesion (Yes/No) - unobserved
- V: Health-consciousness - unobserved
Story: Health-conscious people tend to be more physically active and visit doctors more frequently, increasing the chance of a pap smear test.
Questions:
- Plot the DAG.
- List all paths between D and Y.
- Are there any backdoor paths between D and Y?
- Should we adjust for X when estimating the effect of physical activity (D) on cervical cancer (Y)? Why or why not?
Question 3: Building Your Own DAG
Consider a study on the effect of education (E) on income (I). Here are some variables that might be involved:
- E: Education level
- I: Income
- A: Ability (unobserved)
- S: Socioeconomic status of parents
- G: Gender
- L: Location (urban/rural)
Questions:
- Build a DAG representing your beliefs about the causal relationships between these variables.
- Visualize your DAG.
- Determine what variables you should control for to estimate the causal effect of education on income.
- How would your adjustment strategy change if some variables were unobserved?
Solutions
Question 1: Warm-up
See Chapter 4 of The Book of Why by Pearl and Mackenzie. Alternatively, use what you learn in the rest of this exercise to check the answer with dagitty!
Question 2: Effect of Exercise on Cancer
# 1. Create and plot the DAG
exerciseDAG <- dagify(
  Y ~ D + U,
  D ~ V,
  X ~ U + V,
  coords = list(x = c(U = 1, X = 1, V = 1, D = 2, Y = 3),
                y = c(U = 3, X = 2, V = 1, D = 2, Y = 2))
)
exerciseDAG |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()
# 2. List all paths between D and Y
paths(exerciseDAG, from = 'D', to = 'Y')
$paths
[1] "D -> Y" "D <- V -> X <- U -> Y"
$open
[1] TRUE FALSE
# 3. Identify backdoor paths
# The path D <- V -> X <- U -> Y is a backdoor path: it starts with an arrow pointing into D
# 4. Should we adjust for X?
exerciseDAG |>
  adjustmentSets(exposure = 'D', outcome = 'Y')
{}
# Analysis:
# X is a collider on the backdoor path D <- V -> X <- U -> Y
# This path is already blocked because of the collider X
# If we condition on X, we would open this path, creating bias
# Therefore, we should NOT control for X when estimating the effect of physical activity on cancer
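The collider argument can be checked in two ways. The sketch below first asks dagitty whether conditioning on X opens the backdoor path, then simulates data from a linear version of this DAG. The simulation is purely illustrative: the variables are continuous rather than Yes/No, and the true effect of 0.5 and all other coefficients are assumptions chosen for the example, not values from the exercise.

```r
library(ggdag)
library(dagitty)

# Same structure as the exercise DAG (colliderDAG is an illustrative name)
colliderDAG <- dagify(Y ~ D + U, D ~ V, X ~ U + V)

# Conditioning on the collider X opens D <- V -> X <- U -> Y
cond_paths <- paths(colliderDAG, from = "D", to = "Y", Z = "X")
print(cond_paths)

# Simulated data (hypothetical coefficients; assumed true effect of D on Y is 0.5)
set.seed(1)
n <- 1e5
V <- rnorm(n)               # health-consciousness
U <- rnorm(n)               # pre-cancer lesion
D <- V + rnorm(n)           # physical activity
X <- U + V + rnorm(n)       # test result, a collider of U and V
Y <- 0.5 * D + U + rnorm(n) # outcome

# Without X, the estimate should be close to the assumed 0.5
est_unadjusted <- coef(lm(Y ~ D))[["D"]]

# Controlling for X pulls the estimate away from 0.5
est_adjusted <- coef(lm(Y ~ D + X))[["D"]]

print(c(unadjusted = est_unadjusted, adjusted = est_adjusted))
```

Under these assumptions, the regression without X should recover a coefficient near 0.5, while the regression that controls for X moves noticeably away from it. This is the bias that the blocked collider path was protecting us from.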
Question 3: Building Your Own DAG
# One possible DAG for the education-income example
educationDAG <- dagify(
  I ~ E + A + G + L, # Income is caused by education, ability, gender, and location
  E ~ A + S + G + L, # Education is caused by ability, SES, gender, and location
  coords = list(x = c(E = 2, I = 3, A = 1, S = 1, G = 2, L = 3),
                y = c(E = 2, I = 2, A = 3, S = 1, G = 3, L = 1))
)
# Visualize the DAG
educationDAG |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()
# Find adjustment sets
educationDAG |>
  adjustmentSets(exposure = 'E', outcome = 'I')
{ A, G, L }
# If A is unobserved
educationDAG_unobserved <- dagify(
  I ~ E + A + G + L,
  E ~ A + S + G + L,
  latent = "A", # Specify A as unobserved
  coords = list(x = c(E = 2, I = 3, A = 1, S = 1, G = 2, L = 3),
                y = c(E = 2, I = 2, A = 3, S = 1, G = 3, L = 1))
)
# Find adjustment sets with unobserved A
educationDAG_unobserved |>
  adjustmentSets(exposure = 'E', outcome = 'I')
# No adjustment set is returned: A is unobserved and lies on the open backdoor
# path E <- A -> I, so covariate adjustment cannot identify the causal effect
# of education on income in this DAG.
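When no valid adjustment set exists, adjustmentSets() prints nothing, which is easy to confuse with the empty set {} that means "no adjustment needed". A minimal sketch of telling the two cases apart by counting the returned sets (observedDAG and latentDAG are illustrative names):

```r
library(ggdag)
library(dagitty)

# Same education DAG, once with A observed and once with A latent
observedDAG <- dagify(I ~ E + A + G + L, E ~ A + S + G + L)
latentDAG <- dagify(I ~ E + A + G + L, E ~ A + S + G + L, latent = "A")

sets_observed <- adjustmentSets(observedDAG, exposure = "E", outcome = "I")
sets_latent <- adjustmentSets(latentDAG, exposure = "E", outcome = "I")

# Number of valid minimal adjustment sets in each case
n_observed <- length(sets_observed)
n_latent <- length(sets_latent)
print(c(observed = n_observed, latent = n_latent))

# Confirm which nodes dagitty treats as unobserved
print(latents(latentDAG))
```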