Exercise - DAGs and Bad Controls
Overview
As we saw in the lecture on “Bad Controls”, Directed Acyclic Graphs (DAGs) are powerful tools for visualizing and analyzing causal relationships. They help us determine which variables we should control for when estimating causal effects from observational data.
In this exercise, we’ll work with DAGs to:
- Build and visualize causal models
- Identify different types of paths between variables
- Determine appropriate adjustment sets
- Understand when controlling for a variable can help or harm causal inference
DAGs in R
Required Packages
We’ll use two main packages:
- ggdag: For creating and visualizing DAGs
- dagitty: For analyzing paths and determining adjustment sets
# Load required packages
library(ggdag)
library(dagitty)
library(tidyverse)
Building Your First DAG
Let’s start by building a simple DAG with four variables:
- \(D\): Treatment variable
- \(Y\): Outcome variable
- \(X\): Confounder of \(D\) and \(Y\) (and a mediator of the effect of \(Z\) on \(Y\))
- \(Z\): Another variable in our causal system
The dagify() function allows us to specify the relationships between variables using formulas. Each formula represents the direct causes of a variable.
# Build a DAG
my_dag <- dagify(
  Y ~ X + D + Z, # Y is caused by X, D, and Z
  X ~ Z,         # X is caused by Z
  D ~ X          # D is caused by X
)
# Print the DAG structure
my_dag
dag {
D
X
Y
Z
D -> Y
X -> D
X -> Y
Z -> X
Z -> Y
}
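The printed structure above is written in dagitty's own text syntax, so the same graph can also be typed in directly with dagitty(). The sketch below assumes this is only an alternative way of entering the graph; the object name my_dag_text is introduced here purely for illustration. It then queries the graph for the direct causes of \(Y\).

```r
library(dagitty)

# Same structure as my_dag, written in dagitty's text syntax
# (my_dag_text is an illustrative name, not part of the exercise)
my_dag_text <- dagitty("dag {
  D -> Y
  X -> D
  X -> Y
  Z -> X
  Z -> Y
}")

# All variables in the graph
names(my_dag_text)

# Direct causes (parents) of Y
parents(my_dag_text, "Y")
```

Checking parents() against the formulas passed to dagify() is a quick way to confirm that the graph encodes the relationships we intended.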
Visualizing DAGs
The ggdag() function visualizes our DAG using ggplot2.
# Basic plot
my_dag |> ggdag()
This basic plot is functional but not very elegant. Let’s improve it!
# Improved formatting
my_dag |>
  ggdag(node_size = 15, text_size = 8) + # adjust size
  theme_dag() # remove axes etc.
The layout of our DAG is determined automatically. However, we can specify the exact coordinates for each node to create a more intuitive layout.
# Create DAG with custom coordinates
my_dag <- dagify(
  Y ~ X + D + Z,
  X ~ Z,
  D ~ X,
  # Specify Cartesian coordinates of each node
  coords = list(
    x = c(Z = 2, X = 1, D = 1, Y = 2),
    y = c(Z = 2, X = 2, D = 1, Y = 1)
  )
)
# Plot with custom layout
my_dag |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()
Analyzing Paths in DAGs
A key step in causal analysis is identifying paths between variables. The paths() function helps us list all paths between two variables.
# List all paths between D and Y
paths(my_dag, from = "D", to = "Y")
$paths
[1] "D -> Y" "D <- X -> Y" "D <- X <- Z -> Y"
$open
[1] TRUE TRUE TRUE
Not all paths are causal paths. To list only directed (causal) paths from \(D\) to \(Y\):
# List only directed (causal) paths
paths(my_dag, from = "D", to = "Y", directed = TRUE)
$paths
[1] "D -> Y"
$open
[1] TRUE
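The $open element reports whether each path is open given what we condition on, which so far has been nothing. As a minimal sketch, we can pass a conditioning set through the Z argument of paths() to see how the status changes once we adjust for \(X\). Note that Z here is the name of the argument and not the node \(Z\) in our DAG; the object name conditioned is illustrative.

```r
library(ggdag)
library(dagitty)

# Rebuild the example DAG so this block runs on its own
my_dag <- dagify(
  Y ~ X + D + Z,
  X ~ Z,
  D ~ X
)

# Condition on X: the argument is called Z, the conditioning set is "X"
conditioned <- paths(my_dag, from = "D", to = "Y", Z = "X")
conditioned$paths
conditioned$open
```

Both non-causal paths run through \(X\) as a non-collider, so conditioning on \(X\) should close them and leave only the directed path \(D \to Y\) open.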
Finding Adjustment Sets
To estimate causal effects, we need to block all backdoor paths between our treatment and outcome. The adjustmentSets() function automatically identifies variables that should be controlled for.
# What to adjust for to learn the D -> Y effect?
my_dag |>
  adjustmentSets(exposure = "D", outcome = "Y")
{ X }
# The X -> Y effect?
my_dag |>
  adjustmentSets(exposure = "X", outcome = "Y")
{ Z }
# The Z -> Y effect?
my_dag |>
  adjustmentSets(exposure = "Z", outcome = "Y")
 {}
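To see why the adjustment set matters in practice, here is a small simulation sketch. It assumes a linear data-generating process that matches my_dag, with every coefficient set to 1 (an arbitrary choice made for illustration, not something the DAG itself implies). It first confirms with isAdjustmentSet() that { X } is valid for the \(D \to Y\) effect, then compares a regression without and with \(X\).

```r
library(ggdag)
library(dagitty)

# Rebuild the example DAG so this block runs on its own
my_dag <- dagify(
  Y ~ X + D + Z,
  X ~ Z,
  D ~ X
)

# Is { X } a valid adjustment set for the effect of D on Y?
isAdjustmentSet(my_dag, "X", exposure = "D", outcome = "Y")

# Simulate data consistent with my_dag (all coefficients assumed to be 1)
set.seed(123)
n <- 10000
Z <- rnorm(n)
X <- Z + rnorm(n)
D <- X + rnorm(n)
Y <- D + X + Z + rnorm(n) # true effect of D on Y is 1

# Estimate the effect of D without and with adjustment for X
naive <- unname(coef(lm(Y ~ D))["D"])
adjusted <- unname(coef(lm(Y ~ D + X))["D"])
c(naive = naive, adjusted = adjusted)
```

Under these assumptions the unadjusted estimate picks up the open backdoor paths through \(X\) and lands well above 1, while the adjusted estimate should be close to the true effect.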
Now it’s your turn!
Exercises
Exercise 1: Effect of Exercise on Cancer
Consider the following DAG about exercise and cancer:
Where:
- \(D\): Physical activity
- \(Y\): Cervical cancer (Yes/No)
- \(X\): Positive pap smear test result (Yes/No)
- \(U\): Pre-cancer lesion (Yes/No) - unobserved
- \(V\): Health-consciousness - unobserved
Story: Health-conscious people tend to be more physically active and visit doctors more frequently, increasing the chance of a pap smear test.
Questions
- Plot the DAG.
- List all paths between \(D\) and \(Y\). Hint: You can use paths() in R for this!
- Are there any backdoor paths between \(D\) and \(Y\)?
- Should we adjust for \(X\) when estimating the effect of physical activity (\(D\)) on cervical cancer (\(Y\))? Why or why not?
Solutions
Solution for Question 1
exercise_dag <- dagify(
  Y ~ D + U,
  D ~ V,
  X ~ U + V,
  coords = list(
    x = c(U = 1, X = 1, V = 1, D = 2, Y = 3),
    y = c(U = 3, X = 2, V = 1, D = 2, Y = 2)
  )
)
exercise_dag |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()
Solution for Question 2
# List all paths between D and Y
paths(exercise_dag, from = "D", to = "Y")
Solution for Question 3
# The path D <- V -> X <- U -> Y is a backdoor path: it starts with an arrow
# pointing into D
Solution for Question 4
exercise_dag |>
  adjustmentSets(exposure = "D", outcome = "Y")
# Analysis:
# X is a collider on the backdoor path D <- V -> X <- U -> Y
# This path is already blocked because of the collider X
# If we condition on X, we would open this path, creating bias
# Therefore, we should NOT control for X when estimating the effect of physical
# activity on cancer
Exercise 2: Building Your Own DAG
Consider a study on the effect of education (\(E\)) on income (\(I\)). Here are some variables that might be involved:
- \(E\): Education level
- \(I\): Income
- \(A\): Ability (unobserved)
- \(S\): Socioeconomic status of parents
- \(G\): Gender
- \(L\): Location (urban/rural)
Questions
- Build a DAG representing your beliefs about the causal relationships between these variables.
- Visualize your DAG.
- Determine what variables you should control for to estimate the causal effect of education on income.
- How would your adjustment strategy change if some variables were unobserved?
Solutions
# One possible DAG for the education-income example
# Your solution will be different if you specify a different DAG here!
education_dag <- dagify(
  I ~ E + A + G + L, # Income caused by education, ability, gender, and location
  E ~ A + S + G + L, # Education caused by ability, SES, gender, and location
  coords = list(
    x = c(E = 2, I = 3, A = 1, S = 1, G = 2, L = 3),
    y = c(E = 2, I = 2, A = 3, S = 1, G = 3, L = 1)
  )
)
# Visualize the DAG
education_dag |>
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()
# Find adjustment sets
education_dag |>
  adjustmentSets(exposure = "E", outcome = "I")
# If A is unobserved
education_dag_unobserved <- dagify(
  I ~ E + A + G + L,
  E ~ A + S + G + L,
  latent = "A", # Specify A as unobserved
  coords = list(
    x = c(E = 2, I = 3, A = 1, S = 1, G = 2, L = 3),
    y = c(E = 2, I = 2, A = 3, S = 1, G = 3, L = 1)
  )
)
# Find adjustment sets with unobserved A
education_dag_unobserved |>
  adjustmentSets(exposure = "E", outcome = "I")
# In this case, A is unobserved, so the backdoor path E <- A -> I cannot be
# blocked by any observed variable. No valid adjustment set exists, and the
# causal effect of education on income cannot be identified by adjustment alone.
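Because adjustmentSets() prints nothing when no valid set exists, it can help to check the result programmatically. The sketch below assumes the same DAG as in the solution above (coordinates omitted, since they only affect plotting). The object names sets_unobserved and observed_only are illustrative.

```r
library(ggdag)
library(dagitty)

# Same DAG as in the solution, with A declared as unobserved
education_dag_unobserved <- dagify(
  I ~ E + A + G + L,
  E ~ A + S + G + L,
  latent = "A"
)

# Which variables are marked as latent?
latents(education_dag_unobserved)

# Number of valid adjustment sets that use observed variables only
sets_unobserved <- adjustmentSets(
  education_dag_unobserved,
  exposure = "E",
  outcome = "I"
)
length(sets_unobserved)

# Adjusting for the observed confounders G and L alone is not sufficient
observed_only <- isAdjustmentSet(
  education_dag_unobserved,
  c("G", "L"),
  exposure = "E",
  outcome = "I"
)
observed_only
```

A length of zero here means there is no set of observed variables that blocks every backdoor path, which is a different result from the empty set {} returned when no adjustment is needed.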