Exercise - Contaminated Wells in Bangladesh
This problem uses a dataset containing household-level data from Bangladesh, `wells.csv`, that you can download from my website: https://ditraglia.com/data/wells.csv. Here is some background on the dataset from Gelman and Hill (2007):
Many of the wells used for drinking water in Bangladesh and other South Asian countries are contaminated with natural arsenic … a research team from the United States and Bangladesh measured all the wells [in a small region] and labeled them with their arsenic level as well as a characterization of “safe” (below 0.5 in units of hundreds of micrograms per liter, the Bangladesh standard for arsenic in drinking water) or “unsafe” (above 0.5). People with unsafe wells were encouraged to switch to nearby private or community wells or to new wells of their own construction. A few years later, the researchers returned to find out who had switched wells.
Your task in this problem is to predict which households will switch wells using the following information:
Name | Description |
---|---|
dist | Distance to closest known safe well (meters) |
arsenic | Arsenic level of respondent’s well (100s of micrograms per liter) |
switch | Dummy variable equal to 1 if switched to a new well |
assoc | Dummy variable equal to 1 if any member of the household is active in community organizations |
educ | Education level of head of household (years) |
To be clear, the dataset contains only information for households with an arsenic level of 0.5 or above, as these are the households that were encouraged to switch wells. Moreover, the variable `switch` is measured after the variables `dist` and `arsenic`. In other words, `arsenic` is the arsenic level in the original contaminated well that the household was using before we learned whether they switched wells. Similarly, `dist` is the distance to the nearest safe well, measured before we knew whether the household switched wells.
To answer this question, you will need three additional pieces of information:
- The Bayes classifier is a rule for predicting whether \(Y=1\) or \(Y=0\) given \(X\) based on the estimated probability \(\hat{P}(Y=1|X)\). If this probability is greater than \(1/2\), the Bayes classifier predicts \(\hat{Y} = 1\); otherwise it predicts \(\hat{Y} = 0\).
- The error rate of a classifier is defined as the fraction of incorrect predictions that it makes.
- For a binary prediction problem the confusion matrix is a \(2\times 2\) matrix that counts up the number of true positives \((\hat{Y} = Y = 1)\), true negatives \((\hat{Y} = Y = 0)\), false positives \((\hat{Y} = 1, Y=0)\), and false negatives \((\hat{Y} = 0, Y = 1)\). The confusion matrix can be used, among other things, to calculate the sensitivity (the fraction of observations with \(Y=1\) that are correctly predicted) and specificity (the fraction of observations with \(Y=0\) that are correctly predicted) of a classifier.
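The three definitions above can be illustrated on a small toy example. This is only a sketch: the vectors `y` and `p_hat` are made-up numbers, not values from `wells.csv`, and the object names are hypothetical.

```r
# Toy data (made up for illustration; not taken from wells.csv)
y     <- c(1, 0, 1, 1, 0, 0, 1, 0)                  # observed outcomes
p_hat <- c(0.9, 0.4, 0.6, 0.3, 0.7, 0.2, 0.8, 0.5)  # estimated P(Y = 1 | X)

# Bayes classifier: predict 1 when the estimated probability exceeds 1/2
y_hat <- as.numeric(p_hat > 0.5)

# Error rate: fraction of incorrect predictions
error_rate <- mean(y_hat != y)

# Confusion matrix: rows are predictions, columns are observed outcomes
confusion <- table(predicted = y_hat, actual = y)

# Sensitivity: share of observations with Y = 1 that are predicted as 1
sensitivity <- confusion["1", "1"] / sum(confusion[, "1"])

# Specificity: share of observations with Y = 0 that are predicted as 0
specificity <- confusion["0", "0"] / sum(confusion[, "0"])
```

Note that an estimated probability of exactly 0.5 is classified as 0 here, matching the rule "greater than \(1/2\)" stated above.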
Exercises
- Preliminaries:
  - Load the data and store it in a tibble called `wells`.
  - Use `dplyr` to create a variable called `larsenic` that equals the natural logarithm of `arsenic`.
  - Use `ggplot2` to make a histogram of `arsenic` and `larsenic`. Be sure to label your plots appropriately. Comment on your findings.
  - Use `dplyr` to create a variable called `dist100` that contains the same information as `dist` but measured in hundreds of meters rather than in meters.
  - Use `dplyr` to create a variable called `zeduc` that equals the z-score of `educ`, i.e. `educ` after subtracting its mean and dividing by its standard deviation.
- First Regression: `fit1`
  - Run a logistic regression using `dist100` to predict `switch` and store the result in an object called `fit1`.
  - Use `ggplot2` to plot the logistic regression function from part (a) along with the data, jittered appropriately.
  - Discuss your results from parts (a)-(b). In particular: based on `fit1`, is `dist100` a statistically significant predictor of `switch`? Does the sign of its coefficient make sense? Explain.
  - Use `fit1` to calculate the predicted probability of switching wells for the average household in the dataset.
  - Use `fit1` to calculate the marginal effect of `dist100` for the average household in the dataset. Interpret your result. How does it compare to the “divide by 4” rule and the average partial effect?
- Predictive performance of `fit1`
  - Add a column called `p1` to `wells` containing the predicted probabilities that `switch` equals one based on `fit1`.
  - Add a column called `pred1` to `wells` that gives the predicted values of \(y\) that correspond to `p1`.
  - Use `pred1` to calculate the error rate of the Bayes classifier constructed from `fit1` based on the full dataset, i.e. `wells`. Recall that this classifier uses the predicted probabilities from `fit1` in the following way: \(p>0.5 \implies\) predict 1, \(p\leq 0.5\implies\) predict 0. Hint: you can do this using the `summarize` function from `dplyr`.
  - Use `pred1` to construct the confusion matrix for `fit1`. Hint: use the function `table`.
  - Based on your results from (d), calculate the sensitivity and specificity of the predictions from `fit1`.
  - Comment on your results from (c) and (e). In particular, compare them to the error rate that you would obtain by simply predicting the most common value of `switch` for every observation in the dataset. (This is called the “null model” since it doesn’t use any predictors.)
- Additional regressions: `fit2`, `fit3`, and `fit4`
  - Run a logistic regression using `larsenic` to predict `switch` and store the results in an object called `fit2`.
  - Run a logistic regression using `zeduc` to predict `switch` and store the results in an object called `fit3`.
  - Run a logistic regression using `dist100`, `larsenic`, and `zeduc` to predict `switch` and store the results in an object called `fit4`.
  - Make a nicely-formatted summary table of the results from `fit1`, `fit2`, `fit3`, and `fit4`. Make sure to add appropriate labels and captions, use a reasonable number of decimal places, etc.
- Interpreting `fit2`, `fit3` and `fit4`
  - Repeat parts (b) and (c) of question #2 above with `fit2` in place of `fit1`.
  - Repeat parts (b) and (c) of question #2 above with `fit3` in place of `fit1`.
  - Calculate the marginal effect of each predictor in `fit4` for the average household in the dataset. Interpret your results. How do they compare to the “divide by 4” rule?
- Predictive Performance of `fit4`
  - Repeat question #3 from above with `fit4`, `p4`, and `pred4` in place of `fit1`, `p1` and `pred1`.
  - Using your results from (a) and question #3 above, compare the in-sample predictive performance of `fit1` and `fit4`.
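The marginal-effect questions above can be sketched on simulated data. For a logit model with a single predictor, \(P(Y=1|X) = \Lambda(\alpha + \beta X)\) where \(\Lambda(z) = e^z/(1+e^z)\), the marginal effect of \(X\) is \(\beta \, p(1-p)\) with \(p = \Lambda(\alpha + \beta X)\). Since \(p(1-p) \leq 1/4\), the marginal effect can never exceed \(\beta/4\) in absolute value, which is the “divide by 4” rule. The following is only an illustrative sketch: the data are simulated, the coefficient values are arbitrary, and the names `x`, `y`, and `fit` are hypothetical rather than part of `wells.csv`.

```r
set.seed(1234)

# Simulated data (hypothetical; x stands in for a single predictor)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 - 1 * x))

# Logistic regression of y on x
fit <- glm(y ~ x, family = binomial(link = "logit"))
b <- coef(fit)

# Predicted probability at the sample mean of x
p_bar <- plogis(b[["(Intercept)"]] + b[["x"]] * mean(x))

# Marginal effect at the mean: beta * p * (1 - p)
me_at_mean <- b[["x"]] * p_bar * (1 - p_bar)

# "Divide by 4" rule: the largest possible magnitude of the marginal effect
divide_by_4 <- b[["x"]] / 4

# Average partial effect: average of beta * p_i * (1 - p_i) over observations
p_i <- predict(fit, type = "response")
ape <- mean(b[["x"]] * p_i * (1 - p_i))
```

The marginal effect at the mean evaluates the derivative at a single point, while the average partial effect averages the derivative over all observations, so the two generally differ; both are bounded in magnitude by the “divide by 4” value.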