Problem Set - Contaminated Wells in Bangladesh
This problem uses a dataset containing household-level data from Bangladesh, wells.csv, that you can download from my website: https://ditraglia.com/data/wells.csv. Here is some background on the dataset from Gelman and Hill (2007):
Many of the wells used for drinking water in Bangladesh and other South Asian countries are contaminated with natural arsenic … a research team from the United States and Bangladesh measured all the wells [in a small region] and labeled them with their arsenic level as well as a characterization of “safe” (below 0.5 in units of hundreds of micrograms per liter, the Bangladesh standard for arsenic in drinking water) or “unsafe” (above 0.5). People with unsafe wells were encouraged to switch to nearby private or community wells or to new wells of their own construction. A few years later, the researchers returned to find out who had switched wells.
Your task in this problem is to predict which households will switch wells using the following information:
| Name | Description |
|---|---|
| dist | Distance to closest known safe well (meters) |
| arsenic | Arsenic level of respondent’s well (100s of micrograms per liter) |
| switch | Dummy variable equal to 1 if switched to a new well |
| assoc | Dummy variable equal to 1 if any member of the household is active in community organizations |
| educ | Education level of head of household (years) |
To be clear, the dataset contains only information for households with an arsenic level of 0.5 or above, as these are the households that were encouraged to switch wells. Moreover, the variable `switch` is measured after the variables `dist` and `arsenic`. In other words, `arsenic` is the arsenic level in the original contaminated well that the household was using before we learned whether they switched wells. Similarly, `dist` is the distance to the nearest safe well, measured before we learned whether the household switched wells.
To answer this question, you will need three additional pieces of information:
- The Bayes classifier is a rule for predicting whether \(Y=1\) or \(Y=0\) given \(X\) based on the estimated probability \(\hat{P}(Y=1|X)\). If this probability is greater than \(1/2\), the Bayes classifier predicts \(\hat{Y} = 1\); otherwise it predicts \(\hat{Y} = 0\).
- The error rate of a classifier is defined as the fraction of incorrect predictions that it makes.
- For a binary prediction problem the confusion matrix is a \(2\times 2\) matrix that counts up the number of true positives, \((\hat{Y} = Y = 1)\), true negatives \((\hat{Y} = Y = 0)\), false positives \((\hat{Y} = 1, Y=0)\), and false negatives \((\hat{Y} = 0, Y = 1)\). The confusion matrix can be used, among other things, to calculate the sensitivity and specificity of a classifier.
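As a compact summary of these definitions, write \(TP\), \(TN\), \(FP\), and \(FN\) for the four counts in the confusion matrix and \(n\) for the total number of observations. These abbreviations are notation introduced here for convenience. The standard formulas are then:

\[
\text{error rate} = \frac{FP + FN}{n}, \qquad
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}
\]

In words, sensitivity is the fraction of observations with \(Y = 1\) that the classifier predicts correctly, and specificity is the fraction of observations with \(Y = 0\) that it predicts correctly.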
Exercises
1. Preliminaries:
   a. Load the data and store it in a tibble called `wells`.
   b. Use `dplyr` to create a variable called `larsenic` that equals the natural logarithm of `arsenic`.
   c. Use `ggplot2` to make a histogram of `arsenic` and `larsenic`. Be sure to label your plots appropriately. Comment on your findings.
   d. Use `dplyr` to create a variable called `dist100` that contains the same information as `dist` but measured in hundreds of meters rather than in meters.
   e. Use `dplyr` to create a variable called `zeduc` that equals the z-score of `educ`, i.e. `educ` after subtracting its mean and dividing by its standard deviation.
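One possible starting point for the preliminaries is sketched below. It assumes the `readr`, `dplyr`, and `ggplot2` packages are installed and that the CSV at the URL given above has the columns listed in the table. The bin count and the plot labels are illustrative choices, not requirements.

```r
library(readr)
library(dplyr)
library(ggplot2)

# read_csv() returns a tibble
wells <- read_csv("https://ditraglia.com/data/wells.csv")

# Create the transformed variables in a single mutate() call
wells <- wells %>%
  mutate(larsenic = log(arsenic),                     # natural log of arsenic
         dist100 = dist / 100,                        # meters -> hundreds of meters
         zeduc = (educ - mean(educ)) / sd(educ))      # z-score of education

# Histogram of arsenic in levels (bins = 30 is an arbitrary choice)
ggplot(wells, aes(x = arsenic)) +
  geom_histogram(bins = 30) +
  labs(x = "Arsenic (100s of micrograms per liter)",
       y = "Number of households",
       title = "Arsenic level of respondent's well")

# Histogram of arsenic in logs
ggplot(wells, aes(x = larsenic)) +
  geom_histogram(bins = 30) +
  labs(x = "Log arsenic",
       y = "Number of households",
       title = "Log arsenic level of respondent's well")
```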
2. First Regression: `fit1`
   a. Run a logistic regression using `dist100` to predict `switch` and store the result in an object called `fit1`.
   b. Use `ggplot2` to plot the logistic regression function from part (a) along with the data, jittered appropriately.
   c. Discuss your results from parts (a)-(b). In particular: based on `fit1`, is `dist100` a statistically significant predictor of `switch`? Does the sign of its coefficient make sense? Explain.
   d. Use `fit1` to calculate the predicted probability of switching wells for the average household in the dataset.
   e. Use `fit1` to calculate the marginal effect of `dist100` for the average household in the dataset. Interpret your result. How does it compare to the “divide by 4” rule and the average partial effect?
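The computational parts of this question could be approached along the following lines. This is a sketch, not a model answer: it interprets "the average household" as a household whose `dist100` equals the sample mean, and the jitter height, transparency, and intermediate names (`avg`, `b`, `me_avg`, `div4`, `ape`) are illustrative assumptions.

```r
library(readr)
library(dplyr)
library(ggplot2)

wells <- read_csv("https://ditraglia.com/data/wells.csv") %>%
  mutate(dist100 = dist / 100)

# (a) Logistic regression of switch on dist100
fit1 <- glm(switch ~ dist100, family = binomial(link = "logit"), data = wells)
summary(fit1)

# (b) Fitted logistic curve with vertically jittered data
ggplot(wells, aes(x = dist100, y = switch)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.3) +
  stat_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE) +
  labs(x = "Distance to closest safe well (100s of meters)",
       y = "Switched wells (1 = yes)")

# (d) Predicted probability at the sample mean of dist100
avg <- data.frame(dist100 = mean(wells$dist100))
p_avg <- predict(fit1, newdata = avg, type = "response")

# (e) Marginal effect at the mean: coefficient times the logistic density
#     evaluated at the linear predictor for the average household
b <- coef(fit1)
me_avg <- b["dist100"] * dlogis(b["(Intercept)"] + b["dist100"] * avg$dist100)

# "Divide by 4" rule: upper bound on the magnitude of the marginal effect
div4 <- b["dist100"] / 4

# Average partial effect: average the marginal effect over all households
ape <- b["dist100"] * mean(dlogis(predict(fit1, type = "link")))
```

Here `dlogis()` is the logistic density, which equals \(\Lambda(z)[1 - \Lambda(z)]\) for the logistic CDF \(\Lambda\), so no hand-coded derivative is needed.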
3. Predictive performance of `fit1`
   a. Add a column called `p1` to `wells` containing the predicted probabilities that `switch` equals one based on `fit1`.
   b. Add a column called `pred1` to `wells` that gives the predicted values of \(y\) that correspond to `p1`.
   c. Use `pred1` to calculate the error rate of the Bayes classifier constructed from `fit1` based on the full dataset, i.e. `wells`. Recall that this classifier uses the predicted probabilities from `fit1` in the following way: \(p>0.5 \implies\) predict 1, \(p\leq 0.5\implies\) predict 0. Hint: you can do this using the `summarize` function from `dplyr`.
   d. Use `pred1` to construct the confusion matrix for `fit1`. Hint: use the function `table`.
   e. Based on your results from (d), calculate the sensitivity and specificity of the predictions from `fit1`.
   f. Comment on your results from (c) and (e). In particular, compare them to the error rate that you would obtain by simply predicting the most common value of `switch` for every observation in the dataset. (This is called the “null model” since it doesn’t use any predictors.)
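A minimal sketch of one way to organize these calculations follows. The object names `error1`, `confusion1`, `sensitivity1`, `specificity1`, and `null_error` are hypothetical, and the choice to put predictions in rows and actual outcomes in columns is an assumption; the transposed layout works equally well.

```r
library(readr)
library(dplyr)

wells <- read_csv("https://ditraglia.com/data/wells.csv") %>%
  mutate(dist100 = dist / 100)

fit1 <- glm(switch ~ dist100, family = binomial(link = "logit"), data = wells)

# (a)-(b) Predicted probabilities and Bayes-classifier predictions
wells <- wells %>%
  mutate(p1 = as.numeric(predict(fit1, type = "response")),
         pred1 = as.integer(p1 > 0.5))

# (c) Error rate: fraction of predictions that differ from the outcome
error1 <- wells %>%
  summarize(error_rate = mean(pred1 != switch))

# (d) Confusion matrix; fixing the factor levels keeps the table 2x2
#     even if one class is never predicted
confusion1 <- table(predicted = factor(wells$pred1, levels = c(0, 1)),
                    actual = factor(wells$switch, levels = c(0, 1)))

# (e) Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
sensitivity1 <- confusion1["1", "1"] / sum(confusion1[, "1"])
specificity1 <- confusion1["0", "0"] / sum(confusion1[, "0"])

# (f) Null model: always predict the most common value of switch
null_error <- wells %>%
  summarize(null_error = min(mean(switch), 1 - mean(switch)))
```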
4. Additional regressions: `fit2`, `fit3`, and `fit4`
   a. Run a logistic regression using `larsenic` to predict `switch` and store the results in an object called `fit2`.
   b. Run a logistic regression using `zeduc` to predict `switch` and store the results in an object called `fit3`.
   c. Run a logistic regression using `dist100`, `larsenic`, and `zeduc` to predict `switch` and store the results in an object called `fit4`.
   d. Make a nicely-formatted summary table of the results from `fit1`, `fit2`, `fit3`, and `fit4`. Make sure to add appropriate labels and captions, use a reasonable number of decimal places, etc.
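The sketch below shows one way to fit the four models and tabulate them. The problem set does not name a table package; using `modelsummary` here is an assumption, and the title and coefficient labels are illustrative.

```r
library(readr)
library(dplyr)
library(modelsummary)  # assumed table package; others would work too

wells <- read_csv("https://ditraglia.com/data/wells.csv") %>%
  mutate(larsenic = log(arsenic),
         dist100 = dist / 100,
         zeduc = (educ - mean(educ)) / sd(educ))

fit1 <- glm(switch ~ dist100, family = binomial(link = "logit"), data = wells)
fit2 <- glm(switch ~ larsenic, family = binomial(link = "logit"), data = wells)
fit3 <- glm(switch ~ zeduc, family = binomial(link = "logit"), data = wells)
fit4 <- glm(switch ~ dist100 + larsenic + zeduc,
            family = binomial(link = "logit"), data = wells)

# One column per model, two decimal places, readable coefficient labels
modelsummary(list("fit1" = fit1, "fit2" = fit2, "fit3" = fit3, "fit4" = fit4),
             fmt = 2,
             title = "Logistic regressions predicting well switching",
             coef_rename = c("dist100" = "Distance (100s of meters)",
                             "larsenic" = "Log arsenic",
                             "zeduc" = "Education (z-score)"))
```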
5. Interpreting `fit2`, `fit3` and `fit4`
   a. Repeat parts (b) and (c) of question #2 above with `fit2` in place of `fit1`.
   b. Repeat parts (b) and (c) of question #2 above with `fit3` in place of `fit1`.
   c. Calculate the marginal effect of each predictor in `fit4` for the average household in the dataset. Interpret your results. How do they compare to the “divide by 4” rule?
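When comparing marginal effects to the “divide by 4” rule, it may help to recall where the rule comes from. The following is a sketch under the standard logit model, with notation introduced here: \(\Lambda(z) = e^z / (1 + e^z)\) is the logistic function and \(\beta_j\) is the coefficient on predictor \(X_j\).

\[
P(Y = 1 | X) = \Lambda(X'\beta)
\quad \implies \quad
\frac{\partial P(Y = 1 | X)}{\partial X_j} = \beta_j \, \Lambda(X'\beta) \left[ 1 - \Lambda(X'\beta) \right]
\]

Because \(\Lambda(z)[1 - \Lambda(z)] \leq 1/4\), with equality exactly when \(\Lambda(z) = 1/2\), the magnitude of the marginal effect of \(X_j\) can never exceed \(|\beta_j|/4\). The rule is therefore an upper bound that is tight for households whose predicted probability is close to one half.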
6. Predictive Performance of `fit4`
   a. Repeat question #3 from above with `fit4`, `p4`, and `pred4` in place of `fit1`, `p1` and `pred1`.
   b. Using your results from (a) and question #3 above, compare the in-sample predictive performance of `fit1` and `fit4`.
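For the final comparison, a small helper can put the in-sample metrics for both models side by side. The function `classifier_summary` below is hypothetical, introduced only for illustration; it applies the 0.5 threshold of the Bayes classifier to the fitted probabilities of any `glm` fit.

```r
library(readr)
library(dplyr)

wells <- read_csv("https://ditraglia.com/data/wells.csv") %>%
  mutate(larsenic = log(arsenic),
         dist100 = dist / 100,
         zeduc = (educ - mean(educ)) / sd(educ))

fit1 <- glm(switch ~ dist100, family = binomial(link = "logit"), data = wells)
fit4 <- glm(switch ~ dist100 + larsenic + zeduc,
            family = binomial(link = "logit"), data = wells)

# Hypothetical helper: in-sample error rate, sensitivity, and specificity
# of the Bayes classifier built from a fitted logistic regression
classifier_summary <- function(fit, y) {
  pred <- as.integer(predict(fit, type = "response") > 0.5)
  tibble(error_rate = mean(pred != y),
         sensitivity = mean(pred[y == 1] == 1),   # share of actual 1s predicted as 1
         specificity = mean(pred[y == 0] == 0))   # share of actual 0s predicted as 0
}

# One row per model, labeled by the argument names
comparison <- bind_rows(fit1 = classifier_summary(fit1, wells$switch),
                        fit4 = classifier_summary(fit4, wells$switch),
                        .id = "model")
comparison
```

Because both models are evaluated on the same data used to fit them, these figures describe in-sample performance only.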