Exercise - Contaminated Wells in Bangladesh

This problem uses a dataset containing household-level data from Bangladesh, wells.csv, that you can download from my website: https://ditraglia.com/data/wells.csv. Here is some background on the dataset from Gelman and Hill (2007):

Many of the wells used for drinking water in Bangladesh and other South Asian countries are contaminated with natural arsenic … a research team from the United States and Bangladesh measured all the wells [in a small region] and labeled them with their arsenic level as well as a characterization of “safe” (below 0.5 in units of hundreds of micrograms per liter, the Bangladesh standard for arsenic in drinking water) or “unsafe” (above 0.5). People with unsafe wells were encouraged to switch to nearby private or community wells or to new wells of their own construction. A few years later, the researchers returned to find out who had switched wells.

Your task in this problem is to predict which households will switch wells using the following information:

Name	Description
dist	Distance to closest known safe well (meters)
arsenic	Arsenic level of respondent’s well (100s of micrograms per liter)
switch	Dummy variable equal to 1 if switched to a new well
assoc	Dummy variable equal to 1 if any member of the household is active in community organizations
educ	Education level of head of household (years)

To be clear, the dataset contains information only for households with an arsenic level of 0.5 or above, as these are the households that were encouraged to switch wells. Moreover, the variable switch is measured after the variables dist and arsenic. In other words, arsenic is the arsenic level in the original contaminated well that the household was using before we learned whether they switched wells. Similarly, dist is the distance to the nearest safe well, measured before we learned whether the household switched wells.

To answer this question, you will need three additional pieces of information:

The Bayes classifier is a rule for predicting whether \(Y=1\) or \(Y=0\) given \(X\) based on the estimated probability \(\hat{P}(Y=1|X)\). If this probability is greater than \(1/2\), the Bayes classifier predicts \(\hat{Y} = 1\); otherwise it predicts \(\hat{Y} = 0\).
The error rate of a classifier is defined as the fraction of incorrect predictions that it makes.
For a binary prediction problem the confusion matrix is a \(2\times 2\) matrix that counts up the number of true positives \((\hat{Y} = Y = 1)\), true negatives \((\hat{Y} = Y = 0)\), false positives \((\hat{Y} = 1, Y=0)\), and false negatives \((\hat{Y} = 0, Y = 1)\). The confusion matrix can be used, among other things, to calculate the sensitivity and specificity of a classifier.
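
As a rough illustration of these three definitions, here is a minimal base-R sketch. The vectors y and p_hat are made-up values chosen for illustration only; they are not taken from wells.csv. The sketch also assumes the usual definitions of sensitivity (the share of actual ones predicted as one) and specificity (the share of actual zeros predicted as zero).

```r
# Hypothetical data: true outcomes and estimated probabilities P(Y = 1 | X).
# These numbers are invented for illustration, not taken from wells.csv.
y     <- c(1, 0, 1, 1, 0, 0, 1, 0)
p_hat <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.5)

# Bayes classifier: predict 1 when the estimated probability exceeds 1/2,
# otherwise predict 0 (so a probability of exactly 0.5 maps to 0).
y_hat <- as.integer(p_hat > 0.5)

# Error rate: fraction of incorrect predictions.
error_rate <- mean(y_hat != y)

# Confusion matrix: rows are predictions, columns are actual outcomes.
# Fixing the factor levels guarantees a 2 x 2 table even if a class is never predicted.
confusion <- table(predicted = factor(y_hat, levels = c(0, 1)),
                   actual    = factor(y,     levels = c(0, 1)))

true_pos  <- confusion["1", "1"]
true_neg  <- confusion["0", "0"]
false_pos <- confusion["1", "0"]
false_neg <- confusion["0", "1"]

# Sensitivity: share of actual ones that are predicted correctly.
sensitivity <- true_pos / (true_pos + false_neg)
# Specificity: share of actual zeros that are predicted correctly.
specificity <- true_neg / (true_neg + false_pos)

print(confusion)
cat("error rate:", error_rate, "\n")
cat("sensitivity:", sensitivity, " specificity:", specificity, "\n")
```

With these made-up values the classifier makes one false positive and one false negative out of eight predictions.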

Exercises

Preliminaries:
1. Load the data and store it in a tibble called wells.
2. Use dplyr to create a variable called larsenic that equals the natural logarithm of arsenic.
3. Use ggplot2 to make histograms of arsenic and larsenic. Be sure to label your plots appropriately. Comment on your findings.
4. Use dplyr to create a variable called dist100 that contains the same information as dist but measured in hundreds of meters rather than in meters.
5. Use dplyr to create a variable called zeduc that equals the z-score of educ, i.e. educ after subtracting its mean and dividing by its standard deviation.
First Regression: fit1
1. Run a logistic regression using dist100 to predict switch and store the result in an object called fit1.
2. Use ggplot2 to plot the logistic regression function from part 1 along with the data, jittered appropriately.
3. Discuss your results from parts 1-2. In particular: based on fit1, is dist100 a statistically significant predictor of switch? Does the sign of its coefficient make sense? Explain.
4. Use fit1 to calculate the predicted probability of switching wells for the average household in the dataset.
5. Use fit1 to calculate the marginal effect of dist100 for the average household in the dataset. Interpret your result. How does it compare to the “divide by 4” rule and the average partial effect?
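
For reference, the quantities named in the last part can be written out as follows. This is a sketch in notation that is assumed here rather than taken from the exercise: \(\alpha\) and \(\beta\) denote the intercept and slope of a logistic regression with a single predictor \(x\), \(\Lambda\) denotes the logistic function, and hats denote estimates.

\[
P(Y=1|X=x) = \Lambda(\alpha + \beta x), \qquad \Lambda(z) = \frac{e^{z}}{1 + e^{z}}
\]

\[
\frac{\partial}{\partial x} P(Y=1|X=x) = \beta \, \Lambda(\alpha + \beta x)\left[1 - \Lambda(\alpha + \beta x)\right]
\]

Because \(\Lambda(z)\left[1 - \Lambda(z)\right] \leq 1/4\), with equality when \(\Lambda(z) = 1/2\), the marginal effect can never exceed \(\beta/4\) in absolute value. This bound is what the “divide by 4” rule uses as a quick approximation. The marginal effect at the average evaluates the derivative at \(\bar{x}\), while the average partial effect averages the derivative over the sample:

\[
\text{at the average: } \hat{\beta}\,\Lambda(\hat{\alpha} + \hat{\beta}\bar{x})\left[1 - \Lambda(\hat{\alpha} + \hat{\beta}\bar{x})\right], \qquad
\text{average partial effect: } \frac{1}{n}\sum_{i=1}^{n} \hat{\beta}\,\Lambda(\hat{\alpha} + \hat{\beta}x_i)\left[1 - \Lambda(\hat{\alpha} + \hat{\beta}x_i)\right]
\]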
Predictive performance of fit1
1. Add a column called p1 to wells containing the predicted probabilities that switch equals one based on fit1.
2. Add a column called pred1 to wells that gives the predicted values of \(y\) that correspond to p1.
3. Use pred1 to calculate the error rate of the Bayes classifier constructed from fit1 based on the full dataset, i.e. wells. Recall that this classifier uses the predicted probabilities from fit1 in the following way: \(p>0.5 \implies\) predict 1, \(p\leq 0.5\implies\) predict 0. Hint: you can do this using the summarize function from dplyr.
4. Use pred1 to construct the confusion matrix for fit1. Hint: use the function table.
5. Based on your results from part 4, calculate the sensitivity and specificity of the predictions from fit1.
6. Comment on your results from parts 3 and 5. In particular, compare them to the error rate that you would obtain by simply predicting the most common value of switch for every observation in the dataset. (This is called the “null model” since it doesn’t use any predictors.)
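
To make the “null model” benchmark concrete, here is a small base-R sketch. The outcome vector y is hypothetical and stands in for switch; it is not drawn from wells.csv.

```r
# Hypothetical binary outcomes, invented for illustration (not from wells.csv).
y <- c(1, 1, 1, 0, 1, 0, 1, 1, 0, 1)

# The null model predicts the most common value of y for every observation.
# Ties are broken in favor of 1 here, which is an arbitrary choice.
majority <- as.integer(mean(y) >= 0.5)

# Its error rate is the share of observations that differ from that value.
null_error_rate <- mean(y != majority)

cat("majority class:", majority, "\n")
cat("null model error rate:", null_error_rate, "\n")
```

A classifier is only useful for prediction if its error rate improves on this benchmark.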
Additional regressions: fit2, fit3, and fit4
1. Run a logistic regression using larsenic to predict switch and store the results in an object called fit2.
2. Run a logistic regression using zeduc to predict switch and store the results in an object called fit3.
3. Run a logistic regression using dist100, larsenic, and zeduc to predict switch and store the results in an object called fit4.
4. Make a nicely-formatted summary table of the results from fit1, fit2, fit3, and fit4. Make sure to add appropriate labels and captions, use a reasonable number of decimal places, etc.
Interpreting fit2, fit3 and fit4
1. Repeat parts 2 and 3 of the “First Regression: fit1” section above with fit2 in place of fit1.
2. Repeat parts 2 and 3 of the “First Regression: fit1” section above with fit3 in place of fit1.
3. Calculate the marginal effect of each predictor in fit4 for the average household in the dataset. Interpret your results. How do they compare to the “divide by 4” rule?
Predictive Performance of fit4
1. Repeat every part of the “Predictive performance of fit1” section above with fit4, p4, and pred4 in place of fit1, p1, and pred1.
2. Using your results from part 1 and from the “Predictive performance of fit1” section above, compare the in-sample predictive performance of fit1 and fit4.