# Exercise - Contaminated Wells in Bangladesh

This problem uses a dataset containing household-level data from Bangladesh, `wells.csv`

, that you can download from my website: `https://ditraglia.com/data/wells.csv`

. Here is some background on the dataset from Gelman and Hill (2007):

Many of the wells used for drinking water in Bangladesh and other South Asian countries are contaminated with natural arsenic … a research team from the United States and Bangladesh measured all the wells [in a small region] and labeled them with their arsenic level as well as a characterization of “safe” (below 0.5 in units of hundreds of micrograms per liter, the Bangladesh standard for arsenic in drinking water) or “unsafe” (above 0.5). People with unsafe wells were encouraged to switch to nearby private or community wells or to new wells of their own construction. A few years later, the researchers returned to find out who had switched wells.

You task in this problem is to predict which households will switch wells using the following information:

Name | Description |
---|---|

dist | Distance to closest known safe well (meters) |

arsenic | Arsenic level of respondent’s well (100s of micrograms per liter) |

switch | Dummy variable equal to 1 if switched to a new well |

assoc | Dummy variable equal to 1 if any member of the household is active in community organizations |

educ | Education level of head of household (years) |

To be clear, the dataset contains *only* information for households with an arsenic level of 0.5 or above, as these are the households that were encouraged to switch wells. Moreover, the variable `switch`

is measured *after* the variables `dist100`

and `arsenic`

. In other words, `arsenic`

is the arsenic level in the original contaminated well that the household was using *before* we learned whether they switched wells. Similarly, `dist100`

was the distance to the nearest well safe well before we know whether the household switched wells.

To answer this question, you will need three additional pieces of information:

- The
*Bayes classifier*is a rule for predicting whether \(Y=1\) or \(Y=0\) given \(X\) based on the estimated probability \(\hat{P}(Y=1|X)\). If this probability is greater than \(1/2\), the Bayes classifier predicts \(\hat{Y} = 1\); otherwise it predicts \(\hat{Y} = 0\). - The
*error rate*of a classifier is defined as the fraction of incorrect predictions that it makes. - For a binary prediction problem the
*confusion matrix*is a \(2\times 2\) matrix that counts up the number of true positives, \((\hat{Y} = Y = 1)\), true negatives \((\hat{Y} = Y = 0)\), false positives \((\hat{Y} = 1, Y=0)\), and false negatives \((\hat{Y} = 0, Y = 0)\). The confusion matrix can be used, among other things, to calculate the sensitivity and specificity of a classifier.

# Exercises

- Preliminaries:
- Load the data and store it in a tibble called
`wells`

. - Use
`dplyr`

to create a variable called`larsenic`

that equals the natural logarithm of`arsenic`

. - Use
`ggplot2`

to make a histogram of`arsenic`

and`larsenic`

. Be sure to label your plots appropriately. Comment on your findings. - Use
`dplyr`

to create a variable called`dist100`

that contains the same information as`dist`

but measured in*hundreds of meters*rather than in meters. - Use
`dplyr`

to create a variable called`zeduc`

that equals the*z-score*of`educ`

, i.e.`educ`

after subtracting its mean and dividing by its standard deviation.

- Load the data and store it in a tibble called
- First Regression:
`fit1`

- Run a logistic regression using
`dist100`

to predict`switch`

and store the result in an object called`fit1`

. - Use
`ggplot2`

to plot the logistic regression function from part (a) along with the data, jittered appropriately. - Discuss your results from parts (a)-(b). In particular: based on
`fit1`

, is`dist100`

a statistically significant predictor of`switch`

? Does the sign of its coefficient make sense? Explain. - Use
`fit1`

to calculate the predicted probability of switching wells for the*average*household in the dataset. - Use
`fit1`

to calculate the marginal effect of`dist100`

for the*average*household in the dataset. Interpret your result. How does is compare to the “divide by 4” rule and the average partial effect?

- Run a logistic regression using
- Predictive performance of
`fit1`

- Add a column called
`p1`

to`wells`

containing the predicted probabilities that`switch`

equals one based on`fit1`

. - Add a column called
`pred1`

to`wells`

that gives the predicted*values*of \(y\) that correspond to`p1`

. - Use
`pred1`

to calculate the*error rate*of the Bayes classifier constructed from`fit1`

based on the full dataset, i.e.`wells`

. Recall that this classifier uses the predicted probabilities from`fit1`

in the following way: \(p>0.5 \implies\) predict 1, \(p\leq 0.5\implies\) predict 0. Hint: you can do this using the`summarize`

function from`dplyr`

. - Use
`pred1`

to construct the*confusion matrix*for`fit1`

. Hint: use the function`table`

. - Based on your results from (d), calculate the
*sensitivity*and*specificity*of the predictions from`fit1`

. - Comment on your results from (c) and (e). In particular, compare them to the error rate that you would obtain by simply predicting the
*most common*value of`switch`

for every observation in the dataset. (This is called the “null model” since it doesn’t use any predictors.)

- Add a column called
- Additional regressions:
`fit2`

,`fit3`

, and`fit4`

- Run a logistic regression using
`larsenic`

to predict`switch`

and store the results in an object called`fit2`

. - Run a logistic regression using
`zeduc`

to predict`switch`

and store the results in an object called`fit3`

. - Run a logistic regression using
`dist100`

,`larsenic`

, and`zeduc`

to predict`switch`

and store the results in an object called`fit4`

. - Make a nicely-formatted summary table of the results from
`fit1`

,`fit2`

,`fit3`

, and`fit4`

. Make sure to add appropriate labels and captions, use a reasonable number of decimal places, etc.

- Run a logistic regression using
- Interpreting
`fit2`

,`fit3`

and`fit4`

- Repeat parts (b) and (c) of question #2 above with
`fit2`

in place of`fit1`

. - Repeat parts (b) and (c) of question #2 above with
`fit3`

in place of`fit1`

. - Calculate the marginal effect of
*each predictor*in`fit4`

for the*average*household in the dataset. Interpret your results. How do they compare to the “divide by 4” rule?

- Repeat parts (b) and (c) of question #2 above with
- Predictive Performance of
`fit4`

- Repeat question #3 from above with
`fit4`

,`p4`

, and`pred4`

in place of`fit1`

,`p1`

and`pred1`

. - Using your results from (a) and question #3 above, compare the in-sample predictive performance of
`fit1`

and`fit4`

.

- Repeat question #3 from above with