Data Analysis Lecture

BEE - HT 2024

Francis J. DiTraglia

University of Oxford

R Resources

Basics

More Advanced / Cooler

How did I make these slides?

  • I most definitely did not copy-and-paste R code, results and figures into PowerPoint!
  • I used Quarto. I could just as easily have used R Markdown.
  • The first few slides of this slide deck give an overview.
  • Chapter 29 of R for Data Science explains Quarto.
  • It’s totally optional, but you can use these tools to generate your BEE reports if you like.

Analyzing Data from a Simple RCT

The Anchoring Experiment

  • Classic experiment from Kahneman & Tversky; data from my undergraduate class at UPenn
  • Question: does irrelevant information change behavior?
  • Students randomly allocated to “Lo” and “Hi” groups.
    • “Lo” group shown the number 10
    • “Hi” group shown the number 65
  • Asked a pair of questions about African countries in the UN.

Experimental Instructions

We chose (by computer) a random number between 0 and 100. The number selected and assigned to you is written on a slip of paper in front of you. Please do not show your number to anyone else or look at anyone else’s number.

Call your random number X. Do you think that the percentage of countries, among all those in the United Nations, that are in Africa is higher or lower than X?

What is your best estimate of the percentage of countries, among all those that are in the United Nations, that are in Africa?

Read in the data

  • read.csv() to read the file old_survey.csv from my website and store it as an R dataframe called anchoring.
  • names() to list the names of all the columns in anchoring
  • head() to display the first few rows of the dataframe.
  • Each row is a student:
    • rand.num: experimental treatment (10 or 65)
    • africa.higher: answer to first question (“higher or lower than X?”)
    • africa.percent: answer to second question (“percentage of countries” )
    • remaining columns: survey responses, some missing (NA)

Read in the data

anchoring <- read.csv('https://ditraglia.com/data/old_survey.csv')
names(anchoring)
 [1] "sex"            "credits"        "eye.color"      "handedness"    
 [5] "height"         "handspan"       "bias"           "rand.num"      
 [9] "africa.higher"  "africa.percent"
head(anchoring)
     sex credits eye.color handedness height handspan bias rand.num
1   Male       5     Brown        1.0     67     20.0   No       10
2 Female       5     Brown        0.4     63     19.5  Yes       65
3 Female      NA     Brown        0.6     62     19.0   No       65
4 Female       5     Brown        0.6     65     19.5   No       65
5 Female       4     Brown        1.0     62     18.5  Yes       65
6 Female      NA     Brown        1.0     68     18.5  Yes       65
  africa.higher africa.percent
1        Higher             20
2         Lower             25
3         Lower             43
4         Lower             20
5         Lower             26
6         Lower             60

Access a Column with $

anchoring$rand.num
 [1] 10 65 65 65 65 65 65 10 65 65 65 10 65 10 10 65 10 65 10 10 65 10 10 65 10
[26] 10 10 10 65 65 10 65 10 10 65 65 65 65 65 10 65 65 10 65 10 10 65 10 10 65
[51] 65 10 10 65 65 65 10 65 65 10 65 65 10 65 10 10 10 65 65 10 65 10 10 10 65
[76] 10 65 65 10 65 65 10 10 10 10 65 65 10 10

Boxplots

  • boxplot: a graphical depiction of the “five-number summary” of a dataset:
    • Minimum
    • First Quartile (25th percentile)
    • Median (50th percentile)
    • Third Quartile (75th percentile)
    • Maximum
  • box: contains the middle 50% of the data; width of box: “interquartile range”
  • thick line: median; thin lines: 1st and 3rd quartiles
  • whiskers: min and max of the data
  • caveat: any observation more than 1.5 times the width of the box away from the box is considered an outlier by boxplot(), shown as a small circle.
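
To make the five-number summary and the 1.5-times-the-box rule concrete, here is a minimal sketch on a made-up vector (the numbers are hypothetical, not taken from the survey). It uses fivenum(), which computes the "hinges" that boxplot() draws, and boxplot.stats(), which applies the outlier rule. Hinges can differ slightly from the quartiles returned by quantile().

```r
# Hypothetical toy data: seven ordinary values and one suspiciously large one
toy_values <- c(2, 4, 5, 6, 7, 8, 9, 30)

# Five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(toy_values)
five

# Length of the box (distance between the two hinges)
box_length <- five[4] - five[2]

# Anything more than 1.5 box lengths away from the box is flagged
lower_fence <- five[2] - 1.5 * box_length
upper_fence <- five[4] + 1.5 * box_length
flagged <- toy_values[toy_values < lower_fence | toy_values > upper_fence]
flagged

# boxplot.stats() reports the points that boxplot() would draw as circles
boxplot.stats(toy_values)$out
```

When a point is flagged, the whisker stops at the most extreme observation that is not flagged rather than at the true minimum or maximum.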

Boxplot for height

boxplot(anchoring$height, 
        main = 'No outliers for height!', 
        ylab = 'Height (inches)')

Boxplot for handspan

boxplot(anchoring$handspan, 
        main = 'Two outliers for handspan! (Errors?)', 
        ylab = 'Handspan')

Boxplots by Group

  • Compare two boxplots: “Lo” versus “Hi” groups
  • Formula syntax: OUTCOME_VARIABLE_HERE ~ GROUPING_VARIABLE_HERE
  • What do we conclude?
boxplot(anchoring$africa.percent ~ anchoring$rand.num, 
        main = 'Boxplot for Anchoring Experiment', 
        ylab = 'Answer (% UN Countries from Africa)', 
        xlab = 'Random Number')

Summary Statistics

  • mean() to compute the sample mean of a vector
  • median() to compute the sample median
  • sd() to compute the sample standard deviation
  • var() to compute the sample variance
  • IQR() to compute the interquartile range
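
A quick sketch of these five functions on a hypothetical toy vector (not the survey data), with the variance also computed by hand for comparison:

```r
# Hypothetical toy vector
toy_x <- c(1, 2, 3, 4, 10)

mean(toy_x)    # sample mean
median(toy_x)  # sample median
var(toy_x)     # sample variance (divides by n - 1)
sd(toy_x)      # sample standard deviation: the square root of var(toy_x)
IQR(toy_x)     # third quartile minus first quartile

# The sample variance computed "by hand"
var_by_hand <- sum((toy_x - mean(toy_x))^2) / (length(toy_x) - 1)
var_by_hand
```

The single large value pulls the mean above the median, while the median and IQR are unaffected by how extreme that value is.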

Missing Values

  • NA means “I don’t know”
  • Some students only completed part of the survey: their heights are NA (missing)
  • The mean of a vector with missing values is NA
  • Setting na.rm = TRUE drops missing values before computing the mean
mean(anchoring$height)
[1] NA
var(anchoring$height)
[1] NA
mean(anchoring$height, na.rm = TRUE)
[1] 67.54545
var(anchoring$height, na.rm = TRUE)
[1] 19.74504
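
Before dropping missing values it can help to count them. The sketch below uses is.na() on a hypothetical vector of heights (made up for illustration) and shows that removing the NAs by hand gives the same mean as na.rm = TRUE.

```r
# Hypothetical heights with two missing responses
toy_heights <- c(67, NA, 62, 65, NA, 68)

is.na(toy_heights)        # TRUE where the value is missing
sum(is.na(toy_heights))   # number of missing values
mean(is.na(toy_heights))  # share of missing values

# Dropping the NAs by hand matches na.rm = TRUE
mean_by_hand <- mean(toy_heights[!is.na(toy_heights)])
mean_na_rm <- mean(toy_heights, na.rm = TRUE)
c(mean_by_hand, mean_na_rm)
```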

Testing for Equality

The double equals operator == returns TRUE if both sides are equal to each other and FALSE otherwise.

0.5 == 1/2
[1] TRUE
'cat' == 'dog'
[1] FALSE
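
Two further points about == that are worth a short sketch: it works element by element on vectors, and it can surprise with decimal fractions that the computer stores only approximately. For numerical comparisons, all.equal() allows a small tolerance.

```r
# == compares element by element when given a vector
c(10, 65, 65) == 65

# 0.1, 0.2 and 0.3 are stored approximately, so this comparison is FALSE
0.1 + 0.2 == 0.3

# all.equal() compares up to a small numerical tolerance
isTRUE(all.equal(0.1 + 0.2, 0.3))
```

The earlier comparison 0.5 == 1/2 is safe because 0.5 can be stored exactly.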

Select Elements with TRUE/FALSE

  • Use c() to create a vector of elements
  • Use brackets [] to select elements of a vector
x <- c('BEE', 'is', 'the', 'best')
some_words <- c(TRUE, FALSE, FALSE, TRUE) 
x[some_words]
[1] "BEE"  "best"

Comparing Means Across Groups

Calculate the difference of mean answers: “Hi” minus “Lo”

Hi <- anchoring$rand.num == 65 # TRUE if Hi group, FALSE otherwise
mean_Hi <- mean(anchoring$africa.percent[Hi]) 

Lo <- anchoring$rand.num == 10 # TRUE if Lo group, FALSE otherwise
mean_Lo <- mean(anchoring$africa.percent[Lo]) 

mean_Hi - mean_Lo
[1] 13.62437
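
As an alternative sketch, tapply() computes the mean within each group in one step. The data frame below is hypothetical toy data with the same column names as anchoring, so the numbers are for illustration only.

```r
# Hypothetical toy data shaped like the anchoring columns
toy_anchoring <- data.frame(
  rand.num       = c(10, 65, 10, 65, 10, 65),
  africa.percent = c(12, 30, 20, 45, 16, 33)
)

# Same approach as above: logical vectors pick out each group
toy_Hi <- toy_anchoring$rand.num == 65
toy_Lo <- toy_anchoring$rand.num == 10
diff_by_hand <- mean(toy_anchoring$africa.percent[toy_Hi]) -
  mean(toy_anchoring$africa.percent[toy_Lo])

# tapply() applies mean() within each value of the grouping variable
group_means <- tapply(toy_anchoring$africa.percent, toy_anchoring$rand.num, mean)
group_means

# The result is named by group, so the groups can be looked up by name
diff_tapply <- unname(group_means['65'] - group_means['10'])
c(diff_by_hand, diff_tapply)
```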

A Two-sample t-Test

  • t.test() to build a confidence interval or carry out a t-test
  • For a two-sample test provide two vectors x and y with data for the two groups.
  • Defaults to 95% CI and does not assume equal variances across groups
  • For more information: ?t.test()
  • What do we conclude?
t.test(x = anchoring$africa.percent[Hi], 
       y = anchoring$africa.percent[Lo])

    Welch Two Sample t-test

data:  anchoring$africa.percent[Hi] and anchoring$africa.percent[Lo]
t = 4.9704, df = 73.27, p-value = 4.252e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  8.161655 19.087081
sample estimates:
mean of x mean of y 
 30.71739  17.09302 
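
To see what t.test() does under the hood, here is a sketch that computes the Welch statistic, the approximate degrees of freedom, the p-value and the confidence interval by hand and compares them with t.test(). The two samples are hypothetical toy data, not the anchoring responses.

```r
# Hypothetical toy samples
toy_hi <- c(30, 45, 33, 28, 40, 25)
toy_lo <- c(12, 20, 16, 22, 15)

# Each group's contribution to the variance of the difference in means
v_hi <- var(toy_hi) / length(toy_hi)
v_lo <- var(toy_lo) / length(toy_lo)

# Standard error and t statistic, without assuming equal variances
se_diff <- sqrt(v_hi + v_lo)
t_stat <- (mean(toy_hi) - mean(toy_lo)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_hi + v_lo)^2 /
  (v_hi^2 / (length(toy_hi) - 1) + v_lo^2 / (length(toy_lo) - 1))

# Two-sided p-value and 95% confidence interval
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean(toy_hi) - mean(toy_lo)) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

# t.test() should report the same four quantities
welch <- t.test(x = toy_hi, y = toy_lo)
c(t_stat, df_welch, p_value)
ci
welch
```

This is why the degrees of freedom in the output above (73.27) are not a whole number.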

Testing a Theoretical Model

The Common Value Auction

  • Two players, each shown a private signal: \(V_i \sim \text{Unif}(0, L)\).
  • Each bids for a cash prize with value \(\bar{V} \equiv (V_1 + V_2)/2\).
  • Highest bidder wins; receives \(\bar{V}\) minus their bid
  • Nash Equilibrium bid function (risk neutral): \(V_i / 2\).
  • Are experimental participants subject to a “winner’s curse”? In other words, do they systematically overbid?
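
A small simulation can illustrate the setup. The sketch below draws signals, computes the prize, and compares the winner's payoff under the Nash bid \(V_i / 2\) with the payoff under a "naive" rule. The naive rule is my assumption for illustration only: each player bids the expected prize given their own signal, \(V_i/2 + L/4\), ignoring what winning reveals about the other signal. The seed and number of simulations are arbitrary.

```r
set.seed(1234)     # arbitrary seed, for reproducibility
n_sims <- 100000   # number of simulated auctions
L <- 10            # upper bound of the signal distribution

# Private signals and the common value of the prize
V1 <- runif(n_sims, min = 0, max = L)
V2 <- runif(n_sims, min = 0, max = L)
V_bar <- (V1 + V2) / 2

# Nash bids: half of one's own signal, so the higher signal wins
nash_winning_bid <- pmax(V1, V2) / 2
nash_payoff <- V_bar - nash_winning_bid

# ASSUMPTION (illustration only): naive bid = expected prize given own signal
naive_winning_bid <- pmax(V1, V2) / 2 + L / 4
naive_payoff <- V_bar - naive_winning_bid

min(nash_payoff)    # smallest payoff of a Nash winner
mean(naive_payoff)  # average payoff of a naive winner
```

Under the Nash bids the winner's payoff equals half of the lower signal, so it is never negative. Under the naive rule the winner loses money on average, which is the overbidding pattern the question above asks about.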

Loading Data from Excel

  • Sometimes you have to read data from Excel (SHUDDER)
  • Rather than converting to .csv in Excel, load the raw .xls or .xlsx file into R with the readxl package.
  • Install (first time only): install.packages('readxl')
  • Load (every time): library(readxl)

Download the Data

library(readxl)
setwd('~/Dropbox/oxford-tutorials/BEE-MT-2023/data-analysis-lecture/')
common_value <- read_excel('common-value-auction.xlsx')

Experimental Data

  • Session: experimental session (1-8)
  • Month: calendar month (D for December, J for January)
  • Range: the value of \(L\) for the \(\text{Uniform}(0,L)\) signals (5 or 10)
  • Round: experimental round (only round 1 appears in this dataset)
  • ID_self: participant ID (unique within a given session)
  • ID_partner: ID of partner for the common value auction
  • Signal: the \(\text{Uniform}(0,L)\) private signal
  • Bid: the player’s bid
head(common_value)
# A tibble: 6 × 8
  Session Month Range Round ID_self ID_partner Signal   Bid
    <dbl> <chr> <dbl> <dbl>   <dbl>      <dbl>  <dbl> <dbl>
1       1 D        10     1       1         10   8.97  4   
2       1 D        10     1       2          3   6.27  3   
3       1 D        10     1       3          2   2.06  2   
4       1 D        10     1       4          7   0.39  0.6 
5       1 D        10     1       5          9   3.15  1.02
6       1 D        10     1       6          8   0.58  1   

Note

  • Each student participated in multiple experimental rounds
  • Random (re-)matching of partners each round
  • I keep only round 1 data to exclude “learning” or “feedback”

Plotting a Histogram with hist()

  • Make histograms for sessions with \(L = 10\) and \(L = 5\)
  • Notice anything strange?
L5 <- common_value$Range == 5
L10 <- common_value$Range == 10
par(mfrow = c(1, 2)) # put the plots side-by-side
hist(common_value$Signal[L5], xlab = 'Signal', main = 'L = 5')
hist(common_value$Signal[L10], xlab = 'Signal', main = 'L = 10')
par(mfrow = c(1, 1)) # go back to the default plotting layout 

Selecting Rows from a Dataframe

Something weird is going on here. Let’s take a closer look:

weird <- (common_value$Range == 5) & (common_value$Signal > 5)
common_value[weird, ]
# A tibble: 5 × 8
  Session Month Range Round ID_self ID_partner Signal   Bid
    <dbl> <chr> <dbl> <dbl>   <dbl>      <dbl>  <dbl> <dbl>
1       7 J         5     1       1         10   5.48  5.02
2       7 J         5     1       2          7   5.24  3   
3       7 J         5     1       3          8   8.53  4.5 
4       7 J         5     1       4          9   7.94  8   
5       7 J         5     1      10          3   9.05  4.5 

Dropping Rows from a Dataframe

  • Something is wrong with a few signals from session 7.
  • To be on the safe side, I will drop that whole session.
  • (You would need to explain this in your report!)
session7 <- common_value$Session == 7
common_value_cleaned <- common_value[!session7,]
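
The same selecting and dropping pattern on a hypothetical toy data frame (made up for illustration, not the auction data), where the effect of each step is easy to check by eye:

```r
# Hypothetical toy data frame
toy_auction <- data.frame(
  Session = c(1, 1, 7, 7, 8),
  Range   = c(10, 10, 5, 5, 5),
  Signal  = c(8.9, 2.1, 5.5, 3.0, 4.2)
)

# Rows go before the comma, columns after; an empty column slot keeps all columns
suspicious <- (toy_auction$Range == 5) & (toy_auction$Signal > 5)
toy_auction[suspicious, ]

# ! flips TRUE and FALSE, so this keeps every row that is NOT from session 7
toy_cleaned <- toy_auction[!(toy_auction$Session == 7), ]
nrow(toy_auction)  # rows before dropping
nrow(toy_cleaned)  # rows after dropping

# subset() is another way to write the same filter
subset(toy_auction, Session != 7)
```

Comparing nrow() before and after is a cheap check that the number of dropped rows is what you expected.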

Making a Scatterplot

  • plot(y ~ x) makes a scatterplot of y against x
  • Plot bids against signals; each point is an experimental participant
  • abline(a = INTERCEPT, b = SLOPE) adds a line to an existing plot
  • Overlay the line \(\text{Bid} = \text{Signal} / 2\). What do we observe?
plot(Bid ~ Signal, data = common_value_cleaned,  
     main = 'Common Value Auction: Nash = Dashed Red')
abline(a = 0, b = 0.5, lwd = 2, lty = 2, col = 'red')

Running a Regression

  • lm(y ~ x, data = [DATAFRAME_HERE]) runs the regression
  • summary([REGRESSION_HERE]) displays the results
regression_results <- lm(Bid ~ Signal, data = common_value_cleaned)
summary(regression_results)

Call:
lm(formula = Bid ~ Signal, data = common_value_cleaned)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3743 -0.8351 -0.1894  0.6853  3.7839 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.11375    0.18797   5.925 2.87e-08 ***
Signal       0.48724    0.04224  11.534  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.184 on 124 degrees of freedom
Multiple R-squared:  0.5176,    Adjusted R-squared:  0.5137 
F-statistic:   133 on 1 and 124 DF,  p-value: < 2.2e-16
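
The t value that summary() prints for Signal tests whether the slope is zero, which is not the Nash prediction. As a sketch, the slope can be compared with 0.5 directly using the estimate and standard error copied from the output above:

```r
# Numbers copied from the summary() output above
slope_hat <- 0.48724
slope_se  <- 0.04224
df_resid  <- 124

# t statistic and two-sided p-value for H0: slope = 0.5
t_nash <- (slope_hat - 0.5) / slope_se
p_nash <- 2 * pt(-abs(t_nash), df = df_resid)
c(t_nash, p_nash)

# Approximate 95% confidence interval for the slope
slope_ci <- slope_hat + c(-1, 1) * qt(0.975, df = df_resid) * slope_se
slope_ci
```

With the fitted object in hand, coef(summary(regression_results)) and confint(regression_results) provide the same ingredients without retyping numbers. Note that this looks at the slope alone; the intercept estimate is a separate matter.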

Regression Line vs Nash Equilibrium

plot(Bid ~ Signal, data = common_value_cleaned)
abline(a = 1.114, b = 0.487)  # fitted regression line (rounded estimates)
abline(a = 0, b = 0.5, lwd = 2, lty = 2, col = 'red')

If you insist: F-test of Nash Prediction

  • I prefer to look at estimates and standard errors…
  • Confused? Try this blog post
library(car)
linearHypothesis(regression_results, 
                 c('Signal = 0.5', '(Intercept) = 0'))
Linear hypothesis test

Hypothesis:
Signal = 0.5
(Intercept) = 0

Model 1: restricted model
Model 2: Bid ~ Signal

  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    126 317.21                                  
2    124 173.71  2    143.51 51.222 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
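
The F statistic in this table can be reproduced by hand from the residual sums of squares, which may help demystify it. The sketch below uses only numbers copied from the table above; small discrepancies come from rounding in the printed output.

```r
# Numbers copied from the linearHypothesis() table above
rss_restricted   <- 317.21  # model with Bid = Signal / 2 imposed
rss_unrestricted <- 173.71  # model with Bid ~ Signal estimated freely
sum_of_sq        <- 143.51  # reported increase in RSS from imposing the restrictions
n_restrictions   <- 2       # slope = 0.5 and intercept = 0
df_resid         <- 124     # residual degrees of freedom, unrestricted model

# The difference in RSS matches the reported Sum of Sq up to rounding
rss_restricted - rss_unrestricted

# F = (increase in RSS per restriction) / (unrestricted RSS per degree of freedom)
F_stat <- (sum_of_sq / n_restrictions) / (rss_unrestricted / df_resid)
F_stat

# p-value from the F distribution with (2, 124) degrees of freedom
p_F <- pf(F_stat, df1 = n_restrictions, df2 = df_resid, lower.tail = FALSE)
p_F
```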