All About That Base (R)

A Crash Course in R Programming

Francis J. DiTraglia

University of Oxford

Introduction

What is this?

  • Four 80 minute sessions on base R programming.
  • Base R means no bells or whistles: no packages 🔕
  • Start by doing things “the hard way” 🔩 🔨 😫
  • (We have calculators, but kids still need to learn addition.)

⚠️ Health Warning

  • I assume most of you have some programming knowledge
  • E.g. you understand variables, if-then, for/while etc.
  • I do not assume prior knowledge of R.
  • This means I move quickly / focus on how things work in R.
  • If your programming background is weak/rusty, you can catch up but need to put in the hours right now.

I can’t cover everything

Part 1 - Getting Started with RStudio

What you’ll see when you load a new project on posit cloud. There are three “panes”: the console (left), environment (top right), and file browser (bottom right)

Type commands at the console and hit return to run them. Objects appear in the environment pane. Here we have one object x and it equals 3.

Commands typed directly into the console are not reproducible, so we’ll store and run our R commands from a script: select File > New > R Script. Alternatively: Ctrl+Shift+N

A new pane has appeared: the editor pane. This is a text editor where we can enter and save R commands. (Yes, there are vim and emacs keybindings!)

A text file with a .R extension containing R commands is called an R script. Make sure to save your work! Your file will appear in the file browser.

Put your cursor on the line you want to run then type Ctrl-Return (Command-Return on Mac) to run it. Alternatively click “Run” or “Source” to run the whole script.

Getting weird results? Clear your environment by entering rm(list = ls()) or clicking on the broom icon on the environment pane. Then re-run your script from the top.

Search the help files using ? or the help tab in the bottom right pane. Be warned: the help files aren’t always user-friendly 😵

Never underestimate the power of googling the error message!

DuckDuckGo is the connoisseur’s choice 🍷

If you’re not already, you will soon become very familiar with StackOverflow. Limit your results to R by starting your search string with [r]

chatGPT is incredibly good at programming but occasionally gives you 💩 so be vigilant!

From now on

  • I will present R code examples and output on the slides.
  • Type the code into your R script for today’s lecture, run it and check that you get the same output. (Save your work!)
  • We will pause periodically so you can work on an exercise and then discuss our solutions as a class.
  • Raise your hand if you need help.
greeting <- 'Hello core ERM!'
print(greeting)
[1] "Hello core ERM!"

Part 2 - R Basics

A matter of style 😎

  • Good programming style is like good spelling and grammar: we want to make it easy for others to understand us.
  • When programming in R, I follow the tidyverse style guide and you should too: this is one of the marking criteria.
  • Follow my lead when it comes to spacing, indentation, etc. You’ll get the hang of it in no time.

Naming Objects

  • Can contain letters, digits, dot ., and underscore _
  • Can only start with a letter or the dot .
  • Don’t use a name that’s already an R command!
  • Tidyverse Style Guide
    • Use only lowercase letters, numbers, and _
    • snake_case using to separate words
    • Keep it concise; make it meaningful

Comments

  • Any line of R code that begins with # is a comment
  • Comments are for humans not the computer!
  • Tidyverse style guide
    • Comment lines should begin with # and then a space
    • Capitalize. Put a period if at least two sentences.
    • Explain the “why” rather than the “what” or “how”
    • Too many comments \(\implies\) unclear code; try again!

What does ⬅️ mean?

  • R assignment operator: less than sign followed by a hyphen.
  • x <- 3 creates an object named x with the value 3
  • RStudio Shortcut: ALT “minus”
  • Advanced R: Names and Values for some subtleties.
  • Pronounced gets as in “x gets 3”.
  • Single equals sign (=) also works but it’s bad style.

The Assignment Operator

x <- 3
x
[1] 3
# Bad style, but it works
y = 5
y
[1] 5
# Also bad style, but this works too! 
2 -> z
z
[1] 2

R as an overgrown calculator

  • Basic arithmetic with +, -, *, / and parentheses
(3 + 5) * 3 - (2 / 7)
[1] 23.71429
  • ⚠️ Implied multiplication doesn’t work; * is mandatory
x <- 3
2x
Error: <text>:2:2: unexpected symbol
1: x <- 3
2: 2x
    ^
2 * x
[1] 6

R as an overgrown calculator

  • Exponentiation with ^ and \(e^x\) with exp(x)
2^8
[1] 256
exp(1)
[1] 2.718282
  • \(\log_b a\) with log(a, base = b), defaults to natural log
log(100, base = 10)
[1] 2
log(100)
[1] 4.60517

Data Types in R

  • Numeric Types
  • Character: a chunk of text within single/double quotes
    • x <- 'hello world!' or y <- "ERM rocks!"
  • Logical: TRUE or FALSE
  • typeof() tells you the type

Examples with Types

x <- 3.14159
typeof(x)
[1] "double"
y <- 10L
z <- 10
typeof(y)
[1] "integer"
typeof(z)
[1] "double"
economist_buried_nearby <- 'Francis Ysidro Edgeworth'
typeof(economist_buried_nearby)
[1] "character"
typeof(TRUE)
[1] "logical"

Logical Operators

  • NOT: ! turns a TRUE into a FALSE and vice-versa
  • AND: x & y is TRUE iff both x and y are TRUE
  • OR: x | y is TRUE iff at least one of x and y is TRUE
  • XOR: xor(x, y) is TRUE iff exactly one of x and y is TRUE

Logical Operators: Examples

!TRUE
[1] FALSE
TRUE & FALSE
[1] FALSE
TRUE & TRUE
[1] TRUE
TRUE | FALSE
[1] TRUE
TRUE | TRUE
[1] TRUE
xor(TRUE, FALSE)
[1] TRUE
xor(TRUE, TRUE)
[1] FALSE

Relational Operators

  • These operators return TRUE or FALSE
  • <, >, <=, and >= all mean what you think they do
  • == tests for equality; don’t confuse it with =
  • != tests for lack of equality

Relational Operators: Examples

5 > 3
[1] TRUE
2 < 2
[1] FALSE
7 <= 7
[1] TRUE
9 >= 9
[1] TRUE
-1 == -1
[1] TRUE
5 == 2
[1] FALSE
0 != 0
[1] FALSE
75 != 42
[1] TRUE

⚠️ == and != can be Dangerous

  • If you need a single TRUE / FALSE value use identical()
  • Testing equality of doubles can give unexpected results.
  • To test for near equality use all.equal()

NAs and Infs and NaNs

  • NA means not available / missing
  • Inf means infinity and -Inf means minus infinity
  • NaN means not a number
  • You can do math with all of these in R

NAs, Infs, and NaNs: Examples

NA + 1 # NAs are *contagious*
[1] NA
2 * Inf
[1] Inf
-1 / 0
[1] -Inf
1 / Inf
[1] 0
0 / 0
[1] NaN

Testing the type of an object

is.integer(3L)
[1] TRUE
is.integer(3)
[1] FALSE
is.na(4)
[1] FALSE
is.na(NA)
[1] TRUE
is.character('Hi mom!')
[1] TRUE
is.character(NA)
[1] FALSE
is.double(3)
[1] TRUE
is.double(3L)
[1] FALSE

💪 Exercise A - (5 min)

  1. Why does this code throw an error? Try to fix it.
x <- 3; x > 2 & < 9
  1. Does (NA & TRUE) equal (NA | TRUE)? Explain.
  2. Does (Inf - Inf) equal (Inf - 1)? Explain.
  3. Run the following. What happens? (further reading)
y <- (1 - 0.8); z <- 0.2
y == z; y < z; all.equal(y, z); identical(y, z)
  1. Why do I use double quotes here?
important_message <- "The harder you try, the more you'll learn."

Part 3 - Atomic Vectors

R’s Simplest Data Structure

  • Atomic vector: one-dimensional, stores data of a single type
  • Futher reading: Chapter 3 - “Vectors” from Advanced R
  • Key types: double, integer, character, logical
  • c() “concatenate” to create an atomic vector
  • typeof() to find out the type of an atomic vector
  • length() to find the length of an atomic vector

Atomic Vectors: Some Examples

x <- c(2.3, -1.7, 5, 0, 6.2, 7)
y <- c('core ERM', 'is', 'the best course', 'in history') 
z <- c(TRUE, FALSE, TRUE) 

length(x)
[1] 6
length(y)
[1] 4
length(z)
[1] 3
typeof(x)
[1] "double"
typeof(y)
[1] "character"
typeof(z)
[1] "logical"

💪 Exercise B - (1 minute)

Predict the result that you will obtain if you use typeof() to find the type of each of the following atomic vectors. Then check to see if you were right!

foo <- c('1', '2', '3')
bar <- c('TRUE', 'FALSE')

Subsetting Atomic Vectors

  • Use brackets [] to access elements of an atomic vector.
  • Various ways to subset:
    • With an index x[2] or vector of them x[c(2, 5, 7)]
    • With a logical vector of the same length as x
  • Negative indices remove elements x[-1]
  • Repeated indices are allowed x[c(2, 2)]

Warning

R (like Julia) indexes from one, unlike Python and C/C++ which index from zero.

Subsetting Atomic Vectors: Example

y <- c('Keble', 'LMH', 'Univ', 'Merton') 
y[1]
[1] "Keble"
y[c(1, 4)]
[1] "Keble"  "Merton"
y[c(FALSE, TRUE, FALSE, TRUE)]
[1] "LMH"    "Merton"
y[-c(1, 4)]
[1] "LMH"  "Univ"
y[-3]
[1] "Keble"  "LMH"    "Merton"
y[c(-1, -4)]
[1] "LMH"  "Univ"
y[c(1, 2, 1, 2)]
[1] "Keble" "LMH"   "Keble" "LMH"  

R has no scalars; only 1-vectors

Did you notice the [1] that keeps appearing everywhere?

2^8
[1] 256

w is a vector; [1] denotes its first (and only) element.

w <- 2^8
length(w)
[1] 1

Here the first element [1] is 30 and the 26th [26] is 55:

r <- 30:70 # shorthand for the vector c(30, 31, ..., 69, 70)
r
 [1] 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[26] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70

💪 Exercise C - (5 minutes)

y <- c('Keble', 'LMH', 'Univ', 'Merton')
  1. Enter the command y[5]. What result do you get? Why?
  2. I want to extract the second and fourth elements of y so I enter y[2,4]. What happens? Can you fix it? How?
  3. Select 'Keble' and 'Univ' two different ways.
  4. Below is a vector of sales in $ over several months. Using [], length() and -, compute the monthly growth rates in %
sales <- c(100, 120, 90, 110, 105, 130, 140, 135, 125, 145, 150, 160)

Doing math with vectors

Mathematical operations in R are vectorized and operate element by element:

x <- 5:9 # shorthand for c(5, 6, 7, 8, 9)
y <- 0:4 # shorthand for c(0, 1, 2, 3, 4)
x + y
[1]  5  7  9 11 13
x - y
[1] 5 5 5 5 5
x * y
[1]  0  6 14 24 36
y / x
[1] 0.0000000 0.1666667 0.2857143 0.3750000 0.4444444
x^y
[1]    1    6   49  512 6561

Some vectorized functions

Nearly all R Functions are vectorized: they accept vector input

x <- -2:2  # shorthand for c(-2, -1, 0, 1, 2)
sum(x) # sum
[1] 0
prod(x) # product
[1] 0
abs(x) # absolute value
[1] 2 1 0 1 2
sqrt(x^2) # (positive) square root
[1] 2 1 0 1 2
min(x) # minimum
[1] -2
max(x) # maximum 
[1] 2

Relational / Logical Operators are Vectorized

x <- c(1, 2, 3, 4)
y <- c(-1, 3, 2, 6)
x > y
[1]  TRUE FALSE  TRUE FALSE
z <- c(TRUE, FALSE, TRUE)
w <- c(FALSE, TRUE, TRUE)
z | w
[1] TRUE TRUE TRUE
z & w
[1] FALSE FALSE  TRUE
any(x > y) # checks if *any* element is true
[1] TRUE
all(x > y) # checks if *all* elements are true
[1] FALSE

♻️ Recycling

Allows operations with vectors of different lengths, e.g. “scalars” with vectors:

x <- -2:2 # shorthand for c(-2, -1, 0, 1, 2)
x + 3
[1] 1 2 3 4 5
x - 1
[1] -3 -2 -1  0  1
2 * x
[1] -4 -2  0  2  4
x / 2
[1] -1.0 -0.5  0.0  0.5  1.0
x^3
[1] -8 -1  0  1  8

♻️ Recycling Rules

  1. If any vector has length zero, the result will too.
x <- numeric(length = 0) # creates a zero length numeric vector
x
numeric(0)
length(x)
[1] 0
x + -2:2
numeric(0)
  1. Length of result equals length of the longer vector.
c(1, 2) + c(5, 6, 7, 8) # equivalent to c(1, 2, 1, 2) + c(5, 6, 7, 8)
[1]  6  8  8 10

⚠️ Recycling Can Be Dangerous

  • R warns you if the length of the longer vector isn’t a multiple of the length of the shorter one:
c(1, 2, 3) + c(5, 6, 7, 8) # equiv. to c(1, 2, 3, 1) + c(5, 6, 7, 8)
Warning in c(1, 2, 3) + c(5, 6, 7, 8): longer object length is not a multiple
of shorter object length
[1]  6  8 10  9
  • Advice: avoid recycling except for “scalar with vectors” case unless you really know what you are doing.

💪 Exercise D - (3 min)

The probability mass function of a Binomial\((n, p)\) random variable is given by \[ \mathbb{P}(X=x) = \binom{n}{x} p^x (1 - p)^{n-x} \] Use vectorized mathematical operations and the choose() function to calculate the pmf of a Binomial\((5, 0.3)\) random variable in one fell swoop.

Replacing elements of a vector

  • LHS: describe values you want to modify
  • RHS: use <- to overwrite
x <- c(1, 3, 7, 8, -4)
x
[1]  1  3  7  8 -4
x[4] <- 5
x
[1]  1  3  7  5 -4
x[x < 0] <- NA
x
[1]  1  3  7  5 NA
x[c(1, 3)] <- c(-4, 7)
x
[1] -4  3  7  5 NA

Naming elements of a vector

Method 1: create first, then use names()

Frank <- c(1983, 39, 1)
names(Frank) <- c('birth year', 'age', '#siblings')
Frank
birth year        age  #siblings 
      1983         39          1 

Method 2: name when creating

Frank <- c('birth year' = 1983, 'age' = 39, '#siblings' = 1)
Frank
birth year        age  #siblings 
      1983         39          1 

You can rename with names():

Frank[3] <- 0
names(Frank)[3] <- '#Nobel Prizes'

Accessing Elements by Name

Frank <- c('birth year' = 1983, 'age' = 39, '#siblings' = 1)
Frank['age']
age 
 39 
Frank['birth year']
birth year 
      1983 
Frank[c('age', 'age', 'birth year')] 
       age        age birth year 
        39         39       1983 
Frank[c('salary', 'zodiac sign')]
<NA> <NA> 
  NA   NA 

Set Operations

  • NULL is the empty set. You can assign it, e.g. x <- NULL
  • union(A, B) \(\equiv A \cup B\)
  • intersect(A, B) \(\equiv A\cap B\)
  • setdiff(A, B) \(\equiv A \setminus B \equiv A - B \equiv A \cap B^{c}\)
  • setequal(A, B) is TRUE iff \(A \subseteq B\) and \(B \subseteq A\)
  • A %in% B returns a vector of length(A) with TRUE for each element of A that is contained in B, FALSE otherwise

Set Operations: Examples

A <- 1:5; B <- 7:3; C <- NULL
union(A, B)
[1] 1 2 3 4 5 7 6
union(A, C) 
[1] 1 2 3 4 5
intersect(A, B)
[1] 3 4 5
intersect(A, C)
NULL
setdiff(A, B)
[1] 1 2
setequal(A, B)
[1] FALSE
A %in% B
[1] FALSE FALSE  TRUE  TRUE  TRUE

Coercion

  • Remember: an atomic vector stores a single type
  • If you try to put in multiple types, R converts: coercion
  • Precedence: character > numeric > logical
  • Highest precedence type “wins”; everything else coerced:
x <- c(1, TRUE, '1')
x
[1] "1"    "TRUE" "1"   

Note

To coerce manually: as.character(), as.numeric(), as.logical()

Examples of Coercion

y <- c(1, TRUE, 5, 0)
y
[1] 1 1 5 0
as.character(1:5)
[1] "1" "2" "3" "4" "5"
as.numeric(c('1', '2', '3'))
[1] 1 2 3
# Handy for mathematics:
x <- c(TRUE, FALSE, TRUE)
sum(x)
[1] 2
prod(x)
[1] 0

💪 Exercise E - (5 min)

  1. Replace all of the 999s in this vector with NAs
x <- c(5, 10, 3, 7, 999, 2, 999, 17, 0)
  1. In a deck of Italian playing cards, the face cards are fante (Knave), cavallo (Knight), and re (King). In the game Scopa, fante is worth 8, cavallo 9, and re 10. Convert cards to the appropriate numeric values.
cards <- c('re', 'cavallo', 're', 'fante', 'cavallo', 'fante', 're')
  1. This code throws an error. Coerce y to make it work.
y <- c('1', '2', '3')
sum(y)
  1. What happens if you run as.logical(-2:2)? Can you figure out the coercion rule for numeric to logical?

Part 3 - Functions and Control Flow

📚 References

Start with Hands-On Programming with R. For more:

Functions: What, Why and When?

  • Function: a chunk of code that encapsulates a particular task, so you never have to program it “from scratch” again!
  • Abstraction: the user (including the future you) only needs to know the inputs and outputs, not how the function works.
  • Advantages:
    • Increased productivity: don’t repeat yourself! Automate!
    • Fewer copy-paste errors, code that is easier to understand

A good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice. – Hadley Wickham

Example: Computing a z-score

  • Recall the definition of a sample z-score: \[ z_i \equiv \frac{x_i - \bar{x}}{s}, \quad \bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i, \quad s \equiv \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}} \]
  • R has a function called scale() to compute z-scores.
  • But suppose it didn’t have such a function!
  • We compute z-scores all the time.
  • Avoid doing it from scratch constantly

Function Ingredients: z-score

z_score <- function(x) {
# Center and standardize a numeric vector x, returns z-scores
  (x - mean(x)) / sd(x)
}
  1. Give it a meaningful name: z_score <-
    • “Hey R, I’m creating something called z_score.”
  2. Specify arguments: function(x)
    • “And this something is a function with one input: x.”
  3. Enclose function body in curly brackets: { ... }
    • Notice the space after function() and the linebreaks.
    • Whitespace in R is not syntactic, but this is good style.

Function Ingredients: z-score

z_score <- function(x) {
# Center and standardize a numeric vector x, returns z-scores
  (x - mean(x)) / sd(x)
}
  1. Write the function body
    • This is the code that actually does the work
    • Good to add comment explaining purpose, inputs, output
  2. By default, R returns the last evaluated expression
    • return() is bad style; reserve for “early returns”

Using the z_score() function

z_score <- function(x) {
# Center and standardize a numeric vector x, returns z-scores
  (x - mean(x)) / sd(x)
}

example_data <- c(-2, 6, 3, -1, 7, 8, 0, 4, 3, -5)
z <- z_score(example_data)
z
 [1] -1.0195160  0.8772580  0.1659677 -0.7824193  1.1143547  1.3514514
 [7] -0.5453225  0.4030645  0.1659677 -1.7308062
  • Function Defn: z_score <- function(x) { ... }
  • Function Call: z_score(example_data)

\() is Shorthand for function()

  • R Version 4.1 introduced \() shorthand for function()
z_score <- \(x) {
  (x - mean(x)) / sd(x)
}

z_score(example_data)
 [1] -1.0195160  0.8772580  0.1659677 -0.7824193  1.1143547  1.3514514
 [7] -0.5453225  0.4030645  0.1659677 -1.7308062
  • Convenient when creating anonymous functions, i.e. functions with no name. (More in a later lecture)

A Function with Multiple Arguments

The \(k\)th raw moment of a random variable is \(\mathbb{E}[X^k]\). The sample analogue is \(\frac{1}{n} \sum_{i=1}^n x_i^k\).

raw_moment <- function(x, k) {
  mean(x^k)
}

example_data <- c(-2, 6, 3, -1, 7, 8, 0, 4, 3, -5)

raw_moment(example_data, 1)
[1] 2.3
raw_moment(example_data, 2)
[1] 21.3
raw_moment(example_data, 3)
[1] 105.5

Matching Args by Name vs Position

raw_moment(example_data, 1)
[1] 2.3
raw_moment(x = example_data, k = 1)
[1] 2.3
raw_moment(k = 1, x = example_data)
[1] 2.3
raw_moment(1, example_data) # Oops!
[1] 1

💪 Exercise F - (10 min)

  1. Call z_score(w) where w <- c(1, 2, NA). What happens? See ?mean().
  2. Test out this function. What happens? Now try adding return(z) at the bottom of the function body. Explain your results.
bad_z_score <- function(x) {
  z <- (x - mean(x)) / sd(x)
}
  1. Write a function to compute skewness using sum(), length(), mean() and sd(). \[ \text{Skewness} \equiv \frac{1}{n} \sum_{i=1}^n\left( \frac{x_i - \bar{x}}{s}\right)^3. \]
  2. Use sum(), length() and is.na() to write a function called my_var() that drops NAs and then computes the sample variance.
  3. Write a function called summary_stats() that returns a named vector with two elements: the sample mean and standard deviation.

dot-dot-dot ...

  • Special arg that can be included in function definitions
  • Allows user to supply any number of “extra” args in call
  • These args can be “passed through” to functions in the body
  • See Advanced R Chapter 6.6 for more details.
raw_moment <- function(x, k, ...) {
  mean(x^k, ...)
}
some_data <- c(1, 2, 3, NA)

raw_moment(some_data, 1)
[1] NA
raw_moment(some_data, 1, na.rm = TRUE)
[1] 2

🔭 Scope

🔭 Scope: Examples

k is defined in the “global environment” so f() “can see it”

k <- 3
f <- \(x) {
  x^k
}  
f(2)
[1] 8

m is defined inside g() so the global environment “can’t see it”

g <- \(x) {
  m <- 2
  x^m
}
5^m
Error in eval(expr, envir, enclos): object 'm' not found

🔭 Scope: Another Example

x <- 0.5
h <- \(x) {
  sin(pi * x) # pi is a built-in constant in R
}
h(2) # Returns sin(2 * pi), not sin(pi * 0.5)
[1] -2.449294e-16
  • Here there are two objects named x.
  • h() looks inside the function first and finds x
  • Since it found x it stops looking.
  • Only checks global env if it doesn’t find x in h()

if() statements

If LOGICAL_CONDITION is TRUE, run code inside { ... }

if(LOGICAL_CONDITION) {
  # Your code goes here
}

Examples:

if(3 > 5) {
  print('Everything you know is wrong!')
}
my_name <- 'Frank'
if(identical(my_name, 'Frank')) {
  print('Hi Frank!')
}
[1] "Hi Frank!"

Warning

LOGICAL_CONDITION must be length one: an individual TRUE of FALSE value.

Early Returns

“break out” of function early: before completing everything

f <- function(x) {
  if(length(x) < 2) {
    return('Cannot compute variance for fewer than two observations')
  }
  mean((x - mean(x))^2) / (length(x) - 1)
}

f(1:3)
[1] 0.3333333
f(2)
[1] "Cannot compute variance for fewer than two observations"

if()...else adds “default case”

if(LOGICAL_CONDITION) {
  # This code runs if LOGICAL_CONDITION is TRUE
} else {
  # This code runs if LOGICAL_CONDITION is FALSE 
}

Examples:

if(3 > 5) {
  print('Everything you know is wrong!')
} else {
  print('The laws of mathematics continue to apply.')
}
[1] "The laws of mathematics continue to apply."
my_name <- 'Sam'
if(identical(my_name, 'Frank')) {
  print('Hi Frank!')
} else {
  print('You should change your name to Frank.')
}
[1] "You should change your name to Frank."

if()...else if()...else()

if(A) {
  # Run if A is TRUE
} else if(B) {
  # Run if A is FALSE but B is TRUE
} else if(C) {
  # Run if A, B are both FALSE but C is TRUE
} else if(D) {
  # Run if A, B, C are all FALSE, but D is TRUE 
} else {
  # Run if A, B, C, D are all FALSE
}
  • Block: chunk of code inside of {...}
  • R starts at the top, looks for the first TRUE condition
  • Once it finds a TRUE, R skips the remaining blocks
  • If all conditions are FALSE, R runs else block, if present

Example: 3 Mutually Exclusive Cases

f <- \(x, y) {
  if(x > y) {
    print('x is bigger than y')
  } else if(x < y) {
    print('y is bigger than x') 
  } else {
    print('x is equal to y')
  }
}

f(x = 1, y = 2)
[1] "y is bigger than x"
f(x = 3, y = 4)
[1] "y is bigger than x"
f(x = 5, y = 5)
[1] "x is equal to y"

🌳 An overgrown if() tree

get_value <- function(x) {
  if(identical(x, 'queen')) {
    9
  } else if(identical(x, 'rook')) {
    5
  } else if(identical(x, 'knight')) {
    3 
  } else if(identical(x, 'bishop')) {
    3
  } else if(identical(x, 'pawn')) {
    1
  } else {
    NA
  }
}

🌳 Lookup Tables vs if() trees

get_value2 <- function(x) {
  values <- c(9, 5, 3, 3, 1)
  names(values) <- c('queen', 'rook', 'knight', 'bishop', 'pawn')
  values[x]
}

get_value('queen')
[1] 9
get_value2('queen')
queen 
    9 

Note

if() trees are best for running different code in each branch; lookup tables are best for assigning different values in each branch.

💪 Exercise G - (8 min)

  1. What happens if you run the following code? Why?
x <- c(TRUE, TRUE)
if(x) {
  print('hello world!')
}
  1. What happens if you run this code? Try to fix it.
if(3 > 5) {
 print('3 is greater than 5') 
}
else {
 print('3 is not greater than 5') 
}
  1. Write a function called mycov() that calculates the sample covariance between x and y. Use an early return to print an error message when x and y have different lengths.
  2. Consult ?trunc(). Then use trunc() to write a function called myround() that rounds x to the nearest integer.

Part 4 - Iteration

for() loops

Basic syntax:

for(INDEX in SOME_ATOMIC_VECTOR) {
  # Run this chunk of code 
  # (Typically involves the elements of SOME_ATOMIC_VECTOR)
}

Example:

for(i in 1:3) {
  print(i^2)
}
[1] 1
[1] 4
[1] 9

for() loop details

for(INDEX in SOME_ATOMIC_VECTOR) {
  # Run this chunk of code 
  # (Typically involves the elements of SOME_ATOMIC_VECTOR)
}
  • Each “pass” through the loop assigns next value to INDEX
  • R creates INDEX if it doesn’t exist; overwrites if it does.
i <- c('John', 'Paul', 'Ringo', 'George')
i
[1] "John"   "Paul"   "Ringo"  "George"
for(i in 1:2) {
  print(1 / i)
}
[1] 1
[1] 0.5
i
[1] 2

More for() loop details

for(INDEX in SOME_ATOMIC_VECTOR) {
  # Run this chunk of code 
  # (Typically involves the elements of SOME_ATOMIC_VECTOR)
}
  • INDEX created in environment where loop was called
i <- c('John', 'Paul', 'Ringo', 'George')

f <- \() { # A function with no arguments!
  for(i in 1:2) {
    print(1 / i)
  }
} 

f()
[1] 1
[1] 0.5
i
[1] "John"   "Paul"   "Ringo"  "George"

Even more for() loop details

for(INDEX in SOME_ATOMIC_VECTOR) {
  # Run this chunk of code 
  # (Typically involves the elements of SOME_ATOMIC_VECTOR)
}
  • for() can iterate over any type of atomic vector
  • We can call the index variable anything we like
personnel <- c('Ana', 'Frank', 'Inbar', 'Sam')
for(person in personnel) {
  print(paste0('Hi, ', person, '!'))
}
[1] "Hi, Ana!"
[1] "Hi, Frank!"
[1] "Hi, Inbar!"
[1] "Hi, Sam!"

“What happens in a for() loop stays in the for() loop.”

Why doesn’t anything happen?!

for(i in 1:5) {
  i^2
}

Store the results somewhere to access later:

# Create an empty atomic vector to store results
squares <- vector(length = 5) 

for(i in 1:5) {
  # Fill in the element i of the vector squares  
  squares[i] <- i^2
}

squares
[1]  1  4  9 16 25

while() loops

while(CONDITION) {
  # As long as CONDITION remains TRUE:
  # keep repeating this block of code
}
  • Use while() when you don’t know in advance how many iterations you’ll need.
  • Use for() when you do know in advance how many iterations you’ll need.

Example: How many coin flips are needed to get our first heads?

set.seed(1234) # We'll discuss this in Week 3

coin_flip <- 'Tails'
n_flips <- 0

while(identical(coin_flip, 'Tails')) {
  coin_flip <- sample(c('Heads', 'Tails'), size = 1)
  n_flips <- n_flips + 1
}

n_flips
[1] 5

We’ll usually avoid explicit loops

  • Vectorized functions / math operations replace many loops
  • Functional Programming: cleaner approaches to iteration
  • Loops “under the hood” but we don’t write them explicitly.
  • That said: sometimes there’s no way around writing a loop.

Are Loops in R Slow?

  • A loop itself is rarely the problem: it’s everything else
    • Needless work inside the loop
    • Lots of function calls
    • Making copies of objects
    • Nested loops
  • Writing faster code:
    • Preallocate objects you’ll “fill” with a loop
    • Think about what can be computed in advance of the loop
    • Prefer vectorized functions / matrix operations to loops

Recall from above:

get_value <- function(x) {
  if(identical(x, 'queen')) {
    9
  } else if(identical(x, 'rook')) {
    5
  } else if(identical(x, 'knight')) {
    3 
  } else if(identical(x, 'bishop')) {
    3
  } else if(identical(x, 'pawn')) {
    1
  } else {
    NA
  }
}
get_value2 <- function(x) {
  values <- c(9, 5, 3, 3, 1)
  names(values) <- c('queen', 'rook', 'knight', 'bishop', 'pawn')
  values[x]
}

A case study in writing faster code

Generate a character vector of 1 million chess pieces:

set.seed(1234) # We'll discuss in Week 3
chess_pieces <- sample(c('queen', 'rook', 'knight', 'bishop', 'pawn'),
                       size = 1e6, replace = TRUE)

Consider three methods to assign these pieces numeric values:

  1. A for() loop that repeatedly calls get_value() and doesn’t pre-allocate any memory to store the result.
  2. A for() loop that repeatedly calls get_value(), but does pre-allocated memory to store the result.
  3. Vectorized solution without loops or function calls: get_value2()

A case study in writing faster code

method1 <- function(x) {
  results <- NULL
  for(i in 1:length(x)) {
    results[i] <- get_value(x[i])
  }
  results
}

method2 <- function(x) {
  results <- vector(length = length(x))
  for(i in 1:length(x)) {
    results[i] <- get_value(x[i])
  }
  results
}

Note

Method 3 is simply get_value2() so I don’t need a third function.

Vectorized approach is much faster

system.time(
  results1 <- method1(chess_pieces)
)
   user  system elapsed 
  0.889   0.020   0.908 
system.time(
  results2 <- method2(chess_pieces)
)
   user  system elapsed 
  0.752   0.000   0.752 
system.time(
  results3 <- get_value2(chess_pieces)
)
   user  system elapsed 
  0.011   0.000   0.011 
names(results3) <- NULL
identical(results1, results2) & identical(results2, results3)
[1] TRUE

💪 Exercise H - (8 min)

  1. The Fibonacci Sequence is defined by \(F_1 = 1\), \(F_2 = 1\) and \(F_n = F_{n-1} + F_{n-2}\) for \(n > 2\). Write a function that uses a for() loop to compute first n Fibonacci numbers.
  2. Come up with a way to generate the same output as f() without using a loop or if() ... else.
f <- \(x) {
  for(j in 1:length(x)) {
    if(x[j] > 0) {
      x[j] <- x[j]^3 + x[j]
    } else {
      x[j] <- x[j]^2 - x[j]
    } 
  }
  x
}

Part 5 - Matrices and Arrays

Attributes

  • “Extra information” that can be “attached” to an R object.
    • attributes(x) to view the attributes of x
    • attributes(x) returns NULL if x has no attributes
    • names() are an example of an attribute
x <- 1:5
attributes(x)
NULL
Frank <- c('birth year' = 1983, 'age' = 39, '#siblings' = 1) 
attributes(Frank)
$names
[1] "birth year" "age"        "#siblings" 

A Matrix is a Vector with Dimensions

  • Another important attribute is dimension: dim()
  • Atomic vectors have no dimensions:
x <- 1:6
dim(x)
NULL
  • If we give a vector dimensions, it becomes a matrix:
dim(x) <- c(3, 2) # ([number of rows], [number of columns])
x
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Key Points About Matrices / Arrays

  • A matrix is just an atomic vector with dimensions!
  • Matrices inherit key properties of atomic vectors:
    • Only store elements of a single type
    • Element-wise mathematical operations
    • Vectorized functions often work on matrices
    • Recycling rules
  • An array is a higher-dimensional generalization of a matrix.
  • Matrix / array operations are usually very fast

Whatever happened to class?

  • Adding / changing dimensions doesn’t change the type
x <- 1:6
typeof(x)
[1] "integer"
dim(x) <- c(3, 2)
typeof(x)
[1] "integer"
  • But it can change the class
class(1:6)
[1] "integer"
class(x)
[1] "matrix" "array" 

⚠️ A Vector is Not a \((1\times n)\) Matrix

x <- 1:6
x
[1] 1 2 3 4 5 6
class(x)
[1] "integer"
dim(x) <- c(1, 6) # ([number of rows], [number of columns])
x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
class(x)
[1] "matrix" "array" 

Warning

Some R functions / operations only work with matrices. A \((n\times 1)\) or \((1 \times n)\) matrix is not equivalent to an atomic vector. Remember: attributes and class.

R fills matrices by column.

  • One way to create a matrix is by setting dimensions:
x <- 1:9
dim(x) <- c(3, 3) # ([number of rows], [number of columns])
x
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
  • Another is by using matrix()
matrix(data = 1:9, nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Filling a matrix by row

Set byrow = TRUE in matrix()

matrix(data = 1:9, nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
matrix(data = 1:9, nrow = 3, ncol = 3, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

ncol() and nrow()

These tell us how many rows / columns a matrix has:

pieces <- c('Queen', 'Rook', 'Knight', 'Bishop', 'Pawn', 'King')
M <- matrix(pieces, 2, 3)
M
     [,1]    [,2]     [,3]  
[1,] "Queen" "Knight" "Pawn"
[2,] "Rook"  "Bishop" "King"
nrow(M)
[1] 2
ncol(M)
[1] 3

This is the same information as dim()

dim(M)
[1] 2 3

Diagonal Matrices

  • For vector x, diag(x) constructs a diagonal matrix
M <- diag(1:4)
M
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    2    0    0
[3,]    0    0    3    0
[4,]    0    0    0    4
  • For matrix M, diag(M) extracts the main diagonal
diag(M)
[1] 1 2 3 4
  • For integer k, diag(nrow = k) is the identity matrix \(I_k\)
diag(nrow = 4)
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1

Accessing Elements of a Matrix

Same idea as vectors but two dimensions [row, col]

M <- matrix(1:16, 4, 4)
M
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
M[3, 3]
[1] 11
M[1:2, 3:4]
     [,1] [,2]
[1,]    9   13
[2,]   10   14
M[c(TRUE, FALSE, FALSE, TRUE), c(FALSE, TRUE, TRUE, FALSE)]
     [,1] [,2]
[1,]    5    9
[2,]    8   12

Accessing Elements of a Matrix

Empty means everything from this dimension

M
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
M[1, ]
[1]  1  5  9 13
M[, 4]
[1] 13 14 15 16
M[, c(FALSE, TRUE, FALSE, TRUE)]
     [,1] [,2]
[1,]    5   13
[2,]    6   14
[3,]    7   15
[4,]    8   16

rbind() and cbind()

Create / expand a matrix by binding rows or columns

x <- 1:3 
y <- 4:6
z <- 7:9
M <- cbind(x, y, z)
M
     x y z
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
R <- rbind(x, y, z)
R
  [,1] [,2] [,3]
x    1    2    3
y    4    5    6
z    7    8    9
cbind(M, diag(nrow = 3))
     x y z      
[1,] 1 4 7 1 0 0
[2,] 2 5 8 0 1 0
[3,] 3 6 9 0 0 1
rbind(M, diag(1:3))
     x y z
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[4,] 1 0 0
[5,] 0 2 0
[6,] 0 0 3

💪 Exercise I - (8 min)

  1. Create a \(5\times 5\) matrix called A, each of whose rows contains the elements 1:5. Hint: see ?rep.
  2. Display all elements of A except row 3 and column 2.
  3. Form a matrix B by stacking the \((4\times 4)\) identity matrix on top of itself.
  4. Display the seventh row of B.
  5. Write a function that uses a for() loop to construct the \((n\times n)\) exchange matrix \(J_n\).

Matrix Selection Gone Wrong

A failed attempt to produce the \((3\times 3)\) identity matrix:

n <- 3
eye <- matrix(0, n, n)
eye
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
eye[1:n, 1:n] <- 1 
eye
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1

Oops! We wrote over the entire matrix by mistake!

Indexing a Matrix with Another Matrix

Instead: subset using a matrix of indices for the “target” matrix

n <- 3
eye <- matrix(0, 3, 3)

# 1st column of elements specifies the *row indices* of eye
# 2nd column of elements specifies the *col indices* of eye
elements <- cbind(1:n, 1:n) 
elements
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
eye[elements] <- 1
eye
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Elementwise Matrix Operations

Because a matrix is a vector with dimensions, +, -, *, and / are elementwise, just as they are for atomic vectors:

M <- matrix(1:4, 2, 2)
M
     [,1] [,2]
[1,]    1    3
[2,]    2    4
R <- matrix(c(0, -1, 1, 2), 
            2, 2)
R
     [,1] [,2]
[1,]    0    1
[2,]   -1    2
M + R
     [,1] [,2]
[1,]    1    4
[2,]    1    6
M - R
     [,1] [,2]
[1,]    1    2
[2,]    3    2
M * R
     [,1] [,2]
[1,]    0    3
[2,]   -2    8
M / R
     [,1] [,2]
[1,]  Inf    3
[2,]   -2    2

Matrix Products

(More on matrix algebra in R in future lectures)

v <- 1:2
A <- diag(v)
A
     [,1] [,2]
[1,]    1    0
[2,]    0    2
B <- matrix(-1, 2, 2)
B
     [,1] [,2]
[1,]   -1   -1
[2,]   -1   -1

Element-wise product: *

A * B
     [,1] [,2]
[1,]   -1    0
[2,]    0   -2

Usual matrix product: %*%

A %*% v # v is "promoted" 
     [,1]
[1,]    1
[2,]    4

Outer product: %o%

v %o% v
     [,1] [,2]
[1,]    1    2
[2,]    2    4

Matrices can have names

temps <- matrix(c(5, 2, 2,
                  12, 7, 8), nrow = 2, ncol = 3, byrow = TRUE)
rownames(temps) <- c('Aberdeen', 'Plymouth')
colnames(temps) <- c('Mon', 'Tues', 'Weds')
temps
         Mon Tues Weds
Aberdeen   5    2    2
Plymouth  12    7    8
rownames(temps)
[1] "Aberdeen" "Plymouth"
colnames(temps)
[1] "Mon"  "Tues" "Weds"

Subsetting Matrices by Name

temps
         Mon Tues Weds
Aberdeen   5    2    2
Plymouth  12    7    8
temps['Aberdeen', ]
 Mon Tues Weds 
   5    2    2 
temps[, 'Weds']
Aberdeen Plymouth 
       2        8 

Warning

There’s something funny about the second example: look closely!

drop() deletes “extra” dimensions

temps[, 'Weds']
Aberdeen Plymouth 
       2        8 
temps[, 'Weds', drop = FALSE]
         Weds
Aberdeen    2
Plymouth    8
M <- matrix(1:4, 1, 4)
M
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
dim(M)
[1] 1 4
class(M)
[1] "matrix" "array" 
v <- drop(M)
v
[1] 1 2 3 4
dim(v)
NULL
class(v)
[1] "integer"

💪 Exercise J - (15 min)

  1. Write a function to constructs the \((n\times n)\) exchange matrix \(J_n\) without using a loop.
  2. Compute the element-wise product of \(J_3\) with itself, and the square of \(J_3\), i.e. the ordinary matrix product \(J_3 J_3\).
  3. Let \(X\) be a Bernoulli\((0.2)\) and \(Y\) be a Binomial\((2, 0.5)\) RV. Construct a matrix p_XY that represents the joint pmf of \(X\) and \(Y\), under the assumption that \(X\) and \(Y\) are independent. Name the rows and columns.
  4. Consult ?rowSums() and ?colSums(). Then extract the marginal pmfs of \(X\) and \(Y\) from the matrix p_XY.

Part 6 - Lists and Dataframes

Atomic Vectors & Matrices / Arrays

  • Advantages
    • Mathematically convenient
    • Fast operations
    • Efficient use of memory
  • Disadvantages
    • Only store elements of one type
    • “Shape” must be rectangular

Lists: R’s Most General Data Structure

  • Reach for a list when you need to store:
    • more than one type
    • something with non-rectangular “shape”
  • Lists can store anything:
    • atomic vectors
    • matrices / arrays
    • even other lists!
  • Important special case of a list: data frame

List Basics

list() creates a list, just like c() creates an atomic vector

my_list <- list(c(TRUE, FALSE, FALSE), pi, diag(nrow = 2))
my_list
[[1]]
[1]  TRUE FALSE FALSE

[[2]]
[1] 3.141593

[[3]]
     [,1] [,2]
[1,]    1    0
[2,]    0    1

str() tells us what’s inside:

str(my_list)
List of 3
 $ : logi [1:3] TRUE FALSE FALSE
 $ : num 3.14
 $ : num [1:2, 1:2] 1 0 0 1

Accessing List Elements

  • This is complicated because lists are hierarchical
  • Single brackets []
    • Always returns a list
    • Otherwise same behavior as atomic vectors
  • Double brackets [[]]
    • Extract an object from the list by position
    • Doesn’t, in general, return a list
  • After extracting an object, index into it “as usual”

Examples of Accessing List Elements

my_list[1:2] # returns a list
[[1]]
[1]  TRUE FALSE FALSE

[[2]]
[1] 3.141593
my_list[3] # returns a list
[[1]]
     [,1] [,2]
[1,]    1    0
[2,]    0    1
my_list[[3]] # returns a matrix
     [,1] [,2]
[1,]    1    0
[2,]    0    1
my_list[[3]][1, 1] # (1, 1) element of the matrix my_list[[3]]
[1] 1

Named Lists

When creating a list, you can name the elements as with c()

core_ERM <- list('lecturer' = 'Frank', n_students = 70,
                 'TAs' = c('Ana', 'Inbar', 'Sam'))

Now we can access objects by name

core_ERM['lecturer'] # returns a list
$lecturer
[1] "Frank"
core_ERM[['lecturer']] # returns a vector
[1] "Frank"

$NAME_HERE is a shortcut for [['NAME_HERE']]

core_ERM$lecturer
[1] "Frank"

Data Frames

  • A special kind of list that is almost like a matrix.
  • Each element of the list is an atomic vector
  • These vectors don’t have to contain the same type
  • But they do have to be of the same length.
  • It’s basically a spreadsheet

Creating a Data Frame

A data frame is has type list and class data.frame

students <- data.frame('name' = c('Xerxes', 'Xanthippe', 'Xanadu'),
                       'age' = c(19, 23, 21),
                       'grade' = c(65, 70, 68),
                       'favorite_color' = c('blue', 'red', 'orange'))
students
       name age grade favorite_color
1    Xerxes  19    65           blue
2 Xanthippe  23    70            red
3    Xanadu  21    68         orange
typeof(students)
[1] "list"
class(students)
[1] "data.frame"

Accessing Elements of a Dataframe

We can mix-and-match selection rules for lists and matrices:

students 
       name age grade favorite_color
1    Xerxes  19    65           blue
2 Xanthippe  23    70            red
3    Xanadu  21    68         orange
students[1, 2] # The element in row 1 of column 2
[1] 19
students[, 2] # Everything in column 2
[1] 19 23 21
students[, 'favorite_color']
[1] "blue"   "red"    "orange"
students$name
[1] "Xerxes"    "Xanthippe" "Xanadu"   
students[students$name == 'Xerxes', ]
    name age grade favorite_color
1 Xerxes  19    65           blue

Why so few slides on data frames?

  • Data frames are a real pain!
  • We’ll work with something better: tibbles.
  • More in our next lecture.

💪 Exercise K - (7 min)

  1. I used students$name == 'Xerxes' above. Why didn’t I instead use identical(students$name, 'Xerxes')?

  2. Use the following code chunk to construct the employees data frame. Then display it.

employees <- data.frame(
  name = c("Alice", "Bob", "Cathy", "David", "Eva", 
           "Frank", "Grace", "Hank", "Ivy", "Jack"),
  age = c(25, 31, 28, 40, 35, 23, 30, 45, 33, 29),
  department = c("HR", "IT", "Finance", "IT", "HR", 
                 "Finance", "IT", "HR", "Finance", "IT"),
  salary = c(50000, 60000, 55000, 70000, 53000, 
             51000, 62000, 71000, 57000, 59000)
)
  1. Display the age column of employees.
  2. Display the sixth row of employees.
  3. Display the employee record for Eva.
  4. Display employee records for everyone in the IT department.
  5. Repeat the preceding, restricted to people with a salary of at least 60,000.