A Crash Course in R Programming
University of Oxford
Names can contain letters, numbers, ., and underscore _.
Use snake_case, using _ to separate words
x <- 3 creates an object named x with the value 3
Read it as “x gets 3”.
The equals sign (=) also works but it’s bad style.
Arithmetic: +, -, *, / and parentheses
The multiplication sign * is mandatory
Exponentiate with ^ and compute \(e^x\) with exp(x)
Logarithms: log(a, base = b), defaults to natural log
Numeric: x <- 3
Integer: append L, for example x <- 3L
Character: x <- 'hello world!' or y <- "ERM rocks!"
Logical: TRUE or FALSE
typeof() tells you the type
! turns a TRUE into a FALSE and vice-versa
x & y is TRUE iff both x and y are TRUE
x | y is TRUE iff at least one of x and y is TRUE
xor(x, y) is TRUE iff exactly one of x and y is TRUE
Comparisons return TRUE or FALSE
<, >, <=, and >= all mean what you think they do
== tests for equality; don’t confuse it with =
!= tests for lack of equality
== and != can be Dangerous
To get a single TRUE / FALSE value use identical()
To test for approximate equality use all.equal()
NAs and Infs and NaNs
NA means not available / missing
Inf means infinity and -Inf means minus infinity
NaN means not a number
NAs, Infs, and NaNs: Examples
Does (NA & TRUE) equal (NA | TRUE)? Explain.
Does (Inf - Inf) equal (Inf - 1)? Explain.
Use c() “concatenate” to create an atomic vector
Use typeof() to find out the type of an atomic vector
Use length() to find the length of an atomic vector
Predict the result that you will obtain if you use typeof() to find the type of each of the following atomic vectors. Then check to see if you were right!
Use [] to access elements of an atomic vector.
Supply a single index x[2] or vector of them x[c(2, 5, 7)]
Negative indices drop elements: x[-1]
Repeated indices repeat elements: x[c(2, 2)]
Warning
R (like Julia) indexes from one, unlike Python and C/C++ which index from zero.
Did you notice the [1] that keeps appearing everywhere?
w is a vector; [1] denotes its first (and only) element.
Here the first element [1] is 30 and the 26th [26] is 55:
Try y[5]. What result do you get? Why?
I want the second and fourth elements of y so I enter y[2,4]. What happens? Can you fix it? How?
Extract 'Keble' and 'Univ' two different ways.
Using [], length() and -, compute the monthly growth rates in %
Mathematical operations in R are vectorized and operate element by element:
Nearly all R functions are vectorized: they accept vector input
Recycling allows operations with vectors of different lengths, e.g. “scalars” with vectors:
numeric(0)
[1] 0
numeric(0)
Warning in c(1, 2, 3) + c(5, 6, 7, 8): longer object length is not a multiple
of shorter object length
[1] 6 8 10 9
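The outputs above can be reproduced with a few lines. This is a minimal sketch of vectorization and recycling; the vector x and its values are illustrative and do not come from the original slides.

```r
# Illustrative vector (an assumption, not from the slides)
x <- c(1, 2, 3, 4)

# A length-one "scalar" is recycled to match the length of x
scaled <- x * 2

# A shorter vector is recycled silently when the lengths divide evenly
shifted <- x + c(10, 20)

# Zero-length vectors are a special case: the result has length zero
empty <- numeric(0) + 1

print(scaled)         # 2 4 6 8
print(shifted)        # 11 22 13 24
print(length(empty))  # 0
```

When the longer length is not a multiple of the shorter one, R still recycles but issues the warning shown above.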
The probability mass function of a Binomial\((n, p)\) random variable is given by \[
\mathbb{P}(X=x) = \binom{n}{x} p^x (1 - p)^{n-x}
\] Use vectorized mathematical operations and the choose() function to calculate the pmf of a Binomial\((5, 0.3)\) random variable in one fell swoop.
Use <- to overwrite
Method 1: create first, then use names()
birth year age #siblings
1983 40 1
Method 2: name when creating
birth year age #siblings
1983 40 1
You can rename with names():
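A sketch of both naming methods and of renaming, using the values from the output above. The object names x and y and the replacement name 'year' are assumptions made for illustration.

```r
# Method 1: create first, then attach names
x <- c(1983, 40, 1)
names(x) <- c('birth year', 'age', '#siblings')

# Method 2: name when creating (names with spaces or # need quotes)
y <- c('birth year' = 1983, 'age' = 40, '#siblings' = 1)

# Rename by assigning to names(); 'year' is a hypothetical new name
names(y)[1] <- 'year'

x['age']   # named elements can be accessed by name
names(y)   # "year" "age" "#siblings"
```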
NULL is the empty set. You can assign it, e.g. x <- NULL
union(A, B) \(\equiv A \cup B\)
intersect(A, B) \(\equiv A\cap B\)
setdiff(A, B) \(\equiv A \setminus B \equiv A - B \equiv A \cap B^{c}\)
setequal(A, B) is TRUE iff \(A \subseteq B\) and \(B \subseteq A\)
A %in% B returns a vector of length(A) with TRUE for each element of A that is contained in B, FALSE otherwise
Note
To coerce manually: as.character(), as.numeric(), as.logical()
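A brief sketch of coercion, assuming the usual R ordering from least to most flexible type (logical, integer, double, character). All values are illustrative.

```r
# Mixing types in c() coerces everything to the most flexible type present
t1 <- typeof(c(TRUE, 2L))    # "integer"
t2 <- typeof(c(1L, 2.5))     # "double"
t3 <- typeof(c(1.5, 'a'))    # "character"

# Manual coercion
num <- as.numeric('3.14')    # 3.14
chr <- as.character(42)      # "42"
lgl <- as.logical('TRUE')    # TRUE

# Logicals become 0/1 in arithmetic, so sum() counts the TRUEs
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2
```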
Replace the 999s in this vector with NAs
Convert cards to the appropriate numeric values.
[...] y to make it work.
What is as.logical(-2:2)? Can you figure out the coercion rule for numeric to logical?
Start with Hands-On Programming with R. For more:
A good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice. – Hadley Wickham
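R has a built-in function, scale(), to compute z-scores.
z_score <- means “create an object called z_score.”
function(x) means “a function that takes an argument x.”
{ ... } contains the body of the function
[...] function() and the linebreaks.
A function returns the value of its last expression; an explicit return() is bad style; reserve it for “early returns”
The z_score() function
z_score <- function(x) {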
# Center and standardize a numeric vector x, returns z-scores
(x - mean(x)) / sd(x)
}
example_data <- c(-2, 6, 3, -1, 7, 8, 0, 4, 3, -5)
z <- z_score(example_data)
z
[1] -1.0195160 0.8772580 0.1659677 -0.7824193 1.1143547 1.3514514
[7] -0.5453225 0.4030645 0.1659677 -1.7308062
z_score <- function(x) { ... }
z_score(example_data)
\() is Shorthand for function()
R 4.1.0 introduced \() as shorthand for function()
[1] -1.0195160 0.8772580 0.1659677 -0.7824193 1.1143547 1.3514514
[7] -0.5453225 0.4030645 0.1659677 -1.7308062
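A minimal sketch comparing the two spellings. It assumes R 4.1.0 or later; the name z_score_short is hypothetical.

```r
# Requires R >= 4.1.0, where the \() shorthand was introduced
z_score <- function(x) (x - mean(x)) / sd(x)
z_score_short <- \(x) (x - mean(x)) / sd(x)   # hypothetical name

example_data <- c(-2, 6, 3, -1, 7, 8, 0, 4, 3, -5)

# Both definitions produce exactly the same z-scores
identical(z_score(example_data), z_score_short(example_data))  # TRUE
```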
The \(k\)th raw moment of a random variable is \(\mathbb{E}[X^k]\). The sample analogue is \(\frac{1}{n} \sum_{i=1}^n x_i^k\).
Try z_score(w) where w <- c(1, 2, NA). What happens? See ?mean().
[...] return(z) at the bottom of the function body. Explain your results.
Write a function that computes the sample skewness using sum(), length(), mean() and sd():
\[
\text{Skewness} \equiv \frac{1}{n} \sum_{i=1}^n\left( \frac{x_i - \bar{x}}{s}\right)^3.
\]
Use sum(), length() and is.na() to write a function called my_var() that drops NAs and then computes the sample variance.
Write a function called summary_stats() that returns a named vector with two elements: the sample mean and standard deviation.
k is defined in the “global environment” so f() “can see it”
m is defined inside g() so the global environment “can’t see it”
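The two statements above can be illustrated as follows. Only the names k, f(), m and g() come from the text; the function bodies and values are assumptions.

```r
# Hypothetical bodies: only the names k, f, m and g appear in the text
k <- 10
f <- function(x) {
  x + k    # k is not defined inside f(), so R finds it in the global environment
}
f(1)       # 11

g <- function(x) {
  m <- 2   # m exists only while g() is running
  x * m
}
g(3)       # 6

# After g() returns, m is gone: typing m at the console gives an
# "object not found" error unless m was defined somewhere else.
```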
x <- 0.5
h <- \(x) {
sin(pi * x) # pi is a built-in constant in R
}
h(2) # Returns sin(2 * pi), not sin(pi * 0.5)
[1] -2.449294e-16
To evaluate sin(pi * x), R needs a value for x.
h() looks inside the function first and finds x
Once it finds x it stops looking.
The argument x masks the global x in h()
if () statements
If LOGICAL_CONDITION is TRUE, run code inside { ... }
Examples:
if (3 > 5) {
print('Everything you know is wrong!')
}
my_name <- 'Frank'
if (identical(my_name, 'Frank')) {
print('Hi Frank!')
}
[1] "Hi Frank!"
Warning
LOGICAL_CONDITION must be length one: an individual TRUE or FALSE value.
Use return() to “break out” of a function early: before completing everything
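A small sketch of an early return. The function safe_sqrt() is hypothetical and is not taken from the slides.

```r
# safe_sqrt() is a hypothetical example, not a function from the slides
safe_sqrt <- function(x) {
  if (x < 0) {
    return(NA_real_)   # early return: the rest of the body is skipped
  }
  sqrt(x)              # normal case: the last expression is the return value
}

safe_sqrt(9)    # 3
safe_sqrt(-1)   # NA
```

The explicit return() appears only in the exceptional branch; the usual result is still the last expression of the body.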
if ()...else adds a “default case”
Examples:
if (3 > 5) {
print('Everything you know is wrong!')
} else {
print('The laws of mathematics continue to apply.')
}
[1] "The laws of mathematics continue to apply."
my_name <- 'Sam'
if (identical(my_name, 'Frank')) {
print('Hi Frank!')
} else {
print('You should change your name to Frank.')
}
[1] "You should change your name to Frank."
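if ()...else if ()...else
Chain additional conditions with else if (){...}
R runs the block for the first TRUE condition
Once a condition is TRUE, R skips the remaining blocks
If all conditions are FALSE, R runs else block, if present
Example: if () tree
Lookup tables as an alternative to if () trees
get_value2 <- function(x) {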
values <- c(9, 5, 3, 3, 1)
names(values) <- c('queen', 'rook', 'knight', 'bishop', 'pawn')
values[x]
}
get_value('queen')
[1] 9
get_value2('queen')
queen
    9
Note
if () trees are best for running different code in each branch; lookup tables are best for assigning different values in each branch.
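The text calls get_value() without showing its definition. The following is a hypothetical reconstruction as an if () tree, assuming it assigns the same values as the lookup table in get_value2(); the NA default for unknown pieces is also an assumption.

```r
# Hypothetical reconstruction of get_value() as an if () tree
get_value <- function(x) {
  if (identical(x, 'queen')) {
    9
  } else if (identical(x, 'rook')) {
    5
  } else if (identical(x, 'knight') || identical(x, 'bishop')) {
    3
  } else if (identical(x, 'pawn')) {
    1
  } else {
    NA   # default case for an unrecognized piece (an assumption)
  }
}

get_value('queen')   # 9
get_value('rook')    # 5
get_value('king')    # NA
```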
Write a function called mycov() that calculates the sample covariance between x and y. Use an early return to print an error message when x and y have different lengths.
Read ?trunc(). Then use trunc() to write a function called myround() that rounds x to the nearest integer.
for () loops
Basic syntax:
Example:
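The original code chunks are not shown here, so the following is only a sketch of the general pattern; the index name i and the loop body are illustrative.

```r
# General pattern: for (INDEX in VECTOR) { code that uses INDEX }
for (i in 1:3) {
  print(i^2)   # prints 1, then 4, then 9
}

# The index is an ordinary object: it still exists after the loop
# and holds the last value it was given
i   # 3
```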
for () loop details
The loop assigns each value in turn to INDEX
Creates INDEX if it doesn’t exist; overwrites if it does.
INDEX created in environment where loop was called
for () can iterate over any type of atomic vector
“What happens in the for () loop stays in the for () loop.”
Why doesn’t anything happen?!
Store the results somewhere to access later:
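A minimal sketch of the problem and the fix. The vector name squares and the loop body are assumptions made for illustration.

```r
# Nothing is displayed: values computed inside a loop are not auto-printed
for (i in 1:5) {
  i^2
}

# Pre-allocate a vector of the right length, then fill it in
squares <- numeric(5)
for (i in seq_along(squares)) {
  squares[i] <- i^2
}
squares   # 1 4 9 16 25
```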
while () loops
Use while () when you don’t know in advance how many iterations you’ll need.
Use for () when you do know in advance how many iterations you’ll need.
Generate a character vector of 1 million chess pieces:
Consider three methods to assign these pieces numeric values:
1. A for () loop that repeatedly calls get_value() and doesn’t pre-allocate any memory to store the result.
2. A for () loop that repeatedly calls get_value(), but does pre-allocate memory to store the result.
3. A single vectorized call to get_value2()
Note
Method 3 is simply get_value2() so I don’t need a third function.
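A sketch of how such a comparison might be set up. Here get_value() is a hypothetical if () tree, get_value2() is the lookup table from the text, method1() and method2() are hypothetical names, and n is much smaller than 1 million to keep the run short. Timings depend on the machine, so none are claimed.

```r
# Hypothetical if () tree; anything unrecognized is treated as a pawn
get_value <- function(x) {
  if (x == 'queen') {
    9
  } else if (x == 'rook') {
    5
  } else if (x == 'knight' || x == 'bishop') {
    3
  } else {
    1
  }
}

# Lookup table, as defined in the text
get_value2 <- function(x) {
  values <- c(9, 5, 3, 3, 1)
  names(values) <- c('queen', 'rook', 'knight', 'bishop', 'pawn')
  values[x]
}

set.seed(1)
n <- 1e4   # assumption: smaller than the 1 million used in the text
pieces <- sample(c('queen', 'rook', 'knight', 'bishop', 'pawn'),
                 size = n, replace = TRUE)

# Method 1: grow the result one element at a time (no pre-allocation)
method1 <- function(p) {
  out <- c()
  for (i in seq_along(p)) out <- c(out, get_value(p[i]))
  out
}

# Method 2: pre-allocate, then fill in
method2 <- function(p) {
  out <- numeric(length(p))
  for (i in seq_along(p)) out[i] <- get_value(p[i])
  out
}

system.time(v1 <- method1(pieces))
system.time(v2 <- method2(pieces))
system.time(v3 <- get_value2(pieces))   # Method 3: one vectorized lookup

# The three results agree once the names from the lookup table are dropped
identical(v1, v2) && identical(v2, unname(v3))   # TRUE
```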
user system elapsed
0.840 0.018 0.859
user system elapsed
0.71 0.00 0.71
user system elapsed
0.008 0.002 0.010
[1] TRUE
Write a for () loop to compute the first n Fibonacci numbers.
[...] f() without using a loop or if () ... else.
Use attributes(x) to view the attributes of x
attributes(x) returns NULL if x has no attributes
names() are an example of an attribute
dim()
[1] 1 2 3 4 5 6
[1] "integer"
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
[1] "matrix" "array"
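Output like the above can be produced by giving a plain integer vector a dim attribute. This is a sketch; the object name x is an assumption.

```r
x <- 1:6
typeof(x)          # "integer"
attributes(x)      # NULL: a plain atomic vector has no attributes

dim(x) <- c(1, 6)  # setting dim turns x into a 1 x 6 matrix
class(x)           # "matrix" "array" (R >= 4.0.0)
typeof(x)          # still "integer": only the attributes changed
```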
Warning
Some R functions / operations only work with matrices. A \((n\times 1)\) or \((1 \times n)\) matrix is not equivalent to an atomic vector. Remember: attributes and class.
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
matrix()
matrix() fills column by column by default. Set byrow = TRUE in matrix() to fill row by row.
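A short sketch contrasting the two fill orders; the names by_col and by_row are illustrative.

```r
by_col <- matrix(1:9, nrow = 3)                 # default: fill column by column
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)   # fill row by row

by_col[1, ]   # 1 4 7
by_row[1, ]   # 1 2 3

# Filling by row gives the transpose of filling by column
identical(by_row, t(by_col))   # TRUE
```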
ncol() and nrow()
These tell us how many rows / columns a matrix has:
[,1] [,2] [,3]
[1,] "Queen" "Knight" "Pawn"
[2,] "Rook" "Bishop" "King"
[1] 2
[1] 3
This is the same information as dim()
For a vector x, diag(x) constructs a diagonal matrix
For a matrix M, diag(M) extracts the main diagonal
For a positive integer k, diag(nrow = k) is the identity matrix \(I_k\)
Same idea as vectors but two dimensions [row, col]
Empty means everything from this dimension
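A sketch of matrix subsetting and of diag(), using an illustrative matrix M that is not from the slides.

```r
M <- matrix(1:6, nrow = 2)   # 2 x 3 matrix, filled column by column

M[2, 3]        # single element: 6
M[1, ]         # empty column index: all of row 1, i.e. 1 3 5
M[, 2]         # empty row index: all of column 2, i.e. 3 4
M[, c(1, 3)]   # columns 1 and 3: still a 2 x 2 matrix

D <- diag(c(1, 2))   # 2 x 2 diagonal matrix built from a vector
diag(D)              # extracts the main diagonal: 1 2
I3 <- diag(nrow = 3) # 3 x 3 identity matrix
```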
rbind() and cbind()
Create / expand a matrix by binding rows or columns
Create a matrix A, each of whose rows contains the elements 1:5. Hint: see ?rep.
Select all of A except row 3 and column 2.
Create a matrix B by stacking the \((4\times 4)\) identity matrix on top of itself.
[...] B.
Use a for() loop to construct the \((n\times n)\) exchange matrix \(J_n\).
A failed attempt to produce the \((3\times 3)\) identity matrix:
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
Oops! We wrote over the entire matrix by mistake!
Instead: subset using a matrix of indices for the “target” matrix
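A sketch of the failed attempt and the fix, assuming the original started from a matrix of zeros; the names M and idx are illustrative.

```r
M <- matrix(0, nrow = 3, ncol = 3)

# Failed attempt: M[1:3, 1:3] <- 1 selects every combination of
# rows 1:3 and columns 1:3, so it overwrites the whole matrix.

# Fix: each row of idx is one (row, column) pair in the target matrix
idx <- cbind(1:3, 1:3)
M[idx] <- 1

identical(M, diag(nrow = 3))   # TRUE
```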
Because a matrix is a vector with dimensions, +, -, *, and / are elementwise, just as they are for atomic vectors:
(More on matrix algebra in R in future lectures)
         Mon Tues Weds
Aberdeen   5    2    2
Plymouth  12    7    8
 Mon Tues Weds
   5    2    2
Aberdeen Plymouth
       2        8
Warning
There’s something funny about the second example: look closely!
drop() deletes “extra” dimensions
Construct a matrix p_XY that represents the joint pmf of \(X\) and \(Y\), under the assumption that \(X\) and \(Y\) are independent. Name the rows and columns.
Read ?rowSums() and ?colSums(). Then extract the marginal pmfs of \(X\) and \(Y\) from the matrix p_XY.
list() creates a list, just like c() creates an atomic vector
[[1]]
[1] TRUE FALSE FALSE
[[2]]
[1] 3.141593
[[3]]
[,1] [,2]
[1,] 1 0
[2,] 0 1
str() tells us what’s inside:
[] returns a list containing the selected elements
[[]] extracts a single element from the list
When creating a list, you can name the elements as with c()
Now we can access objects by name
$lecturer
[1] "Frank"
[1] "Frank"
$NAME_HERE is a shortcut for [['NAME_HERE']]
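A sketch of the three ways to access list elements. Only the lecturer element appears in the output above; the list name course and its other elements are assumptions.

```r
# Only lecturer = 'Frank' comes from the text; the rest is illustrative
course <- list(lecturer = 'Frank', weeks = 1:8, topics = c('R', 'stats'))

str(course)                    # compact summary of what is inside

sub  <- course['lecturer']     # [] returns a list of length one
elem <- course[['lecturer']]   # [[]] extracts the element itself
same <- course$lecturer        # $ is shorthand for [['lecturer']]

is.list(sub)           # TRUE
identical(elem, same)  # TRUE
```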
A data frame has type list and class data.frame
We can mix-and-match selection rules for lists and matrices:
name age grade favorite_color
1 Xerxes 19 65 blue
2 Xanthippe 23 70 red
3 Xanadu 21 68 orange
[1] 19
[1] 19 23 21
[1] "blue" "red" "orange"
[1] "Xerxes" "Xanthippe" "Xanadu"
name age grade favorite_color
1 Xerxes 19 65 blue
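The data frame behind this output can be rebuilt from the values printed above. The selection expressions below are assumptions chosen to be consistent with the printed results, not necessarily the ones used originally.

```r
# Values taken from the printed output above
students <- data.frame(
  name = c('Xerxes', 'Xanthippe', 'Xanadu'),
  age = c(19, 23, 21),
  grade = c(65, 70, 68),
  favorite_color = c('blue', 'red', 'orange')
)

students[1, 2]                  # matrix-style: row 1, column 2
students$age                    # list-style: one column as a vector
students[['favorite_color']]    # list-style with [[]]
students[, 'name']              # matrix-style: all rows of one column
students[students$name == 'Xerxes', ]   # rows where the condition is TRUE
```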
I used students$name == 'Xerxes' above. Why didn’t I instead use identical(students$name, 'Xerxes')?
Use the following code chunk to construct the employees data frame. Then display it.
employees <- data.frame(
name = c("Alice", "Bob", "Cathy", "David", "Eva",
"Frank", "Grace", "Hank", "Ivy", "Jack"),
age = c(25, 31, 28, 40, 35, 23, 30, 45, 33, 29),
department = c("HR", "IT", "Finance", "IT", "HR",
"Finance", "IT", "HR", "Finance", "IT"),
salary = c(50000, 60000, 55000, 70000, 53000,
51000, 62000, 71000, 57000, 59000)
)
Extract the age column of employees.
[...] employees.
[...] Eva.
[...] IT department.
Comments
Anything after # is a comment
Good style: start with # and then a space