A deeper look into topics that get skipped.
Time: 90 minutes
Description: This workshop will extend your understanding of R and cover how to interact with non-data.frame structured data such as vectors, matrices and lists. We’ll also look at creating your own functions and loops.
Learning objectives:
This workshop follows an explanation-then-try-it-yourself format, using exercises.
Motivation: Many frustrations of dealing with R stem from not understanding how R is working ‘under-the-hood’. We want to explain some of the details that get skipped as part of introductory courses so that you understand what is going on and how to delve into some of the oddities. We also want to show you how to start being able to implement solutions in R to tackle your specific problems.
This chapter is useful for additional information: https://adv-r.hadley.nz/vectors-chap.html
There are 4 main types of atomic vector: logical, integer, double and character.
Double and integer are collectively known as numeric vectors.
Vectors are created using c() to combine values. c() is also used to combine atomic vectors together, and when the arguments are all atomic vectors it flattens the structure.
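As a minimal sketch of the four atomic types and of the flattening behaviour of c(): the vector names and values below are assumptions chosen for illustration (chr_vec is named to match the typeof() call used in this section).

```r
# Hypothetical example vectors, one per atomic type
lgl_vec <- c(TRUE, FALSE, NA)
int_vec <- c(1L, 2L, 3L)      # the L suffix makes an integer
dbl_vec <- c(1.5, 2, 3.25)
chr_vec <- c("a", "b", "c")

# c() flattens nested atomic vectors into a single vector
nested <- c(1, c(2, c(3, 4)))
flat   <- c(1, 2, 3, 4)
identical(nested, flat)   # TRUE
length(nested)            # 4
```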
We can use typeof to find out the atomic type of our vectors:
typeof(chr_vec)
[1] "character"
We can create empty vectors of specific lengths using the atomic types
double(length = 2)
[1] 0 0
numeric(length = 3)
[1] 0 0 0
character(3)
[1] "" "" ""
logical(2)
[1] FALSE FALSE
double(0)
numeric(0)
All elements of an atomic vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type. Types from least to most flexible are: logical -> integer -> double -> character.
We can explicitly use coercion on vectors with the as.numeric, as.character, and as.logical functions.
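A short sketch of both kinds of coercion may help here. The values are made up for illustration; the functions are all base R.

```r
# Implicit coercion: c() picks the most flexible type present
typeof(c(TRUE, 1L))        # "integer"
typeof(c(1L, 2.5))         # "double"
typeof(c(1.5, "a"))        # "character"

# Explicit coercion with the as.* functions
as.numeric(c("1", "2.5"))  # 1.0 2.5
as.character(c(1, 2))      # "1" "2"
as.logical(c(0, 1, 2))     # FALSE TRUE TRUE

# Values that cannot be converted become NA (R also raises a warning,
# which is suppressed here to keep the output clean)
bad <- suppressWarnings(as.numeric(c("3", "three")))
bad                        # 3 NA
```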
Create the following vectors:
my_nums
my_words
num_results
.char_results
.my_nums
and my_words
vectors into a single vector called nums_words
What do you notice are the values in the empty numeric and character vectors?
This section is complemented by https://adv-r.hadley.nz/vectors-chap.html for a more in-depth understanding and explanation.
In R there are two types of data structure. One is for the storage of homogeneous data, i.e. the data all has to be of the same type, such as numeric. The other is for the storage of heterogeneous data, i.e. the data can be of different types.
The main structures for homogeneous data are the 1-dimensional vector and the 2-dimensional matrix. A matrix requires all columns to be of equal length and all rows to be of equal length.
example_vector <- c(1,2,5,6)
example_vector
[1] 1 2 5 6
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
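One way to build a matrix like the 2 x 5 one printed above is sketched here. The name example_matrix and the use of byrow = TRUE are assumptions made to reproduce that layout.

```r
# Fill a 2 x 5 matrix row by row with the numbers 1 to 10
example_matrix <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
dim(example_matrix)      # 2 5
typeof(example_matrix)   # "integer"

# A matrix is homogeneous: one character value coerces every element
mixed <- matrix(c(1, "a", 3, 4), nrow = 2)
typeof(mixed)            # "character"
```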
The main heterogeneous structures are the list and the data.frame (and tibble). A list is a series of elements, each of which can be of a different data type and dimension. It is a very versatile structure. A data.frame is a list of variables, each with the same number of rows. Before R 4.0, character vectors were converted to factors by default when creating a data.frame; from R 4.0 onwards this is no longer the default behaviour. A tibble is a special type of data.frame with a priority on printing and inspection, and comes from the tibble package, which is part of the tidyverse.
[[1]]
[1] 1 2 3
[[2]]
[1] "a"
[[3]]
[1] TRUE FALSE TRUE
[[4]]
[1] 2.3 5.9
example_df <- data.frame(col1 = c("a", "b", "c"),
col2 = c(1, 2, 3),
col3 = c(TRUE, FALSE, FALSE) )
example_df
col1 col2 col3
1 a 1 TRUE
2 b 2 FALSE
3 c 3 FALSE
example_tib <- tibble::tibble(col1 = c("a", "b", "c"),
col2 = c(1, 2, 3),
col3 = c(TRUE, FALSE, FALSE) )
example_tib
# A tibble: 3 x 3
col1 col2 col3
<chr> <dbl> <lgl>
1 a 1 TRUE
2 b 2 FALSE
3 c 3 FALSE
str
class
attributes
typeof
length
dim
items <- c("fork" = 5, "table" = 1,"knife" = 6,"spoon" = 4)
names(items)
attributes(items)
attributes(mtcars)
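As a rough illustration of what the inspection functions listed above report, here they are applied to the named items vector and to a small matrix. The matrix m is a made-up example.

```r
items <- c("fork" = 5, "table" = 1, "knife" = 6, "spoon" = 4)
typeof(items)       # "double"
class(items)        # "numeric"
length(items)       # 4
dim(items)          # NULL: a plain vector has no dim attribute
attributes(items)   # a list containing only $names

# Hypothetical matrix for comparison
m <- matrix(1:6, nrow = 2)
dim(m)              # 2 3
attributes(m)       # a list containing only $dim
str(m)              # compact summary of type, dimensions and values
```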
https://adv-r.hadley.nz/subsetting.html
Subsetting operators: [, [[, and $
We’ll look at the differences between these operators later.
There are two main methods of subsetting in R, the first is by specifying the numbered positions (index) from the data structure we have. The second is by providing a logical vector - usually created through a conditional statement.
These statements rely on a test (condition/comparison) that results in a boolean (TRUE/FALSE) to determine what gets subsetted (or, in the context of functions, run). Boolean logic operators can be used to modify or combine tests to result in a single TRUE or FALSE.
Boolean Operation | Symbol in R |
---|---|
NOT | ! |
OR | \| |
AND | & |
These can be combined with the comparison operators (==, !=, <, <=, >, >=) to combine statements together into more complex logic statements. The result of the NOT, AND, and OR can be seen in the below logic table:
Statement | Becomes |
---|---|
!TRUE | FALSE |
!FALSE | TRUE |
TRUE & TRUE | TRUE |
TRUE & FALSE | FALSE |
FALSE & TRUE | FALSE |
FALSE & FALSE | FALSE |
TRUE \| TRUE | TRUE |
TRUE \| FALSE | TRUE |
FALSE \| TRUE | TRUE |
FALSE \| FALSE | FALSE |
For subsetting the final logical vector tells R which items to pull out - the positions that are TRUE. This logical vector needs to either be the same length as the vector being subsetted, or a factor of the length (so the vector can be ‘recycled’).
my_vec <- c("cat","dog","mouse", "horse")
# pull out the first and last elements
my_vec[c(TRUE, FALSE, FALSE, TRUE)]
[1] "cat" "horse"
# pull out the "odd" elements
my_vec[c(TRUE, FALSE)]
[1] "cat" "mouse"
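In practice the logical vector is rarely typed by hand; it is usually produced by a comparison. A minimal sketch, using a made-up named vector called ages:

```r
# Hypothetical data: ages (in years) of four animals
ages <- c(cat = 3, dog = 7, mouse = 1, horse = 12)

# A comparison returns a logical vector the same length as ages
is_old <- ages > 5
is_old                       # FALSE TRUE FALSE TRUE (with names)

# Use the logical vector to subset
ages[is_old]                 # dog = 7, horse = 12

# Tests can be combined with & (AND), | (OR) and ! (NOT)
ages[ages > 2 & ages < 10]   # cat = 3, dog = 7
ages[!(ages > 5)]            # cat = 3, mouse = 1
names(ages)[ages < 2 | ages > 10]   # "mouse" "horse"
```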
The element positions/indexes can be used to perform subsetting. The indexes in R start from 1, unlike many other programming languages.
A negative in front of the index means to remove that entry, but you can’t mix negative and positive indexes in the same command.
The index zero returns a zero-length vector of the same atomic type as the vector. Using an empty [] will return the entire vector. Specifying an index multiple times will duplicate the element.
my_vec
[1] "cat" "dog" "mouse" "horse"
my_vec[1]
[1] "cat"
my_vec[-1]
[1] "dog" "mouse" "horse"
my_vec[c(-2, -4)]
[1] "cat" "mouse"
my_vec[c(4,4,4)]
[1] "horse" "horse" "horse"
my_vec[0]
character(0)
my_vec[]
[1] "cat" "dog" "mouse" "horse"
`[`(my_vec, 3)
[1] "mouse"
example_list
[[1]]
[1] 1 2 3
[[2]]
[1] "a"
[[3]]
[1] TRUE FALSE TRUE
[[4]]
[1] 2.3 5.9
example_list[[c(1,2)]]
[1] 2
example_list[c(1,2)]
[[1]]
[1] 1 2 3
[[2]]
[1] "a"
[ (single square bracket) versus [[ (double square bracket) versus $ (dollar sign)
[ will, depending on the data structure, either return the original structure or reduce the dimensions of the returned structure. Use [ for 1-dimensional structures. Be careful when subsetting factors, as unused levels are kept unless drop = TRUE is specified. For 2-dimensional structures the format is [row indexes, column indexes] for indexes, or [condition to select rows, condition to select columns] for conditional subsetting.
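A small sketch of the 2-dimensional formats and of the factor behaviour, assuming a made-up data.frame called pets:

```r
# Hypothetical data for illustration
pets <- data.frame(
  name   = c("cat", "dog", "mouse", "horse"),
  legs   = c(4, 4, 4, 4),
  weight = c(4.5, 20, 0.02, 500)
)

# [row indexes, column indexes]
pets[c(1, 3), c(1, 3)]

# [condition to select rows, columns selected by name]
heavy <- pets[pets$weight > 10, c("name", "weight")]
heavy

# Leaving the column side empty keeps every column
pets[pets$weight > 10, ]

# Subsetting a factor keeps all of its levels unless drop = TRUE is given
f <- factor(c("a", "b", "c"))
levels(f[1])               # "a" "b" "c"
levels(f[1, drop = TRUE])  # "a"
```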
Data.frame
my_df <- data.frame(column1 = 1:3, numbers = 4:6)
my_df
column1 numbers
1 1 4
2 2 5
3 3 6
str(my_df[1,1])
int 1
str(my_df[1,])
'data.frame': 1 obs. of 2 variables:
$ column1: int 1
$ numbers: int 4
str(my_df[,1])
int [1:3] 1 2 3
str(my_df[,"numbers"])
int [1:3] 4 5 6
str(my_df[1,1, drop = FALSE])
'data.frame': 1 obs. of 1 variable:
$ column1: int 1
Tibble
my_tib <- tibble::tibble(col1 = 1:3, col2 = 4:6)
my_tib
# A tibble: 3 x 2
col1 col2
<int> <int>
1 1 4
2 2 5
3 3 6
str(my_tib[1,1])
tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
$ col1: int 1
str(my_tib[1,])
tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
$ col1: int 1
$ col2: int 4
str(my_tib[,1])
tibble [3 × 1] (S3: tbl_df/tbl/data.frame)
$ col1: int [1:3] 1 2 3
str(my_tib[,"col2"])
tibble [3 × 1] (S3: tbl_df/tbl/data.frame)
$ col2: int [1:3] 4 5 6
str(my_tib[1,1, drop = FALSE])
tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
$ col1: int 1
Matrix
my_mat <- matrix(1:4, nrow = 2, ncol =2)
my_mat
[,1] [,2]
[1,] 1 3
[2,] 2 4
str(my_mat[1,1])
int 1
str(my_mat[1,])
int [1:2] 1 3
str(my_mat[,1])
int [1:2] 1 2
str(my_mat[1,1, drop = FALSE])
int [1, 1] 1
List
my_list <- list(item1 = "a", item2 = 2, item3 = TRUE)
my_list
$item1
[1] "a"
$item2
[1] 2
$item3
[1] TRUE
str(my_list[1])
List of 1
$ item1: chr "a"
str(my_list["item1"])
List of 1
$ item1: chr "a"
[[ will return elements as a vector, and for data.frames and tibbles refers to the columns of the object. For a matrix, [[ returns a single element.
str(my_df[[1]])
int [1:3] 1 2 3
str(my_df[["numbers"]])
int [1:3] 4 5 6
str(my_tib[[1]])
int [1:3] 1 2 3
str(my_tib[["col2"]])
int [1:3] 4 5 6
str(my_mat[[1]])
int 1
str(my_list$item1)
chr "a"
$ is used for subsetting using a name (but not on an atomic vector), and it also does partial matching.
my_df <- data.frame(column1 = 1:3, numbers = 4:6)
str(my_df$column1)
int [1:3] 1 2 3
str(my_df$c)
Warning in my_df$c: partial match of 'c' to 'column1'
int [1:3] 1 2 3
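One practical difference between $ and [[ is sketched below: [[ accepts a name stored in a variable and matches names exactly by default, whereas $ takes the name literally and allows partial matches. The objects col_name and settings are made up for this illustration.

```r
my_df <- data.frame(column1 = 1:3, numbers = 4:6)

# [[ can use a column name held in a variable
col_name <- "numbers"
my_df[[col_name]]       # 4 5 6

# $ looks for a column literally called "col_name", which does not exist
my_df$col_name          # NULL

# Partial matching: $ allows it, [[ does not unless exact = FALSE
settings <- list(item1 = "a", value = 2)
settings$it                     # "a" (partial match to item1)
settings[["it"]]                # NULL
settings[["it", exact = FALSE]] # "a"
```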
Using the starwars data (dplyr::starwars):
- Create a variable called luke_films and assign to it the list of films that Luke Skywalker was in.
- Create a variable called films and assign it the entire contents of the starwars$films column using [[
A function in R is comprised of four parts:
As you have been using R you will have noticed that many tasks have a particular function already available for you to use, such as mean or sd. In this section we are going to learn how to make our own functions. We can define our own functions using the function() function. Inside the parentheses we define what variables are going to be passed to our function, and the curly braces contain the body of the function. If we want to return a value from our function, R will automatically return the result of the last line of the function body, or we can do so explicitly with return(). We can assign this new function to a variable so that we can call on it later. It is possible to have an anonymous function, and these are usually found as part of map or the apply family, but we won’t be covering anonymous functions in this workshop. To call our new function we now use the variable name and pass any required arguments.
Here is an example of how to create a function:
name <- function(variables) {
}
NB: in RStudio you can get a code snippet/template by typing “fun” and hitting <tab>
Here is an example function that will double the value of the provided number:
# Doubles the provided number
double <- function( num ){
num * 2
}
double(2)
[1] 4
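The example above relies on R returning the last evaluated line. A sketch of the explicit form with return() is shown here. The function halve is a hypothetical helper, not part of the workshop material.

```r
# Hypothetical helper: halves a number, returning NA early for non-numeric input
halve <- function(num) {
  if (!is.numeric(num)) {
    return(NA)   # explicit return: the function stops here
  }
  num / 2        # last evaluated expression is returned implicitly
}

halve(10)      # 5
halve("ten")   # NA
```

An explicit return() is mostly useful for leaving a function early; for the final value, relying on the last line is equally valid.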
Important: Functions don’t auto-update when you modify the code that creates them; you must re-run the entire function code block.
We can also have multiple arguments for our functions:
# Calculates BMI on a supplied height (m) and weight (kg)
calcBMI <- function(height, weight){
weight / height ^ 2
}
calcBMI(height = 1.68, weight = 73)
[1] 25.86451
NB: Variables declared only inside a function don’t exist outside of the function – see the Scope section.
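A minimal sketch of this behaviour, using a made-up function add_one and assuming no variable called inner_result already exists in your session:

```r
add_one <- function(x) {
  inner_result <- x + 1   # exists only while the function is running
  inner_result
}

add_one(2)                # 3
exists("inner_result")    # FALSE: the variable is not visible outside

# Arguments are not modified outside the function either
x <- 10
add_one(x)                # 11
x                         # still 10
```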
What is the point of learning about iteration? Similar to the reasons to create functions, iteration provides us a tool to be able to do repetitive tasks without having to copy and paste a lot of code. Take for instance the following example code that would read in csv files for a given country and then calculate the mean GDP for each:
Notice that there is a lot of code duplication (read_csv and mean are duplicated for each country). In this example there is also a typo, which is a very common mistake to make when changing inputs after copying and pasting - did you spot it? What happens if we need to include another 20 or 100 countries? What happens if we also needed to calculate the median GDP for each? It quickly becomes quite laborious to scale. This is where iteration is useful, as it is all about providing a mechanism to specify how to repeat things.
In an abstract form, the above example could be captured like this:
1. make a list of all the csv files
2. for each csv file in the list:
- calculate the mean of the gdp column
This abstraction of the problem now gives us the steps to follow and deals with the heart of the problem rather than having to worry about a specific implementation.
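The abstract steps above could be sketched in R roughly as follows. This is only one possible implementation: the folder, file names and gdp values are made up so that the example runs on its own, base R's read.csv() is used in place of read_csv, and it relies on a for loop, which is introduced in the next section.

```r
# Set-up: write two small made-up csv files to a temporary folder
data_dir <- file.path(tempdir(), "gdp_example")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(year = 2000:2002, gdp = c(1, 2, 3)),
          file.path(data_dir, "countryA.csv"), row.names = FALSE)
write.csv(data.frame(year = 2000:2002, gdp = c(10, 20, 30)),
          file.path(data_dir, "countryB.csv"), row.names = FALSE)

# 1. make a list of all the csv files
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# 2. for each csv file in the list: read it in and
#    calculate the mean of the gdp column
mean_gdp <- numeric(length(csv_files))
names(mean_gdp) <- basename(csv_files)
for (i in seq_along(csv_files)) {
  country_data <- read.csv(csv_files[i])
  mean_gdp[i] <- mean(country_data$gdp)
}
mean_gdp
```

Adding another 20 or 100 countries then only means adding files to the folder; the code itself does not change.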
for Loop
We are going to use the Palmer Penguins dataset with our for loops: a set of phenotypes from 3 species and 3 islands, put together by Allison Horst.
if (!require("palmerpenguins")) install.packages("palmerpenguins")
library(palmerpenguins)
for loops usually contain the following parts:
The most common loop is the for loop. The template is as follows:
for (variable in vector) {
# loop body i.e. what to do each time
}
NB: You can get the for loop code snippet in RStudio by typing for then hitting <tab> and selecting “for {snippet}” from the drop down.
When you see a for loop you can read it like a sentence: for each thing in my collection of things, I will do something to the first thing, then choose the next thing, do something, and repeat, until I have done something to each of the things in my collection.
We’ll compare this snippet to the following example which will print out the numbers 1 to 3 to explain what is going on.
for( num in 1:3 ){
print( num )
}
[1] 1
[1] 2
[1] 3
In this example, we wanted to print out each item from our set. Our set was a vector of numbers 1 to 3 in this case (in R a vector with a range of numbers can be made using the : operator in the format start:end). The task that we will do repetitively is print – our loop body. num is going to store the value of the current item. Our vector or collection is the numbers 1 to 3.
The loop gets run as such:
1. num takes on the first value from our set (1)
2. the loop body is run (printing num which is 1)
3. num takes on the second value from the set (2)
4. the loop body is run (printing num which is 2)
5. num takes on the third value from our set (3)
6. the loop body is run (printing num which is 3)
This is how we could have achieved this task without a loop:
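A sketch of what the copy-and-paste version would look like, assuming one print() call per number:

```r
# The same task without a loop: one print() call per value
print(1)
print(2)
print(3)
```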
The duplication is apparent but not particularly laborious in this case. However, think about how this would scale. What if suddenly you needed to print the numbers 1 to 100, or 1000, or 10000? Using the copy-paste print() and manually filling in numbers is going to be pretty laborious and highly prone to typos. Using the for loop, however, scales extremely easily and would be a matter of only changing the collection of items going in; everything else remains the same:
for( num in 1:10000 ){
print( num )
}
Print out the Column Names of the Penguins Dataset, 1 at a time.
for loop with indices
One version of the for loop that you might encounter (especially in other languages) is a version that uses indices to determine the current item from the set. In this case, rather than the loop variable using the values of the items themselves, it uses the index of the item in the collection. Traditionally the loop variable is called i in this situation. While you could specify the indices manually through a vector, e.g. 1:5 or 1:length(myvector), this can lead to some issues, and the safer way is for R to generate the indices using seq_along(), which returns a vector with all the indexes of your object.
[1] 1 2 3 4 5
# print each number and the index used from the collection by using the index to subset
for( i in seq_along(myNumbers) ){
print(paste("number =", myNumbers[i], "index (i) =", i))
}
[1] "number = 11 index (i) = 1"
[1] "number = 13 index (i) = 2"
[1] "number = 15 index (i) = 3"
[1] "number = 17 index (i) = 4"
[1] "number = 19 index (i) = 5"
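One of the issues with 1:length() shows up when the vector is empty. A minimal sketch, using a made-up empty vector empty_vec:

```r
empty_vec <- character(0)

1:length(empty_vec)     # 1 0 : counts down from 1 to 0
seq_along(empty_vec)    # integer(0) : nothing to loop over

# Count how many times each style of loop runs
runs_colon <- 0
for (i in 1:length(empty_vec)) {
  runs_colon <- runs_colon + 1
}

runs_seq <- 0
for (i in seq_along(empty_vec)) {
  runs_seq <- runs_seq + 1
}

c(runs_colon, runs_seq)  # 2 0
```

With 1:length() the loop body runs twice on a vector that has no elements, which typically produces NA values or errors; with seq_along() the loop is skipped.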
Using indices, calculate the mean of the 2nd, 3rd and 5th Columns
It’s generally recommended to avoid the nesting of loops within other loops. Let’s say the outer loop has a total of n iterations to get through and an inner loop has m. Every time we add one extra iteration of the outer loop we end up adding an extra m iterations of the inner loop, so the total number of iterations is n * m. Depending on how big m is, this could be adding thousands or millions of extra iterations, causing your code to take longer to run. Sometimes, however, nesting is unavoidable, but it’s a good idea to keep an eye out for nesting if your code is taking a while to run, as this is usually the first place things can be sped up.
sex <- c("female", "male")
# species <- c("Adelie", "Chinstrap", "Gentoo")
species <- levels(penguins$species)
for (i in species) {
  for (j in sex) {
    # subset the data to the current species and sex
    new_data <- subset(penguins, penguins$species == i & penguins$sex == j)
    # calculate the mean body mass, ignoring missing values
    mean_value <- mean(new_data$body_mass_g, na.rm = TRUE)
    # report the value, converted from grams to kilograms
    print(paste("The Average weight of",j,i,"penguins =",round(mean_value/1000,2),"Kgs"))
  }
}
[1] "The Average weight of female Adelie penguins = 3.37 Kgs"
[1] "The Average weight of male Adelie penguins = 4.04 Kgs"
[1] "The Average weight of female Chinstrap penguins = 3.53 Kgs"
[1] "The Average weight of male Chinstrap penguins = 3.94 Kgs"
[1] "The Average weight of female Gentoo penguins = 4.68 Kgs"
[1] "The Average weight of male Gentoo penguins = 5.48 Kgs"
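The n * m growth described above can be checked with a simple counter. The vectors below are made up: an outer loop of n = 3 and an inner loop of m = 4.

```r
outer_values <- 1:3   # n = 3
inner_values <- 1:4   # m = 4

counter <- 0
for (i in outer_values) {
  for (j in inner_values) {
    counter <- counter + 1   # runs once per (i, j) pair
  }
}

counter                                                  # 12
counter == length(outer_values) * length(inner_values)   # TRUE
```

The penguin example above follows the same pattern: 3 species and 2 sexes give 6 runs of the inner loop body, which matches the 6 lines of output.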