Delving Deeper into R

A deeper look into topics that get skipped.

Murray Cadzow https://github.com/murraycadzow (University of Otago), Matt Bixley (NeSI)
2021-07-05

Time: 90 minutes

Description: This workshop will extend your understanding of R and cover how to interact with non-data.frame structured data such as vectors, matrices and lists. We’ll also look at creating your own functions and loops.

Learning objectives:

This workshop follows an explanation-then-try-it-yourself format, using exercises.

Motivation: Many frustrations of dealing with R stem from not understanding how R is working ‘under-the-hood’. We want to explain some of the details that get skipped as part of introductory courses so that you understand what is going on and how to delve into some of the oddities. We also want to show you how to start being able to implement solutions in R to tackle your specific problems.

Atomics

This chapter is useful for additional information: https://adv-r.hadley.nz/vectors-chap.html

There are four main types of atomic vector: logical, integer, double, and character.

Double and integer are collectively known as numeric vectors.

Using c() to combine: c() combines atomic vectors, and when all of the arguments are atomic vectors it flattens the structure.

chr_vec <- c("a", "b")
chr_vec
[1] "a" "b"
c(chr_vec, c(chr_vec, chr_vec))
[1] "a" "b" "a" "b" "a" "b"

We can use typeof to find out the atomic type of our vectors

typeof(chr_vec)
[1] "character"
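As a minimal sketch, typeof() can be applied to one example value of each of the four main atomic types. The example values are chosen purely for illustration.

```r
# One example value of each of the four main atomic types
typeof(TRUE)     # "logical"
typeof(1L)       # "integer" (the L suffix marks an integer literal)
typeof(1.5)      # "double"
typeof("a")      # "character"

# Integer and double vectors both count as numeric
is.numeric(1L)   # TRUE
is.numeric(1.5)  # TRUE
```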

Empty vectors

We can create empty vectors of specific lengths using the atomic types

double(length = 2)
[1] 0 0
numeric(length = 3)
[1] 0 0 0
character(length = 3)
[1] "" "" ""
logical(2)
[1] FALSE FALSE
double(0)
numeric(0)

Coercion

All elements of an atomic vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type. Types from least to most flexible are: logical -> integer -> double -> character.

We can explicitly use coercion on vectors with the as.numeric, as.character, and as.logical functions.
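A short sketch of both kinds of coercion described above. The values are made up for illustration; the comments show what each line evaluates to.

```r
# Combining types coerces to the most flexible type present
typeof(c(TRUE, 2L))    # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))     # "double"
typeof(c(3.5, "a"))    # "character": 3.5 becomes "3.5"

# Explicit coercion
as.numeric(c("1", "2.5"))    # 1.0 2.5
as.character(c(1, 2))        # "1" "2"
as.logical(c("TRUE", "no"))  # TRUE NA ("no" is not a recognised logical string)
as.numeric("ten")            # NA, with a warning that NAs were introduced
```

When a value cannot be converted, R returns NA rather than stopping, so it is worth checking for NA after an explicit coercion.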

Exercise

Create the following vectors:

  1. four numbers as a numeric called my_nums
  2. six words of your choice as a character called my_words
  3. an empty numeric vector of length 7 called num_results.
  4. an empty character vector of length 3 called char_results.
  5. combine the my_nums and my_words vectors into a single vector called nums_words

What do you notice are the values in the empty numeric and character vectors?

Data structures

This section is complemented by https://adv-r.hadley.nz/vectors-chap.html for a more in-depth understanding and explanation.

Structures

In R there are two broad types of data structure. One stores homogeneous data, i.e. the data must all be of the same type (such as numeric). The other stores heterogeneous data, i.e. the data can be of different types.

The main structures for homogeneous data are the one-dimensional vector and the two-dimensional matrix. A matrix requires all columns to be of equal length and all rows to be of equal length.

example_vector <- c(1,2,5,6)
example_vector
[1] 1 2 5 6
example_matrix <- matrix(c(1:10), nrow = 2, byrow = TRUE)
example_matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10

The main heterogeneous structures are the list and the data.frame (and tibble). A list is a series of elements, each of which can be of a different data type and dimension. It is a very versatile structure. A data.frame is a list of variables that all have the same number of rows. Before R 4.0.0, data.frame() converted character vectors to factors by default; from R 4.0.0 onwards this is no longer the default behaviour. A tibble is a special type of data.frame with a priority on printing and inspection; it comes from the tibble package, which is part of the tidyverse.

example_list <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
example_list
[[1]]
[1] 1 2 3

[[2]]
[1] "a"

[[3]]
[1]  TRUE FALSE  TRUE

[[4]]
[1] 2.3 5.9
example_df <- data.frame(col1 = c("a", "b", "c"), 
                         col2 = c(1, 2, 3), 
                         col3 = c(TRUE, FALSE, FALSE) )
example_df
  col1 col2  col3
1    a    1  TRUE
2    b    2 FALSE
3    c    3 FALSE
example_tib <- tibble::tibble(col1 = c("a", "b", "c"), 
                              col2 = c(1, 2, 3),
                              col3 = c(TRUE, FALSE, FALSE) )
example_tib
# A tibble: 3 x 3
  col1   col2 col3 
  <chr> <dbl> <lgl>
1 a         1 TRUE 
2 b         2 FALSE
3 c         3 FALSE
items <- c("fork" = 5, "table" = 1,"knife" = 6,"spoon" = 4)
names(items)

attributes(items)


attributes(mtcars)

Subsetting

https://adv-r.hadley.nz/subsetting.html

Subsetting operators: [, [[, and $

We’ll look at the differences between these operators later.

There are two main methods of subsetting in R, the first is by specifying the numbered positions (index) from the data structure we have. The second is by providing a logical vector - usually created through a conditional statement.

Conditional

These statements rely on a test (condition/comparison) that results in a boolean (TRUE/FALSE) to determine what gets subsetted (or, in the context of functions, run). Boolean logic operators can be used to modify or combine tests so that they result in a single TRUE or FALSE.

Boolean operation    Symbol in R
NOT                  !
OR                   |
AND                  &

These can be combined with the comparison operators (==, !=, <, <=, >, >=) to build more complex logical statements. The results of NOT, AND, and OR can be seen in the logic table below:

Statement          Becomes
!TRUE              FALSE
!FALSE             TRUE
TRUE & TRUE        TRUE
TRUE & FALSE       FALSE
FALSE & TRUE       FALSE
FALSE & FALSE      FALSE
TRUE | TRUE        TRUE
TRUE | FALSE       TRUE
FALSE | TRUE       TRUE
FALSE | FALSE      FALSE
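As an illustration of combining comparison and boolean operators, the sketch below uses a made-up vector called ages (not part of the workshop data). Each operator works element by element and returns a logical vector.

```r
ages <- c(12, 25, 37, 68)     # hypothetical data for illustration

ages >= 18 & ages < 65        # FALSE  TRUE  TRUE FALSE
ages < 18 | ages >= 65        #  TRUE FALSE FALSE  TRUE
!(ages >= 18)                 #  TRUE FALSE FALSE FALSE

# The resulting logical vector can be used directly to subset
ages[ages >= 18 & ages < 65]  # 25 37
```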

For subsetting, the final logical vector tells R which items to pull out: the positions that are TRUE. This logical vector should either be the same length as the vector being subset, or have a length that divides evenly into it (so the logical vector can be ‘recycled’).

my_vec <- c("cat","dog","mouse", "horse")

# pull out the first and last elements
my_vec[c(TRUE, FALSE, FALSE, TRUE)]
[1] "cat"   "horse"
# pull out the "odd" elements
my_vec[c(TRUE, FALSE)]
[1] "cat"   "mouse"

Positional/Index

The element positions/indexes can be used to perform subsetting. The indexes in R start from 1, unlike many other programming languages.

A negative sign in front of an index removes that entry, but you can’t mix negative and positive indexes in the same command.

The index zero returns a zero length vector of the vector atomic type. Using an empty [] will return the entire vector. Specifying an index multiple times will duplicate the element.

my_vec
[1] "cat"   "dog"   "mouse" "horse"
my_vec[1]
[1] "cat"
my_vec[-1]
[1] "dog"   "mouse" "horse"
my_vec[c(-2, -4)]
[1] "cat"   "mouse"
my_vec[c(4,4,4)]
[1] "horse" "horse" "horse"
my_vec[0]
character(0)
my_vec[]
[1] "cat"   "dog"   "mouse" "horse"
`[`(my_vec, 3)
[1] "mouse"
example_list
[[1]]
[1] 1 2 3

[[2]]
[1] "a"

[[3]]
[1]  TRUE FALSE  TRUE

[[4]]
[1] 2.3 5.9
example_list[[c(1,2)]]
[1] 2
example_list[c(1,2)]
[[1]]
[1] 1 2 3

[[2]]
[1] "a"

Differences in subset operators

[ (single square bracket) versus [[ (double square bracket) versus $ (dollar sign)

Depending on the data structure, [ either returns the original structure or reduces the dimensions of what is returned. Use [ for one-dimensional structures. Be careful when subsetting factors, as unused levels are kept by default. For two-dimensional structures the format is [row indexes, column indexes] for positional subsetting, or [condition to select rows, condition to select columns] for conditional subsetting.
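The caution about factors can be illustrated with a small sketch. The factor animals is a made-up example; the point is that [ keeps every original level unless told otherwise.

```r
animals <- factor(c("cat", "dog", "mouse"))  # hypothetical example factor

kept <- animals[1]
levels(kept)               # "cat" "dog" "mouse": unused levels are retained

dropped <- animals[1, drop = TRUE]
levels(dropped)            # "cat"

# droplevels() removes unused levels after the fact
levels(droplevels(kept))   # "cat"
```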

Data.frame

my_df <- data.frame(column1 = 1:3, numbers = 4:6)
my_df
  column1 numbers
1       1       4
2       2       5
3       3       6
str(my_df[1,1])
 int 1
str(my_df[1,])
'data.frame':   1 obs. of  2 variables:
 $ column1: int 1
 $ numbers: int 4
str(my_df[,1])
 int [1:3] 1 2 3
str(my_df[,"numbers"])
 int [1:3] 4 5 6
str(my_df[1,1, drop = FALSE])
'data.frame':   1 obs. of  1 variable:
 $ column1: int 1

Tibble

my_tib <- tibble::tibble(col1 = 1:3, col2 = 4:6)
my_tib
# A tibble: 3 x 2
   col1  col2
  <int> <int>
1     1     4
2     2     5
3     3     6
str(my_tib[1,1])
tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
 $ col1: int 1
str(my_tib[1,])
tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
 $ col1: int 1
 $ col2: int 4
str(my_tib[,1])
tibble [3 × 1] (S3: tbl_df/tbl/data.frame)
 $ col1: int [1:3] 1 2 3
str(my_tib[,"col2"])
tibble [3 × 1] (S3: tbl_df/tbl/data.frame)
 $ col2: int [1:3] 4 5 6
str(my_tib[1,1, drop = FALSE])
tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
 $ col1: int 1

Matrix

my_mat <- matrix(1:4, nrow = 2, ncol =2)
my_mat
     [,1] [,2]
[1,]    1    3
[2,]    2    4
str(my_mat[1,1])
 int 1
str(my_mat[1,])
 int [1:2] 1 3
str(my_mat[,1])
 int [1:2] 1 2
str(my_mat[1,1, drop = FALSE])
 int [1, 1] 1

List

my_list <- list(item1 = "a", item2 = 2, item3 = TRUE)
my_list
$item1
[1] "a"

$item2
[1] 2

$item3
[1] TRUE
str(my_list[1])
List of 1
 $ item1: chr "a"
str(my_list["item1"])
List of 1
 $ item1: chr "a"

[[ extracts a single element. For a data.frame or tibble that element is a column, returned as a vector; for a matrix it is a single value; for a list it is the contents of that item.

str(my_df[[1]])
 int [1:3] 1 2 3
str(my_df[["numbers"]])
 int [1:3] 4 5 6
str(my_tib[[1]])
 int [1:3] 1 2 3
str(my_tib[["col2"]])
 int [1:3] 4 5 6
str(my_mat[[1]])
 int 1
str(my_list$item1)
 chr "a"

$ is used for subsetting using a name (but not on an atomic vector), and it also does partial matching.

my_df <- data.frame(column1 = 1:3, numbers = 4:6)

str(my_df$column1)
 int [1:3] 1 2 3
str(my_df$c)
Warning in my_df$c: partial match of 'c' to 'column1'
 int [1:3] 1 2 3
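For comparison, here is a brief sketch of how [[ treats the same partial name on a data.frame. It reuses my_df from above; the exact argument is part of base R's [[ for data frames.

```r
my_df <- data.frame(column1 = 1:3, numbers = 4:6)

my_df[["c"]]                 # NULL: [[ matches names exactly by default
my_df[["c", exact = FALSE]]  # 1 2 3: partial matching has to be requested
my_df[["column1"]]           # 1 2 3
```

Because [[ does not partially match unless asked, it is often the safer choice inside scripts and functions.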

Exercises

Using the starwars data (dplyr::starwars):

Functions

A function in R is made up of four parts:

  1. a name
  2. inputs (arguments/variables)
  3. the body (the code that does something)
  4. output (what gets returned after the body has run)

As you have been using R you will have noticed that many tasks already have a function available for you to use, such as mean or sd. In this section we are going to learn how to make our own functions. We define our own functions using the function() function. Inside the parentheses we define what variables are going to be passed to our function, and curly braces contain the body of the function. If we want to return a value from our function, R will automatically return the result of the last expression evaluated in the function body, or we can do so explicitly with return().

We can assign this new function to a variable so that we can call on it later. It is possible to have an anonymous function, and these are usually found as part of map or the apply family, but we won’t be covering anonymous functions in this workshop. To call our new function we use the variable name and pass any required arguments.

Here is an example of how to create a function:

name <- function(variables) {
  
}

NB: in RStudio you can get a code snippet/template by typing “fun” and hitting <tab>

Here is an example function that will double the value of the provided number:

# Doubles the provided number
# NB: naming this function double masks base R's double() (used earlier to make empty vectors)
double <- function( num ){
  num * 2
}

double(2)
[1] 4
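The same calculation could also be written with an explicit return(). This is only a sketch; the name double_explicit is made up here to avoid reusing the name above.

```r
# Hypothetical variant of the doubling function with an explicit return()
double_explicit <- function(num) {
  result <- num * 2
  return(result)   # explicitly hand back the value
}

double_explicit(2)   # 4
```

Both styles give the same answer. An explicit return() is mostly useful when you want to leave a function early.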

Important: Functions don’t auto-update when you modify the code that creates them; you must re-run the entire function code block.

We can also have multiple arguments for our functions:

# Calculates BMI on a supplied height (m) and weight (kg)
calcBMI <- function(height, weight){
  weight / height ^ 2
}


calcBMI(height = 1.68, weight = 73)
[1] 25.86451

NB: Variables declared only inside a function don’t exist outside of the function – see the Scope section.
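A small sketch of what this means in practice. The function calc_area and the variable inner_result are hypothetical names used only for this illustration.

```r
calc_area <- function(width, height) {
  inner_result <- width * height   # exists only while the function is running
  inner_result
}

calc_area(2, 3)          # 6
exists("inner_result")   # FALSE: the variable was never created outside the function
```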

Iteration

What is the point of learning about iteration? Similar to the reasons to create functions, iteration gives us a tool to do repetitive tasks without having to copy and paste a lot of code. Take for instance the following example code, which reads in a csv file for each of three countries and then calculates the mean GDP for each:

library(readr)  # provides read_csv()
data_nzl <- read_csv("gapminder_countries/nzl.csv")
data_aus <- read_csv("gapminder_countries/aus.csv")
data_usa <- read_csv("gapminder_countries/usa.csv")

# calculate the mean gdp from each country
mean(data_nzl$gdp)
mean(data_aus$gdp)
mean(data_usa$gdp)

Notice that there is a lot of code duplication (read_csv and mean are repeated for each country). Duplicated code like this is also an easy place to introduce typos, a very common mistake when changing inputs after copying and pasting. What happens if we need to include another 20 or 100 countries? What happens if we also need to calculate the median GDP for each? It quickly becomes quite laborious to scale. This is where iteration is useful, as it is all about providing a mechanism to specify how to repeat things.

In an abstract form, the above example could be captured like this:

1. make a list of all the csv files
2. for each csv file in the list:
    - calculate the mean of the gdp column

This abstraction of the problem now gives us the steps to follow and deals with the heart of the problem rather than having to worry about a specific implementation.
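One possible way to turn those abstract steps into code is sketched below. To keep the sketch self-contained it writes two small made-up csv files to a temporary folder (the folder and file names are assumptions, not the workshop's gapminder files) and uses base R's read.csv rather than read_csv.

```r
# Hypothetical stand-in data, written to a temporary folder
csv_dir <- file.path(tempdir(), "gapminder_countries_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(gdp = c(1, 2, 3)), file.path(csv_dir, "aaa.csv"), row.names = FALSE)
write.csv(data.frame(gdp = c(10, 20)),  file.path(csv_dir, "bbb.csv"), row.names = FALSE)

# 1. make a list of all the csv files
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# 2. for each csv file: read it in and calculate the mean of the gdp column
mean_gdp <- numeric(length(csv_files))       # somewhere to store the results
names(mean_gdp) <- basename(csv_files)
for (i in seq_along(csv_files)) {            # seq_along() gives the positions 1, 2, ...
  country_data <- read.csv(csv_files[i])
  mean_gdp[i] <- mean(country_data$gdp)
}

mean_gdp
```

Adding another country now only means adding another file to the folder; the loop itself does not change.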

The for Loop

Palmer Penguins Dataset

We are going to use the Palmer Penguins dataset with our for loops: a set of phenotype measurements from 3 species and 3 islands, put together by Allison Horst.

for loops usually contain the following parts:

  1. an output - somewhere to store the results from the loop
  2. an input set of items to do something to (a vector)
  3. loop body - code that will do something for a single value of the set.

The most common loop is the for loop. The template is as follows:

for (variable in vector) {
  # loop body i.e. what to do each time
}

NB: You can get the for loop code snippet in RStudio by typing for then hitting <tab> and selecting “for {snippet}” from the drop down.

When you see a for loop you can read it like a sentence: for each thing in my collection of things, I will do something to the first thing, then choose the next thing, do something, and repeat until I have done something to each of the things in my collection.

We’ll compare this snippet to the following example which will print out the numbers 1 to 3 to explain what is going on.

for( num in 1:3 ){
  print( num )
}
[1] 1
[1] 2
[1] 3

In this example, we wanted to print out each item from our set. Our set was a vector of numbers 1 to 3 in this case (in R a vector with a range of numbers can be made using the : operator in the format start:end). The task that we will do repetitively is print – our loop body. num is going to store the value of the current item. Our vector or collection is the numbers 1 to 3.

The loop gets run as such:

  1. num takes on the first value from our set (1)
  2. the loop body runs (prints the value of num which is 1)
  3. there is nothing more to be done in the body so it moves onto the next item
  4. num takes on the second value from the set (2)
  5. the loop body runs (prints the value of num which is 2)
  6. there is nothing more to be done in the body so it moves to the next item
  7. num takes on the third value from our set (3)
  8. the loop body runs (prints the value of num which is 3)
  9. there is nothing more to be done in the body so it moves to the next item
  10. there is not a next item so the loop exits.

This is how we could have achieved this task without a loop:

print(1)
[1] 1
print(2)
[1] 2
print(3)
[1] 3

The duplication is apparent but not particularly laborious in this case. However, think about how this would scale. What if suddenly you needed to print the numbers 1 to 100, or 1000, or 10000? Copy-pasting print() and manually filling in numbers is going to be pretty laborious and highly prone to typos. The for loop, however, scales extremely easily: it is only a matter of changing the collection of items going in, and everything else remains the same:

for( num in 1:10000 ){
  print( num )
}
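The three parts of a for loop listed earlier (an output, an input set, and a loop body) can be seen together in a small sketch. The vector values and the running total are made up for illustration.

```r
values <- c(2, 4, 6)   # input: the set of items to work through
total <- 0             # output: somewhere to store the result

for (value in values) {
  total <- total + value   # loop body: what to do for a single item
}

total   # 12
```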

Exercise

Print out the Column Names of the Penguins Dataset, 1 at a time.

for loop with indices

One version of the for loop that you might encounter (especially in other languages) uses indices to determine the current item from the set. Rather than the loop variable taking the values of the items themselves, it takes the index of each item in the collection. Traditionally the loop variable is called i in this situation. While you could specify the indices manually through a vector, e.g. 1:5 or 1:length(myvector), this can cause problems: if myvector is empty, 1:length(myvector) gives c(1, 0) and the loop runs twice over indices that don’t exist. The safer way is to let R generate the indices using seq_along(), which returns a vector with all the indexes of your object.

myNumbers <- c(11,13,15,17,19)

# show example of what seq_along() is providing
seq_along(myNumbers)
[1] 1 2 3 4 5
# print each number and the index used from the collection by using the index to subset
for( i in seq_along(myNumbers) ){
  print(paste("number =", myNumbers[i], "index (i) =", i))
}
[1] "number = 11 index (i) = 1"
[1] "number = 13 index (i) = 2"
[1] "number = 15 index (i) = 3"
[1] "number = 17 index (i) = 4"
[1] "number = 19 index (i) = 5"
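The difference between 1:length() and seq_along() shows up with an empty vector. A minimal sketch, using a made-up empty vector called empty_vec:

```r
empty_vec <- numeric(0)   # hypothetical vector with no elements

1:length(empty_vec)       # 1 0: a loop over this would run twice
seq_along(empty_vec)      # integer(0): a loop over this would not run at all

runs <- 0
for (i in seq_along(empty_vec)) {
  runs <- runs + 1
}
runs   # 0
```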

Exercise

Using indices, calculate the mean of the 2nd, 3rd and 5th Columns

Nesting of loops

It’s generally recommended to avoid nesting loops within other loops where you can. Let’s say the outer loop has a total of n iterations to get through and an inner loop has m. Every extra iteration of the outer loop adds an extra m iterations of the inner loop, so the total number of iterations is n * m. Depending on how big m is, this could add thousands or millions of extra iterations, causing your code to take longer to run. Sometimes nesting is unavoidable, but it’s a good idea to keep an eye out for it if your code is taking a while to run, as this is often the first place things can be sped up.

library(palmerpenguins)  # provides the penguins data
sex <- c("female", "male")
#species <- c("Adelie","Chinstrap","Gentoo" )
species <- levels(penguins$species)

for (i in species) {
  for (j in sex) {
    
    # actions
    # subset the data
    new_data <- subset(penguins, penguins$species == i & penguins$sex == j) 
    
    # calculate something
    mean_value <- mean(new_data$body_mass_g, na.rm = TRUE)
    
    # return a value
    print(paste("The Average weight of",j,i,"penguins =",round(mean_value/1000,2),"Kgs"))
    
  }
  
}
[1] "The Average weight of female Adelie penguins = 3.37 Kgs"
[1] "The Average weight of male Adelie penguins = 4.04 Kgs"
[1] "The Average weight of female Chinstrap penguins = 3.53 Kgs"
[1] "The Average weight of male Chinstrap penguins = 3.94 Kgs"
[1] "The Average weight of female Gentoo penguins = 4.68 Kgs"
[1] "The Average weight of male Gentoo penguins = 5.48 Kgs"
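The n * m growth described above can be checked with a simple counter. This is only a sketch; n, m and count are illustrative names, with 3 and 2 chosen to mirror the three species and two sexes in the example.

```r
n <- 3       # iterations of the outer loop
m <- 2       # iterations of the inner loop
count <- 0

for (i in seq_len(n)) {
  for (j in seq_len(m)) {
    count <- count + 1   # runs once for every (i, j) combination
  }
}

count   # 6, i.e. n * m
```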

Running Scripts