More Tidyverse

Take the time to explore the lesser known functions and packages of the Tidyverse

Murray Cadzow (University of Otago), Matt Bixley (NeSI)
2021-07-05

Time: 90 min

Description: Take the time to explore the lesser known functions and packages of the Tidyverse. Learn about how to access googlesheets, use times/dates, manipulate text, use functional programming to replace loops, and more.

Learning objectives

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. - https://www.tidyverse.org

Two excellent resources:
- R for Data Science
- Tidyverse Cookbook

Core Tidyverse

library(tidyverse)
── Attaching packages ───────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.1.2     ✓ dplyr   1.0.6
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.1
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

tibble

tibble::tribble is a useful function if you want to manually input a small amount of data.

tibble::tribble( ~column1, ~column2, ~column3,
                 "a", 1, TRUE,
                 "b", 2, FALSE)
# A tibble: 2 x 3
  column1 column2 column3
  <chr>     <dbl> <lgl>  
1 a             1 TRUE   
2 b             2 FALSE  

tibble::glimpse is similar to str but can be embedded in pipelines as it invisibly returns the original data.

tibble::glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 1…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7,…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190,…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00,…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2…
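
Because glimpse() returns its input invisibly, it can sit in the middle of a pipeline as a quick check. The following is a minimal sketch, assuming dplyr is attached for the pipe and filter(); the wt > 5 cut-off is only illustrative.

```r
library(dplyr)

heavy_cars <- mtcars %>%
  tibble::glimpse() %>%   # prints the summary, then passes mtcars along unchanged
  filter(wt > 5)          # the pipeline carries on with the full data

nrow(heavy_cars)
```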

Exercise

Use tribble() to manually input some data

dplyr

dplyr::rename provides the renaming ability of dplyr::select but it doesn’t subset.

dplyr::transmute is similar to dplyr::mutate except only columns specified are kept.
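
As a rough illustration of the difference between the two, the sketch below uses the built-in mtcars data; the new column names (miles_per_gallon, power_to_weight) are made up for the example.

```r
library(dplyr)

# rename() keeps every column and only changes the names given (new = old)
renamed <- mtcars %>%
  rename(miles_per_gallon = mpg)

# transmute() keeps only the columns it names or creates
ratios <- mtcars %>%
  transmute(cyl, power_to_weight = hp / wt)

names(renamed)
names(ratios)
```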

dplyr::everything, dplyr::starts_with, dplyr::contains and dplyr::ends_with are helper functions that make it easy to select columns. These functions technically live in the tidyselect package, along with tidyselect::where, which takes a predicate function such as is.numeric and selects the columns for which it returns TRUE.
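
A short sketch of the selection helpers, using the built-in iris data as a stand-in for your own:

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

sepal_cols    <- iris_tbl %>% select(starts_with("Sepal"))   # Sepal.Length, Sepal.Width
width_cols    <- iris_tbl %>% select(ends_with("Width"))     # Sepal.Width, Petal.Width
length_cols   <- iris_tbl %>% select(contains("Length"))     # Sepal.Length, Petal.Length
numeric_cols  <- iris_tbl %>% select(where(is.numeric))      # every numeric column
species_first <- iris_tbl %>% select(Species, everything())  # Species, then the rest
```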

dplyr::relocate is a very useful function that enables you to move columns to specific locations. It uses a .before or .after argument to specify where you want to move the column(s) to.
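
For example, a minimal sketch with mtcars (the choice of columns is arbitrary):

```r
library(dplyr)

# move gear and carb so they sit in front of mpg
front <- mtcars %>% relocate(gear, carb, .before = mpg)

# move mpg to the very end
back <- mtcars %>% relocate(mpg, .after = last_col())

names(front)
names(back)
```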

dplyr::across lets you apply a function or functions across multiple columns. For applying a function based on data type, pair with tidyselect::where. This is usually used as part of a dplyr::summarise or dplyr::mutate statement.
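
The sketch below shows across() in both settings, using iris as example data: once inside summarise() to average every numeric column per group, and once inside mutate() to convert every factor column.

```r
library(dplyr)

# mean of every numeric column, per species
species_means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))

# convert every factor column to character
iris_chr <- iris %>%
  mutate(across(where(is.factor), as.character))
```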

dplyr::c_across lets you combine values from across multiple columns, such as performing a summarising function on selected columns. It is usually paired with dplyr::rowwise so that the summary is calculated within each row.
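
A minimal sketch of a row-wise total, using a small made-up tibble (the columns id, x, y and z are hypothetical):

```r
library(dplyr)

scores <- tibble::tribble(
  ~id, ~x, ~y, ~z,
  "a",  1,  2,  3,
  "b",  4,  5,  6
)

totals <- scores %>%
  rowwise() %>%                          # treat each row as its own group
  mutate(total = sum(c_across(x:z))) %>% # combine x, y and z within the row
  ungroup()

totals$total
```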

dplyr::case_when is useful in situations where you would otherwise nest multiple ifelse calls.
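
As a sketch, the conditions are checked in order and the first one that matches wins. The data and the thresholds below are invented for illustration.

```r
library(dplyr)

cars <- tibble(mpg = c(30, 20, 12))

cars <- cars %>%
  mutate(efficiency = case_when(
    mpg >= 25 ~ "high",
    mpg >= 18 ~ "medium",
    TRUE      ~ "low"      # fallback for everything else
  ))

cars$efficiency
```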

Exercise

tidyr

tidyr::nest condenses the specified columns into a single list column: each row of the result holds a data frame containing the nested columns for one combination of the remaining columns. tidyr::unnest does the reverse, expanding the elements of a list column back into regular rows and columns. These functions are particularly useful in conjunction with pivot_longer/pivot_wider.
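
A small sketch of a nest/unnest round trip, using a made-up tibble (group, x and y are hypothetical column names):

```r
library(dplyr)
library(tidyr)

measurements <- tibble::tribble(
  ~group, ~x, ~y,
  "a",     1, 10,
  "a",     2, 20,
  "b",     3, 30
)

# one row per group; x and y are packed into a list column called data
nested <- measurements %>% nest(data = c(x, y))

# unnest() expands the list column back into ordinary rows and columns
restored <- nested %>% unnest(data)
```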

tidyr::unite combines columns into one using a separator, and tidyr::separate splits one column into new columns based on a separator. tidyr::separate_rows also splits a column on a separator, but into separate rows, duplicating the rest of the row’s data for each piece.
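
The following sketch runs all three on a small invented tibble (the id, date and tags columns are hypothetical):

```r
library(dplyr)
library(tidyr)

posts <- tibble::tribble(
  ~id, ~date,        ~tags,
  1,   "2021-07-05", "r,tidyverse",
  2,   "2021-07-06", "purrr"
)

# split date into three new columns
parts <- posts %>% separate(date, into = c("year", "month", "day"), sep = "-")

# glue them back together with a different separator
joined <- parts %>% unite("date", year, month, day, sep = "/")

# one row per tag, with id and date repeated
long <- posts %>% separate_rows(tags, sep = ",")
```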

tidyr::crossing provides a mechanism to create all combinations by ‘crossing’ the values in two vectors. This is particularly useful when combined with purrr for running a function on all sub-groups within your data.
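
A minimal sketch with two made-up vectors. Note that crossing() also removes duplicates and sorts each input before combining them.

```r
library(tidyr)

# dose contains a duplicate, which crossing() drops
grid <- crossing(dose  = c(10, 5, 5),
                 group = c("treated", "control"))

grid
```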

tidyr::drop_na is a function that provides a way to remove rows that contain missing data. If one or more columns are specified, rows are only removed if the missing data is in those columns.

tidyr::replace_na is a function that lets you specify replacement values for your missing values. The values are specified using a named list, with the names corresponding to the columns you want to replace the data in.
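
Both functions can be sketched on a small invented tibble; the column names and the replacement values are assumptions made for the example.

```r
library(tidyr)

people <- tibble::tribble(
  ~name, ~height, ~colour,
  "a",    10,     "red",
  "b",    NA,     "blue",
  "c",    12,     NA
)

complete_rows <- drop_na(people)          # drops any row containing an NA
has_height    <- drop_na(people, height)  # only checks the height column

# named list: one replacement value per column
filled <- replace_na(people, list(height = 0, colour = "unknown"))
```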

Exercise

Create a data.frame/tibble from the starwars dataset which shows one character/film combo per row for all the films and characters.

# starwars dataset
dplyr::starwars

stringr

stringr::str_detect checks whether a pattern (a regular expression or plain text) occurs in each value of a column and returns a logical vector. This makes it very useful as part of a dplyr::filter.
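
A brief sketch with a made-up column of fruit names, first with plain text and then with a regular expression:

```r
library(dplyr)
library(stringr)

fruit_tbl <- tibble(name = c("apple", "pineapple", "banana", "grape"))

# plain text: keep names containing "apple"
apples <- fruit_tbl %>% filter(str_detect(name, "apple"))

# regular expression: keep names starting with a or b
starts_ab <- fruit_tbl %>% filter(str_detect(name, "^[ab]"))
```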

stringr::str_remove will remove the first instance of the text that matches the pattern from the values in your column; stringr::str_remove_all will remove all instances.

stringr::str_extract does the opposite of stringr::str_remove, in that it returns the first instance of the text that matches the pattern; stringr::str_extract_all will return all of the matches.
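
The four functions side by side, as a sketch on two invented sample labels:

```r
library(stringr)

labels <- c("sample_01_rep_2", "sample_12_rep_10")

first_removed <- str_remove(labels, "_")            # only the first underscore goes
all_removed   <- str_remove_all(labels, "_")        # every underscore goes
first_number  <- str_extract(labels, "[0-9]+")      # first run of digits, as a vector
all_numbers   <- str_extract_all(labels, "[0-9]+")  # every run of digits, as a list
```

str_extract_all returns a list because each value can have a different number of matches.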

Exercise

From the dplyr::starwars data, find all of the characters that have ‘grey’ as part of their skin colour description. How many different descriptions are there that contain ‘grey’?

readr

readr::parse_number is an extremely useful function to know about if you are reading data into R that you know is numerical in nature but might contain extra characters such as units.

text_to_parse <- c(" 0.4m", "-6", "a5", "1E-2", "24%", "3e2")

readr::parse_number(text_to_parse)
[1]   0.40  -6.00   5.00   0.01  24.00 300.00

ggplot2

The majority of ggplot2 functions are specific to the type of visualisation that is being created and are well covered by the cheatsheet. There is one, however, that can be very helpful to know about.

ggplot2::theme_set allows you to apply a theme to all of your ggplots. It would usually be called near the start of a script.

# Set the theme to theme_bw for all following plots
ggplot2::theme_set(ggplot2::theme_bw())

forcats

forcats::fct_reorder lets you reorder your factor levels by sorting against another variable. forcats::fct_reorder2 does the same but sorts against two variables. By default, fct_reorder summarises the other variable with the median and fct_reorder2 uses last2; a different function, such as min, can be supplied instead.

df <- tibble::tribble(
  ~color,     ~a, ~b,
  "blue",      1,  2,
  "green",     6,  2,
  "purple",    3,  3,
  "red",       2,  3,
  "yellow",    5,  1
)
df$color <- factor(df$color)
fct_reorder(df$color, df$a, min)
[1] blue   green  purple red    yellow
Levels: blue red purple yellow green
fct_reorder2(df$color, df$a, df$b)
[1] blue   green  purple red    yellow
Levels: purple red blue green yellow

forcats::fct_infreq will reorder the factor levels based on the frequency of the observations (highest first), and forcats::fct_rev will reverse that order (lowest first).

forcats::fct_relevel lets you manually reorder the levels in your factor. This is useful when you want to move a particular level, such as a level representing missing data, to the end.
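
A minimal sketch of frequency ordering with forcats::fct_infreq and forcats::fct_rev, on a made-up factor:

```r
library(forcats)

colours <- factor(c("blue", "blue", "blue", "purple", "purple", "green"))

levels(colours)                       # alphabetical: blue, green, purple

by_freq  <- fct_infreq(colours)       # most common level first
reversed <- fct_rev(by_freq)          # least common level first

levels(by_freq)
levels(reversed)
```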

my_fct <- factor(c("blue","green","blue", "none", "blue", "purple", "purple"))
my_fct
[1] blue   green  blue   none   blue   purple purple
Levels: blue green none purple
fct_relevel(my_fct, "none")
[1] blue   green  blue   none   blue   purple purple
Levels: none blue green purple
# send to the end
fct_relevel(my_fct, "none", after = Inf)
[1] blue   green  blue   none   blue   purple purple
Levels: blue green purple none

Exercise

Alter the following code, which creates a bar plot, to use forcats to re-order the bars so that the species of penguins are ordered by frequency, lowest on the left.

library(tidyverse)
library(palmerpenguins)

penguins %>% 
  ggplot(aes(x = species)) + geom_bar()

purrr

We’ll cover purrr in Functional Programming

Extra Tidyverse

library(googlesheets4)

# grab the url from the browser for your sheet (specific to a sheet)
url <- "https://docs.google.com/spreadsheets/d/1MbE2_XUfQ9KwfKAJhEDPb6KgOg2EaoXr5IN2F-hjBNI/edit#gid=0"

# read the sheet in
my_google_sheet <- read_sheet(url)

The first time you run read_sheet you will be asked to authenticate, and a web browser will open up.

Functional Programming

Map

The purrr package (part of the tidyverse) provides a collection of map functions, which iterate over a collection of things and apply a function to each one. This is known as functional programming: it lets us extract the code that is common to every loop into a function, so rather than being concerned about the set-up of the loop, we can focus on its contents. This idea of mapping a function onto data is extremely similar to the concept underlying the for loop.
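
To make the comparison with the for loop concrete, here is a small sketch that computes the same result both ways; the input values are arbitrary.

```r
library(purrr)

values <- c(1, 4, 9)

# for loop: we manage the output vector and the index ourselves
roots_loop <- vector("double", length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# map: the iteration is handled for us, we only supply the function
roots_map <- map_dbl(values, sqrt)

identical(roots_loop, roots_map)
```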

Map and friends

The package purrr within the tidyverse provides the map functions, which take a vector or list as the first argument and, as the second argument, the function to be run on each item in the vector or list. The object that is returned with the results depends on the exact version of map that is called. The default map() returns the results as items in a list, but there are suffix versions of map that return the results in a specified data type.

These suffix versions will give an error if the data type of the results doesn’t match. This is useful to program with, as it means that you can be sure that you have a particular data type for future code. Some of the base R functions that you will meet in the next section don’t provide this guarantee. The arguments to the map functions are .x, which is the vector or list input, and .f, which is the name of the function. If the supplied function takes multiple arguments, these can be passed in as extra arguments to map.

We can use the example of converting some temperatures to demonstrate:

library(purrr)

# Convert a temperature from degrees Fahrenheit to degrees Celsius
farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# gives back a list
my_temps_c_list <- map(.x = my_temps_f, .f = farenheit_to_celcius)
my_temps_c_list
[[1]]
[1] 32.22222

[[2]]
[1] 25.55556

[[3]]
[1] 31.11111

[[4]]
[1] 31.66667

[[5]]
[1] 25
# gives back a vector of type numeric/double
my_temps_c_dbl <- map_dbl(.x = my_temps_f, .f = farenheit_to_celcius)
my_temps_c_dbl
[1] 32.22222 25.55556 31.11111 31.66667 25.00000
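
The other suffix versions work the same way. The sketch below, on a made-up character vector, shows map_int, map_chr and map_lgl, the error raised when the result type doesn’t match, and extra arguments being passed through to the function.

```r
library(purrr)

words <- c("tidyverse", "purrr", "map")

lengths_int <- map_int(words, nchar)       # integer vector
upper_chr   <- map_chr(words, toupper)     # character vector
is_long     <- map_lgl(words, ~ nchar(.x) > 4)  # logical vector

# asking for doubles when the function returns text is an error
wrong_type <- tryCatch(map_dbl(words, toupper), error = function(e) e)
inherits(wrong_type, "error")

# extra arguments after .f are passed on to the function (here na.rm = TRUE)
means <- map_dbl(list(c(1, NA, 3), c(4, 5)), mean, na.rm = TRUE)
```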

When using map and its variants, don’t include the () after the function name. If you do, you’ll get this error:

map(.x = my_temps_f, .f = farenheit_to_celcius())
Error in farenheit_to_celcius(): argument "temp_f" is missing, with no default

Using a formula

You can create anonymous functions for purrr to use.

For instance, what if you had a list containing multiple weeks of temperatures and you wanted to find the mean temperature per week?

monthly_temps <- list(week1 = c(20,21,23,NA), 
                      week2 = c(15,16,14,17,20), 
                      week3 = c(17,15,NA,17,18,14))

map(monthly_temps, mean)
$week1
[1] NA

$week2
[1] 16.4

$week3
[1] NA

Because we have NAs, mean returns NA for those weeks. We could define a new version of mean that removes NAs:

mean_na_rm <- function(x){
  mean(x, na.rm = TRUE)
}

map(monthly_temps, mean_na_rm)
$week1
[1] 21.33333

$week2
[1] 16.4

$week3
[1] 16.2

But we can also use the formula syntax to create an anonymous function with ~. In the same way that . stands for the data in a pipe, .x stands for the current item inside our function.

map(monthly_temps, ~mean(.x, na.rm = TRUE))
$week1
[1] 21.33333

$week2
[1] 16.4

$week3
[1] 16.2
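
The formula is shorthand for an ordinary anonymous function, and it works with the suffix versions of map too. A short sketch, repeating the monthly_temps list so that it runs on its own:

```r
library(purrr)

monthly_temps <- list(week1 = c(20, 21, 23, NA),
                      week2 = c(15, 16, 14, 17, 20),
                      week3 = c(17, 15, NA, 17, 18, 14))

# formula shorthand, returning a named numeric vector instead of a list
weekly_means <- map_dbl(monthly_temps, ~ mean(.x, na.rm = TRUE))

# the same thing written as a regular anonymous function
weekly_means_fn <- map_dbl(monthly_temps, function(x) mean(x, na.rm = TRUE))

identical(weekly_means, weekly_means_fn)
```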