Take the time to explore the lesser known functions and packages of the Tidyverse
Time: 90 min
Description: Take the time to explore the lesser-known functions and packages of the Tidyverse. Learn how to access Google Sheets, work with dates and times, manipulate text, use functional programming to replace loops, and more.
Learning objectives
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. - https://www.tidyverse.org
Two excellent resources:
- R for Data Science
- Tidyverse Cookbook
── Attaching packages ───────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.1.2 ✓ dplyr 1.0.6
✓ tidyr 1.1.3 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.1
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
tibble::tribble is a useful function if you want to manually input a small amount of data.
tibble::tribble( ~column1, ~column2, ~column3,
"a", 1, TRUE,
"b", 2, FALSE)
# A tibble: 2 x 3
column1 column2 column3
<chr> <dbl> <lgl>
1 a 1 TRUE
2 b 2 FALSE
tibble::glimpse is similar to str but can be embedded in pipelines, as it invisibly returns the original data.
tibble::glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 1…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7,…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190,…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00,…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2…
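Because glimpse returns its input, it can sit in the middle of a pipeline. The following is a minimal sketch, assuming dplyr is installed; the object name heavy_cars and the weight cut-off are illustrative choices, not part of the original material.

```r
library(dplyr)

# glimpse() prints its summary, then hands mtcars on unchanged to filter()
heavy_cars <- mtcars %>%
  tibble::glimpse() %>%
  filter(wt > 5)

nrow(heavy_cars)
```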
Use tribble() to manually input some data.
dplyr::rename provides the renaming ability of dplyr::select but it doesn’t subset.
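The difference is easiest to see side by side. This is a small sketch on a made-up tibble (the column names a, b, c and the new name id are hypothetical).

```r
library(dplyr)

df <- tibble::tibble(a = 1:2, b = c("x", "y"), c = c(TRUE, FALSE))

# rename() keeps every column and changes only the name given
renamed <- rename(df, id = a)

# select() accepts the same new = old syntax but drops unmentioned columns
selected <- select(df, id = a)

names(renamed)   # "id" "b" "c"
names(selected)  # "id"
```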
dplyr::transmute is similar to dplyr::mutate except only the columns specified are kept.
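A short comparison may help here. The tibble and the conversion factor below are illustrative assumptions only.

```r
library(dplyr)

cars <- tibble::tibble(model = c("A", "B"), mpg = c(21, 30))

# mutate() adds kpl and keeps model and mpg
mutated <- mutate(cars, kpl = mpg * 0.425)

# transmute() keeps only the columns named in the call
transmuted <- transmute(cars, model, kpl = mpg * 0.425)

names(mutated)     # "model" "mpg" "kpl"
names(transmuted)  # "model" "kpl"
```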
dplyr::everything / dplyr::starts_with / dplyr::contains / dplyr::ends_with are useful helper functions that make it easy to select columns. These functions technically live in the tidyselect package, along with tidyselect::where. tidyselect::where takes a predicate function such as is.numeric and selects the columns for which it returns TRUE.
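As a sketch of how these helpers read in practice, the calls below select from dplyr::starwars. The object names are illustrative, and the column names in the comments assume the version of starwars shipped with dplyr 1.0 or later.

```r
library(dplyr)

# columns whose names end in "color", plus name
colour_cols <- select(starwars, name, ends_with("color"))

# columns for which is.numeric() returns TRUE
numeric_cols <- select(starwars, where(is.numeric))

# move species to the front, then keep everything else in its original order
species_first <- select(starwars, species, everything())

names(colour_cols)   # "name" "hair_color" "skin_color" "eye_color"
names(numeric_cols)  # "height" "mass" "birth_year"
```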
dplyr::relocate is a very useful function that enables you to move columns to specific locations. It uses a .before or .after argument to specify where you want to move the column(s) to.
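A minimal sketch of both arguments, using a made-up one-row tibble:

```r
library(dplyr)

df <- tibble::tibble(a = 1, b = 2, c = 3, d = 4)

# move d so that it sits before a
d_first <- relocate(df, d, .before = a)

# move a so that it sits after c
a_later <- relocate(df, a, .after = c)

names(d_first)  # "d" "a" "b" "c"
names(a_later)  # "b" "c" "a" "d"
```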
dplyr::across lets you apply a function or functions across multiple columns. For applying a function based on data type, pair it with tidyselect::where. This is usually used as part of a dplyr::summarise or dplyr::mutate statement.
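The sketch below shows across inside both mutate and summarise. The tibble is invented for illustration; passing a named list of functions is one of several ways to supply more than one function.

```r
library(dplyr)

df <- tibble::tibble(
  name = c("a", "b"),
  x    = c(1.234, 5.678),
  y    = c(10.11, 20.22)
)

# apply round() to every numeric column, leaving name untouched
rounded <- mutate(df, across(where(is.numeric), round))

# apply two functions to x and y; results are named column_function
ranges <- summarise(df, across(c(x, y), list(min = min, max = max)))

names(ranges)  # "x_min" "x_max" "y_min" "y_max"
```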
dplyr::c_across lets you combine values from across multiple columns within a row, such as applying a summarising function to selected columns. It is designed to be used together with dplyr::rowwise.
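For example, a per-row mean over several columns could be sketched like this (the scores tibble and its column names are hypothetical):

```r
library(dplyr)

scores <- tibble::tibble(
  id = 1:2,
  t1 = c(1, 4),
  t2 = c(2, 5),
  t3 = c(3, 9)
)

# rowwise() makes each row its own group, so mean() sees one row at a time
row_means <- scores %>%
  rowwise() %>%
  mutate(avg = mean(c_across(t1:t3))) %>%
  ungroup()

row_means$avg  # 2 6
```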
dplyr::case_when is useful in situations where you would otherwise need multiple nested ifelse calls.
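A small sketch of the pattern: conditions are checked in order and the first match wins. The height bands below are arbitrary values chosen for illustration.

```r
library(dplyr)

heights <- tibble::tibble(height = c(66, 172, 230, NA))

sized <- mutate(heights, size = case_when(
  is.na(height) ~ "unknown",  # checked first so NA never reaches the comparisons
  height < 100  ~ "short",
  height < 200  ~ "medium",
  TRUE          ~ "tall"      # fallback for everything else
))

sized$size  # "short" "medium" "tall" "unknown"
```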
Using the dplyr::starwars dataset, calculate the mean across all numeric columns for each species.
tidyr::nest condenses the specified columns into a list column, leaving one row for each group of the remaining columns. tidyr::unnest expands the elements of a list column back into their own rows and columns. These functions are particularly useful in conjunction with pivot_longer / pivot_wider.
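A round trip through nest and unnest might look like the following sketch, which uses an invented three-row tibble.

```r
library(tidyr)

df <- tibble::tibble(group = c("a", "a", "b"), x = 1:3, y = c(10, 20, 30))

# x and y are packed into a list column called data, one row per group
nested <- nest(df, data = c(x, y))

# unnest() expands the list column back into ordinary rows and columns
unnested <- unnest(nested, data)

nrow(nested)     # 2
names(unnested)  # "group" "x" "y"
```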
tidyr::unite combines columns into one using a separator, and tidyr::separate splits one column into new columns based on a separator. tidyr::separate_rows also splits a column on a separator, but into separate rows, duplicating the rest of the row's data for each piece.
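The three functions can be sketched on small made-up tibbles (all column names and values here are hypothetical):

```r
library(tidyr)

dates <- tibble::tibble(id = 1:2, year = c(2020, 2021), month = c("01", "12"))

# combine year and month into a single column
united <- unite(dates, "ym", year, month, sep = "-")

# split it back into two (character) columns
separated <- separate(united, ym, into = c("year", "month"), sep = "-")

appearances <- tibble::tibble(name = c("Luke", "Leia"), films = c("IV,V", "IV"))

# one row per film, with name repeated as needed
long <- separate_rows(appearances, films, sep = ",")

united$ym   # "2020-01" "2021-12"
nrow(long)  # 3
```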
tidyr::crossing provides a mechanism to create all combinations by ‘crossing’ the values in two or more vectors. This is particularly useful when combined with purrr for running a function on all sub-groups within your data.
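As a quick sketch, crossing two short vectors gives one row per combination (the values are illustrative):

```r
library(tidyr)

grid <- crossing(
  species = c("Human", "Droid"),
  sex     = c("male", "female", "none")
)

nrow(grid)  # 6
```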
tidyr::drop_na is a function that removes rows that contain missing data. If one or more columns are specified, rows are only removed if the missing data is in those columns.
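A minimal sketch of both forms, on an invented tibble:

```r
library(tidyr)

df <- tibble::tibble(x = c(1, NA, 3), y = c("a", "b", NA))

# drop rows with a missing value in any column
complete_rows <- drop_na(df)

# drop rows only where x is missing
x_present <- drop_na(df, x)

nrow(complete_rows)  # 1
nrow(x_present)      # 2
```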
tidyr::replace_na is a function that lets you specify replacement values for your missing values. The values are specified using a named list, with the names corresponding to the columns you want to replace the data in.
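For instance, with a made-up tibble and arbitrary replacement values:

```r
library(tidyr)

df <- tibble::tibble(x = c(1, NA, 3), y = c("a", "b", NA))

# each list name is a column; each value replaces the NAs in that column
filled <- replace_na(df, list(x = 0, y = "unknown"))

filled$x  # 1 0 3
filled$y  # "a" "b" "unknown"
```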
Create a data.frame/tibble from the starwars dataset which shows one character/film combo per row for all the films and characters.
# starwars dataset
dplyr::starwars
stringr::str_detect checks whether a regular expression or plain text pattern matches each value in a column and returns a logical vector. This makes it very useful inside a dplyr::filter.
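A short sketch of str_detect inside filter, using a hypothetical tibble of names:

```r
library(dplyr)
library(stringr)

people <- tibble::tibble(
  name = c("Luke Skywalker", "Leia Organa", "Anakin Skywalker")
)

# plain text pattern
skywalkers <- filter(people, str_detect(name, "Skywalker"))

# regular expression: names that start with L
starts_l <- filter(people, str_detect(name, "^L"))

nrow(skywalkers)  # 2
nrow(starts_l)    # 2
```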
stringr::str_remove will remove the first instance of the text that matches the pattern from the values in your column; stringr::str_remove_all will remove all instances.
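The difference shows up as soon as a value contains the pattern more than once. A minimal sketch with made-up strings:

```r
library(stringr)

x <- c("a-b-c", "no dashes")

first_removed <- str_remove(x, "-")      # "ab-c" "no dashes"
all_removed   <- str_remove_all(x, "-")  # "abc"  "no dashes"
```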
stringr::str_extract does the opposite of stringr::str_remove, in that it returns the first instance of the text that matches the pattern; stringr::str_extract_all will return all instances of the matches.
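As a sketch, extracting digits from a few invented strings shows the two return shapes: a character vector for str_extract and a list for str_extract_all.

```r
library(stringr)

x <- c("R2-D2", "C-3PO", "Yoda")

# first match per string, NA where nothing matches
first_digit <- str_extract(x, "[0-9]")

# every match per string, returned as a list of character vectors
all_digits <- str_extract_all(x, "[0-9]")

first_digit      # "2" "3" NA
all_digits[[1]]  # "2" "2"
```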
From the dplyr::starwars data, find all of the characters that have ‘grey’ as part of their skin colour description. How many different descriptions are there that contain ‘grey’?
readr::read_csv_chunked
readr::parse_number
is an extremely useful function to know about if you are reading data into R that you know is numerical in nature but might contain extra characters such as units.
text_to_parse <- c(" 0.4m", "-6", "a5", "1E-2", "24%", "3e2")
readr::parse_number(text_to_parse)
[1] 0.40 -6.00 5.00 0.01 24.00 300.00
The majority of ggplot2 functions are specific to the type of visualisation that is being created and are well covered by the cheatsheet. There is one, however, that can be very helpful to know about.
ggplot2::theme_set allows you to apply a theme to all of your ggplots. It would usually be called near the start of a script.
# Set the theme to theme_bw for all following plots
ggplot2::theme_set(ggplot2::theme_bw())
forcats::fct_reorder lets you reorder your factor levels by sorting against another variable. forcats::fct_reorder2 is the same as fct_reorder but you can use two variables. By default, fct_reorder sorts on the median of the other variable.
df <- tibble::tribble(
~color, ~a, ~b,
"blue", 1, 2,
"green", 6, 2,
"purple", 3, 3,
"red", 2, 3,
"yellow", 5, 1
)
df$color <- factor(df$color)
fct_reorder(df$color, df$a, min)
[1] blue green purple red yellow
Levels: blue red purple yellow green
fct_reorder2(df$color, df$a, df$b)
[1] blue green purple red yellow
Levels: purple red blue green yellow
forcats::fct_infreq will reorder the factor levels based on the frequency of the observations (highest first), and forcats::fct_rev will reverse that order (lowest first).
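A minimal sketch on a made-up factor, where "b" appears three times, "c" twice and "a" once:

```r
library(forcats)

f <- factor(c("b", "a", "c", "b", "c", "b"))

# most frequent level first
by_freq <- fct_infreq(f)

# reversed: least frequent level first
by_freq_rev <- fct_rev(fct_infreq(f))

levels(by_freq)      # "b" "c" "a"
levels(by_freq_rev)  # "a" "c" "b"
```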
forcats::fct_relevel lets you manually reorder the levels in your factor. This is useful when you want to move a particular level, such as NA, to the end.
my_fct <- factor(c("blue", "green", "blue", "none", "blue", "purple", "purple"))
my_fct
[1] blue green blue none blue purple purple
Levels: blue green none purple
fct_relevel(my_fct, "none")
[1] blue green blue none blue purple purple
Levels: none blue green purple
# send to the end
fct_relevel(my_fct, "none", after = Inf)
[1] blue green blue none blue purple purple
Levels: blue green purple none
Alter the following code that creates a bar plot to use forcats to re-order the bars so that the species of penguins are ordered by frequency, lowest on the left.
library(tidyverse)
library(palmerpenguins)
penguins %>%
ggplot(aes(x = species)) + geom_bar()
We’ll cover purrr in Functional Programming.
library(googlesheets4)
# grab the url from the browser for your sheet (specific to a sheet)
url <- "https://docs.google.com/spreadsheets/d/1MbE2_XUfQ9KwfKAJhEDPb6KgOg2EaoXr5IN2F-hjBNI/edit#gid=0"
# read the sheet in
my_google_sheet <- read_sheet(url)
The first time you run read_sheet you will be asked to authenticate and a web browser will open up.
lubridate::ymd
magrittr::%$%
magrittr::%T>%
glue::glue
The purrr package (part of the tidyverse) provides a collection of map functions for iterating over a collection of things and applying a function to each one. This is known as functional programming. It lets us extract the code that is common to every loop into a function, so rather than being concerned with the set-up of the loop, we can focus on its contents. This idea of mapping a function onto data is extremely similar to the concept underlying the for loop.
The map functions in purrr take a vector or list as the first argument and, as the second argument, the function to be run on each item in that vector or list. The object that is returned depends on which version of map is called: the default map() returns the results as items in a list, but there are suffix versions of map that return the results in a specified data type.
- map() makes a list.
- map_lgl() makes a logical vector.
- map_int() makes an integer vector.
- map_dbl() makes a double vector.
- map_chr() makes a character vector.
These suffix versions will give an error if the data type of the results doesn’t match. This is useful for being able to program with, as it means that you can be sure that you have a particular data type for future code. Some of the base R functions that you will meet in the next section don’t provide this guarantee. The arguments to the map functions are .x, which is the vector or list input, and .f, which is the name of the function. If the supplied function takes multiple arguments, these can be passed in as extra arguments to map.
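The typed variants and the extra-argument mechanism can be sketched as follows. The inputs are invented for illustration, and the error is captured with tryCatch only so that the script keeps running.

```r
library(purrr)

words <- c("tidyverse", "purrr", "map")

# nchar() returns integers, so map_int() succeeds
word_lengths <- map_int(words, nchar)

# toupper() returns character strings, so map_chr() is the matching variant
shouted <- map_chr(words, toupper)

# asking for integers from a function that returns strings is an error
type_error <- tryCatch(map_int(words, toupper), error = function(e) e)

# extra arguments after .f are passed on to the function: round(x, digits = 1)
rounded <- map_dbl(c(1.234, 5.678), round, digits = 1)
```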
We could use the example of converting some temperatures to demonstrate:
library(purrr)
farenheit_to_celcius <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9
  return(temp_c)
}
my_temps_f <- c(90, 78, 88, 89, 77)
# gives back a list
my_temps_c_list <- map(.x = my_temps_f, .f = farenheit_to_celcius)
my_temps_c_list
[[1]]
[1] 32.22222
[[2]]
[1] 25.55556
[[3]]
[1] 31.11111
[[4]]
[1] 31.66667
[[5]]
[1] 25
# gives back a vector of type numeric/double
my_temps_c_dbl <- map_dbl(.x = my_temps_f, .f = farenheit_to_celcius)
my_temps_c_dbl
[1] 32.22222 25.55556 31.11111 31.66667 25.00000
When using map and its variants, don’t include the () after the function name; if you do, you’ll get this error:
map(.x = my_temps_f, .f = farenheit_to_celcius())
Error in farenheit_to_celcius(): argument "temp_f" is missing, with no default
You can create anonymous functions for purrr to use.
For instance, what if you had a list containing multiple weeks of temperatures and you wanted to find the mean temperature per week?
monthly_temps <- list(week1 = c(20,21,23,NA),
week2 = c(15,16,14,17,20),
week3 = c(17,15,NA,17,18,14))
map(monthly_temps, mean)
$week1
[1] NA
$week2
[1] 16.4
$week3
[1] NA
Because we have NAs, mean returns NA. We could define a new version of mean that removes NAs:
$week1
[1] 21.33333
$week2
[1] 16.4
$week3
[1] 16.2
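The code that produced the output above is not shown in the text. One way it could be written is sketched below; the helper name mean_no_na is hypothetical, and monthly_temps is repeated so the block runs on its own.

```r
library(purrr)

monthly_temps <- list(week1 = c(20, 21, 23, NA),
                      week2 = c(15, 16, 14, 17, 20),
                      week3 = c(17, 15, NA, 17, 18, 14))

# hypothetical helper: mean() with missing values dropped
mean_no_na <- function(x) {
  mean(x, na.rm = TRUE)
}

weekly_means <- map(monthly_temps, mean_no_na)
```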
But we can also use the formula syntax to create an anonymous function using the ~. In the same way that . stands for the data in pipes, we use .x to represent the data in our function.
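Applied to the weekly temperatures, the formula syntax might look like this sketch. monthly_temps is repeated so the block is self-contained, and the map_dbl variant is an optional addition that returns a named numeric vector instead of a list.

```r
library(purrr)

monthly_temps <- list(week1 = c(20, 21, 23, NA),
                      week2 = c(15, 16, 14, 17, 20),
                      week3 = c(17, 15, NA, 17, 18, 14))

# ~ starts the anonymous function; .x is the current element of the list
means_list <- map(monthly_temps, ~ mean(.x, na.rm = TRUE))

# same calculation, returned as a double vector
means_dbl <- map_dbl(monthly_temps, ~ mean(.x, na.rm = TRUE))
```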