In this tutorial we will learn how to use dplyr within functions. For this we will use movie data.
Learning Objectives: use dplyr within functions
Level: advanced
Data Preparation
The data we will analyse is part of ggplot2movies. We load the packages dplyr and ggplot2movies with pacman.
library(pacman)
p_load(dplyr, ggplot2movies)
data("movies")
Check with packageVersion() that the installed dplyr version is 0.7 or above.
packageVersion("dplyr")
## [1] '0.7.8'
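If you want the script to stop by itself when dplyr is too old, the check can also be written as a comparison. This is only a small sketch; the names min_version and dplyr_ok are made up for illustration.

```r
# Minimum dplyr version this tutorial relies on (name chosen for illustration)
min_version <- "0.7.0"

# packageVersion() returns a version object that can be compared with a string
dplyr_ok <- packageVersion("dplyr") >= min_version

# Abort with a readable message if the installed version is too old
if (!dplyr_ok) {
  stop("dplyr >= ", min_version, " is required for quo(), enquo() and !!")
}
```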
Only for the sake of a more readable table, some variables are left out and only the following ones are kept.
movies_filt <- movies %>%
select(title, rating, length, Action, Animation, Comedy, Drama, Documentary, Romance, Short)
knitr::kable(movies_filt %>% head(5))
title | rating | length | Action | Animation | Comedy | Drama | Documentary | Romance | Short |
---|---|---|---|---|---|---|---|---|---|
$ | 6.4 | 121 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
$1000 a Touchdown | 6.0 | 71 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
$21 a Day Once a Month | 8.2 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
$40,000 | 8.2 | 70 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
$50,000 Climax Show, The | 3.4 | 71 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Problem
Now assume you want to know the average length or rating for each of these 7 genres. That makes 14 combinations (7 genres times 2 statistics), and writing out each one means a lot of repeated code, which should be avoided. How can you avoid this? By using dplyr within a function.
Solution
The problem is that we want to pass a column name that should only be evaluated at a later step.
The first intuitive approach is this:
genre <- "Action"
movies_filt %>%
group_by(genre) %>%
summarise (rating = mean(rating))
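This didn’t work: dplyr looks for a column that is literally called genre instead of using the column Action. You have to make a quosure of the column name with quo() and unquote it with !! at the place where it is used. The example below filters on the genre instead of grouping by it.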
genre <- quo(Comedy)
movies_filt %>%
filter((!!genre) == 1) %>%
summarise (rating = mean(rating))
## # A tibble: 1 x 1
##   rating
##    <dbl>
## 1   5.96
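To see the same mechanism on data small enough to check by hand, here is a minimal sketch. The tibble toy and its values are invented for illustration and are not part of ggplot2movies.

```r
library(dplyr)

# Invented toy data: four movies with two genre flags
toy <- tibble(
  rating = c(6, 8, 4, 7),
  Comedy = c(1, 0, 1, 0),
  Drama  = c(0, 1, 1, 0)
)

# quo() captures the column name without evaluating it
genre <- quo(Drama)

# !! unquotes the quosure, so filter() sees the column Drama
res <- toy %>%
  filter((!!genre) == 1) %>%
  summarise(rating = mean(rating))

# Rows 2 and 3 are dramas, so the mean of 8 and 4 is returned
print(res)
```

Changing quo(Drama) to quo(Comedy) is enough to switch the genre, which is exactly what we want to do from inside a function.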
Now we use this in a function.
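# Note: despite its name, this function returns the median, not the mean.
genre_stat_mean <- function(df,
                            group_var = c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance", "Short"),
                            stat_var = c("rating", "length")) {
  # Capture the column names passed by the caller as quosures
  group_var <- enquo(group_var)
  stat_var <- enquo(stat_var)
  # Build the name of the result column, e.g. "median_rating"
  col_name <- paste0("median_",
                     as.character(stat_var)[2])
  df %>%
    # Keep only the movies that belong to the genre
    filter((!!group_var) == 1) %>%
    # Unquote the name on the left and the column on the right
    summarise(!!col_name := median((!!stat_var)))
}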
How does it work:
- The parameters are a data frame df, a genre column group_var, and the column stat_var for which the median is calculated, which can be either rating or length. The default vectors only document the expected values; the function does not check them.
- group_var and stat_var are transformed into quosures with enquo(). Did you notice that we use enquo() instead of quo()? If you want to make a quosure of a function argument, you have to use enquo() and not quo()!
- col_name is the name of the column that is returned, e.g. “median_rating”. Important: converting stat_var to a character returns two values, of which the second one holds the column name, which is why it has to be selected with [2].
- Now we take the data frame df and pipe it to the filter() function. Here we keep the rows where the genre column equals 1. Important: put parentheses around !!group_var so that !! is applied to group_var only and not to the whole comparison!
- Within the summarise() call, the column name of the returned value is defined first. Important: the assignment is not made with “=”, but with “:=”, because R does not allow an unquoted expression such as !!col_name on the left-hand side of “=”!
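The steps above can be retraced on a small example. The following is only a sketch: the data toy and the helpers f_quo, f_enquo and toy_stat_median are made up for illustration, and rlang::as_label() is swapped in for as.character()[2], which assumes a reasonably recent rlang version.

```r
library(dplyr)

# Invented toy data with one genre flag
toy <- tibble(
  rating = c(6, 8, 4, 7),
  length = c(90, 120, 100, 80),
  Comedy = c(1, 0, 1, 0)
)

# quo() inside a function captures the argument name itself,
# enquo() captures what the caller passed in
f_quo   <- function(x) rlang::as_label(quo(x))
f_enquo <- function(x) rlang::as_label(enquo(x))
label_quo   <- f_quo(Comedy)    # "x"
label_enquo <- f_enquo(Comedy)  # "Comedy"

# Same structure as the tutorial function, on the toy data
toy_stat_median <- function(df, group_var, stat_var) {
  group_var <- enquo(group_var)
  stat_var  <- enquo(stat_var)
  # as_label() turns the captured column into a string (assumed alternative)
  col_name <- paste0("median_", rlang::as_label(stat_var))
  df %>%
    filter((!!group_var) == 1) %>%
    # := is needed because the left-hand side is unquoted
    summarise(!!col_name := median(!!stat_var))
}

res <- toy_stat_median(toy, Comedy, rating)
print(res)
```

The two labels show why quo() is not enough inside a function: it only ever sees x, while enquo() sees the column the caller asked for.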
Now we can put the function to the test. We calculate the median rating of animated films and the median length of comedies.
genre_stat_mean(df = movies,
group_var = Animation,
stat_var = rating)
## # A tibble: 1 x 1
##   median_rating
##           <dbl>
## 1           6.7
genre_stat_mean(df = movies,
group_var = Comedy,
stat_var = length)
## # A tibble: 1 x 1
##   median_length
##           <int>
## 1            89
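To cover all genre and statistic combinations in one go, the column names have to come from a vector of strings. One possible way, sketched here on invented toy data, is dplyr's .data pronoun, which accepts strings directly. This is a different technique from the quosure approach above, and the names toy, median_by_name and all_stats are hypothetical.

```r
library(dplyr)

# Invented toy data with two genre flags and two statistics
toy <- tibble(
  rating = c(6, 8, 4, 7),
  length = c(90, 120, 100, 80),
  Comedy = c(1, 0, 1, 0),
  Drama  = c(0, 1, 1, 0)
)

# Column names arrive as strings and are looked up via .data[[...]]
median_by_name <- function(df, genre_name, stat_name) {
  med <- df %>%
    filter(.data[[genre_name]] == 1) %>%
    summarise(med = median(.data[[stat_name]])) %>%
    pull(med)
  tibble(genre = genre_name, stat = stat_name, med = med)
}

# All combinations of genre and statistic
combos <- expand.grid(
  genre = c("Comedy", "Drama"),
  stat = c("rating", "length"),
  stringsAsFactors = FALSE
)

# One call per combination, stacked into a single table
all_stats <- bind_rows(
  Map(function(g, s) median_by_name(toy, g, s), combos$genre, combos$stat)
)
print(all_stats)
```

With the real data, the two vectors passed to expand.grid() would hold the 7 genres and the 2 statistics, giving the 14 rows mentioned in the problem statement.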
Summary
We learned to use dplyr within functions. Now we know when to use quo() and when to use enquo(), as well as when to use “:=” instead of “=”.
This knowledge will enable you to write more powerful code that is less error-prone, because you no longer have to repeat the same code many times.