Most dplyr verbs use tidy evaluation in some way. Tidy evaluation is a special type of non-standard evaluation used throughout the tidyverse. There are two basic forms found in dplyr:
arrange(), count(), filter(), group_by(), mutate(), and summarise() use data masking so that you can use data variables as if they were variables in the environment (i.e. you write my_variable not df$my_variable).
across(), relocate(), rename(), select(), and pull() use tidy selection so you can easily choose variables based on their position, name, or type (e.g. starts_with("x") or is.numeric).
To determine whether a function argument uses data masking or tidy selection, look at the documentation: in the arguments list, you’ll see <data-masking> or <tidy-select>.
Data masking and tidy selection make interactive data exploration fast and fluid, but they add some new challenges when you attempt to use them indirectly such as in a for loop or a function. This vignette shows you how to overcome those challenges. We’ll first go over the basics of data masking and tidy selection, talk about how to use them indirectly, and then show you a number of recipes to solve common problems.
This vignette will give you the minimum knowledge you need to be an effective programmer with tidy evaluation. If you’d like to learn more about the underlying theory, or precisely how it’s different from non-standard evaluation, we recommend that you read the Metaprogramming chapters in Advanced R.
library(dplyr)
Data masking makes data manipulation faster because it requires less typing. In most (but not all [1]) base R functions you need to refer to variables with $, leading to code that repeats the name of the data frame many times:
starwars[starwars$homeworld == "Naboo" & starwars$species == "Human", ]
The dplyr equivalent of this code is more concise because data masking means you only need to type starwars once:
starwars %>% filter(homeworld == "Naboo", species == "Human")
The key idea behind data masking is that it blurs the line between the two different meanings of the word “variable”:
env-variables are “programming” variables that live in an environment. They are usually created with <-.
data-variables are “statistical” variables that live in a data frame. They usually come from data files (e.g. .csv, .xls), or are created by manipulating existing variables.
To make those definitions a little more concrete, take this piece of code:
df <- data.frame(x = runif(3), y = runif(3))
df$x
#> [1] 0.08075014 0.83433304 0.60076089
It creates an env-variable, df, that contains two data-variables, x and y. Then it extracts the data-variable x out of the env-variable df using $.
I think this blurring of the meaning of “variable” is a really nice feature for interactive data analysis because it allows you to refer to data-vars as is, without any prefix. And this seems to be fairly intuitive since many newer R users will attempt to write diamonds[x == 0 | y == 0, ].
Unfortunately, this benefit does not come for free. When you start to program with these tools, you’re going to have to grapple with the distinction. This will be hard because you’ve never had to think about it before, so it’ll take a while for your brain to learn these new concepts and categories. However, once you’ve teased apart the idea of “variable” into data-variable and env-variable, I think you’ll find it fairly straightforward to use.
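To make the distinction concrete, here is a small sketch of what happens when a data-variable and an env-variable share a name. The data frame df and the threshold x are made up for illustration, and the .env pronoun (provided by rlang and available inside dplyr verbs) is not otherwise covered in this vignette:

```r
library(dplyr)

# Hypothetical data: a data-variable x living inside df
df <- data.frame(x = c(1, 5, 10))
# An env-variable that happens to have the same name
x <- 4

# Inside filter(), a bare x refers to the data-variable, so the column is
# compared with itself and no rows are kept
masked <- df %>% filter(x > x)

# The pronouns spell out which meaning of "variable" is intended
explicit <- df %>% filter(.data$x > .env$x)
```

Here masked has no rows, while explicit keeps the rows where the column x exceeds the env-variable x.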
The main challenge of programming with functions that use data masking arises when you introduce some indirection, i.e. when you want to get the data-variable from an env-variable instead of directly typing the data-variable’s name. There are two main cases:
When you have the data-variable in a function argument (i.e. an env-variable that holds a promise [2]), you need to embrace the argument by surrounding it in doubled braces, like filter(df, {{ var }}).
The following function uses embracing to create a wrapper around summarise() that computes the minimum and maximum values of a variable, as well as the number of observations that were summarised:
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)
When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])).
The following example uses .data to count the number of unique values in each variable of mtcars:
for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}
Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x, or indirectly, with .data[[var]]. Don’t expect other functions to work with it.
Data masking makes it easy to compute on values within a dataset. Tidy selection is a complementary tool that makes it easy to work with the columns of a dataset.
Underneath all functions that use tidy selection is the tidyselect package. It provides a miniature domain specific language that makes it easy to select columns by name, position, or type. For example:
select(df, 1) selects the first column; select(df, last_col()) selects the last column.
select(df, c(a, b, c)) selects columns a, b, and c.
select(df, starts_with("a")) selects all columns whose name starts with “a”; select(df, ends_with("z")) selects all columns whose name ends with “z”.
select(df, where(is.numeric)) selects all numeric columns.
You can see more details in ?dplyr_tidy_select.
As with data masking, tidy selection makes a common task easier at the cost of making a less common task harder. When you want to use tidy select indirectly with the column specification stored in an intermediate variable, you’ll need to learn some new tools. Again, there are two forms of indirection:
When you have the data-variable in an env-variable that is a function argument, you use the same technique as data masking: you embrace the argument by surrounding it in doubled braces.
The following function summarises a data frame by computing the mean of all variables selected by the user:
summarise_mean <- function(data, vars) {
  data %>% summarise(n = n(), across({{ vars }}, mean))
}
mtcars %>%
  group_by(cyl) %>%
  summarise_mean(where(is.numeric))
When you have an env-variable that is a character vector, you need to use all_of() or any_of() depending on whether you want the function to error if a variable is not found.
The following code uses all_of() to select all of the variables found in a character vector; then ! plus all_of() to select all of the variables not found in a character vector:
vars <- c("mpg", "vs")
mtcars %>% select(all_of(vars))
mtcars %>% select(!all_of(vars))
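The difference between all_of() and any_of() only shows up when a name is missing. The following sketch uses a made-up vector containing one name that does not exist in mtcars ("not_a_column" is purely illustrative) and captures the error with tryCatch() so that the script keeps running:

```r
library(dplyr)

# Hypothetical vector: the second name is not a column of mtcars
vars <- c("mpg", "not_a_column")

# any_of() silently skips names that are not found
loose <- mtcars %>% select(any_of(vars))

# all_of() is strict: a missing name is an error, captured here with tryCatch()
strict <- tryCatch(
  mtcars %>% select(all_of(vars)),
  error = function(e) "error"
)
```

In this sketch loose contains only the mpg column, while strict holds the string "error" because the selection failed.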
The following examples solve a grab bag of common problems. We show you the minimum amount of code so that you can get the basic idea; most real problems will require more code or combining multiple techniques.
If you check the documentation, you’ll see that the .data argument, which holds the data frame itself, never uses data masking or tidy select. That means you don’t need to do anything special in your function:
mutate_y <- function(data) {
  mutate(data, y = a + x)
}
If you’re writing a package and you have a function that uses data-variables:
my_summary_function <- function(data) {
  data %>%
    filter(x > 0) %>%
    group_by(grp) %>%
    summarise(y = mean(y), n = n())
}
You’ll get an R CMD check NOTE:
N checking R code for possible problems
my_summary_function: no visible binding for global variable ‘x’, ‘grp’, ‘y’
Undefined global functions or variables:
x grp y
You can eliminate this by using .data$var and importing .data from its source in the rlang package (the underlying package that implements tidy evaluation):
#' @importFrom rlang .data
my_summary_function <- function(data) {
  data %>%
    filter(.data$x > 0) %>%
    group_by(.data$grp) %>%
    summarise(y = mean(.data$y), n = n())
}
If you want the user to supply an expression that’s passed onto an argument which uses data masking or tidy select, embrace the argument:
my_summarise <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean(mass))
}
This generalises in a straightforward way if you want to use one user-supplied expression in multiple places:
my_summarise2 <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}
If you want the user to provide multiple expressions, embrace each of them:
my_summarise3 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(mean = mean({{ mean_var }}), sd = sd({{ sd_var }}))
}
If you want to use the names of variables in the output, you can use glue syntax in conjunction with :=:
my_summarise4 <- function(data, expr) {
  data %>% summarise(
    "mean_{{expr}}" := mean({{ expr }}),
    "sum_{{expr}}" := sum({{ expr }}),
    "n_{{expr}}" := n()
  )
}
my_summarise5 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}),
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}
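To see what the glue syntax does to the output names, here is a short usage sketch. It repeats the definition of my_summarise5() so that it runs on its own; the choice of mtcars, mpg, and disp is arbitrary and only for illustration:

```r
library(dplyr)

my_summarise5 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}),
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}

# The embraced arguments are spliced into the column names
out <- my_summarise5(mtcars, mpg, disp)
names(out)
#> [1] "mean_mpg" "sd_disp"
```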
If you want to take an arbitrary number of user supplied expressions, use .... This is most often useful when you want to give the user full control over a single part of the pipeline, like a group_by() or a mutate().
my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE), height = mean(height, na.rm = TRUE))
}
starwars %>% my_summarise(homeworld)
#> # A tibble: 49 × 3
#> homeworld mass height
#> <chr> <dbl> <dbl>
#> 1 Alderaan 64 176.
#> 2 Aleen Minor 15 79
#> 3 Bespin 79 175
#> 4 Bestine IV 110 180
#> # … with 45 more rows
starwars %>% my_summarise(sex, gender)
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 4
#> # Groups: sex [5]
#> sex gender mass height
#> <chr> <chr> <dbl> <dbl>
#> 1 female feminine 54.7 169.
#> 2 hermaphroditic masculine 1358 175
#> 3 male masculine 81.0 179.
#> 4 none feminine NaN 96
#> # … with 2 more rows
When you use ... in this way, make sure that any other arguments start with . to reduce the chances of argument clashes; see https://design.tidyverse.org/dots-prefix.html for more details.
If you want the user to provide a set of data-variables that are then transformed, use across():
my_summarise <- function(data, summary_vars) {
  data %>%
    summarise(across({{ summary_vars }}, ~ mean(., na.rm = TRUE)))
}
starwars %>%
  group_by(species) %>%
  my_summarise(c(mass, height))
#> # A tibble: 38 × 3
#> species mass height
#> <chr> <dbl> <dbl>
#> 1 Aleena 15 79
#> 2 Besalisk 102 198
#> 3 Cerean 82 198
#> 4 Chagrian NaN 196
#> # … with 34 more rows
You can use this same idea for multiple sets of input data-variables:
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean))
}
Use the .names argument to across() to control the names of the output.
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean, .names = "mean_{.col}"))
}
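As a quick check of how .names shapes the result, the following sketch calls this function on mtcars. The definition is repeated so the example is self-contained, and the grouping and summary columns (cyl, mpg, disp) are arbitrary choices for illustration:

```r
library(dplyr)

my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean, .names = "mean_{.col}"))
}

# One row per value of cyl; each summarised column gets a "mean_" prefix
out <- mtcars %>% my_summarise(cyl, c(mpg, disp))
names(out)
#> [1] "cyl"       "mean_mpg"  "mean_disp"
```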
If you have a character vector of variable names, and want to operate on them with a for loop, index into the special .data pronoun:
for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}
This same technique works with for loop alternatives like the base R apply() family and the purrr map() family:
mtcars %>%
  names() %>%
  purrr::map(~ count(mtcars, .data[[.x]]))
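For the base R side of that claim, a minimal sketch with lapply() might look like the following. The name counts is made up for illustration, and naming the list elements is a convenience rather than something the technique requires:

```r
library(dplyr)

# One count() table per column, indexed through the .data pronoun
counts <- lapply(names(mtcars), function(var) count(mtcars, .data[[var]]))

# Label each table with the variable it was computed from
names(counts) <- names(mtcars)

# Inspect a single result, e.g. the table for cyl
counts[["cyl"]]
```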
Many Shiny input controls return character vectors, so you can use the same approach as above: .data[[input$var]].
library(shiny)
ui <- fluidPage(
  selectInput("var", "Variable", choices = names(diamonds)),
  tableOutput("output")
)
server <- function(input, output, session) {
  data <- reactive(filter(diamonds, .data[[input$var]] > 0))
  output$output <- renderTable(head(data()))
}
See https://mastering-shiny.org/action-tidy.html for more details and case studies.
[1] dplyr’s filter() is inspired by base R’s subset(). subset() provides data masking, but not with tidy evaluation, so the techniques described in this chapter don’t apply to it.
[2] In R, arguments are lazily evaluated, which means that until you attempt to use them, they don’t hold a value, just a promise that describes how to compute the value. You can learn more at https://adv-r.hadley.nz/functions.html#lazy-evaluation