recipes
can assign one or more roles to each column in
the data. The roles are not restricted to a predefined set; they can be
anything. For most conventional situations, they are typically
“predictor” and/or “outcome”. Additional roles enable targeted step
operations on specific variables or groups of variables.
When a recipe is created using the formula interface, this defines
the roles for all columns of the data set. summary()
can be
used to view a tibble containing information regarding the roles.
library(recipes)
recipe(Species ~ ., data = iris) %>% summary()
#> # A tibble: 6 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width numeric predictor original
#> 5 original nominal predictor original
#> 6 Species nominal outcome original
recipe( ~ Species, data = iris) %>% summary()
#> # A tibble: 1 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 Species nominal predictor original
recipe(Sepal.Length + Sepal.Width ~ ., data = iris) %>% summary()
#> # A tibble: 6 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 Petal.Length numeric predictor original
#> 2 Petal.Width numeric predictor original
#> 3 Species nominal predictor original
#> 4 original nominal predictor original
#> 5 Sepal.Length numeric outcome original
#> 6 Sepal.Width numeric outcome original
These roles can be updated despite this initial assignment.
update_role()
can modify a single existing role:
library(modeldata)
data(biomass)
recipe(HHV ~ ., data = biomass) %>%
update_role(dataset, new_role = "dataset split variable") %>%
update_role(sample, new_role = "sample ID") %>%
summary()
#> # A tibble: 8 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal sample ID original
#> 2 dataset nominal dataset split variable original
#> 3 carbon numeric predictor original
#> 4 hydrogen numeric predictor original
#> 5 oxygen numeric predictor original
#> 6 nitrogen numeric predictor original
#> 7 sulfur numeric predictor original
#> 8 HHV numeric outcome original
When you want to get rid of a role for a column, use
remove_role()
.
recipe(HHV ~ ., data = biomass) %>%
remove_role(sample, old_role = "predictor") %>%
summary()
#> # A tibble: 8 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal <NA> original
#> 2 dataset nominal predictor original
#> 3 carbon numeric predictor original
#> 4 hydrogen numeric predictor original
#> 5 oxygen numeric predictor original
#> 6 nitrogen numeric predictor original
#> 7 sulfur numeric predictor original
#> 8 HHV numeric outcome original
It represents the lack of a role as NA
, which means that
the variable is used in the recipe, but does not yet have a declared
role. Setting the role manually to NA
is not allowed:
recipe(HHV ~ ., data = biomass) %>%
update_role(sample, new_role = NA_character_)
#> Error in `single_chr()`:
#> ! `new_role` must not be `NA`.
When there are cases when a column will be used in more than one
context, add_role()
can create additional roles:
<- recipe(HHV ~ ., data = biomass) %>%
multi_role update_role(dataset, new_role = "dataset split variable") %>%
update_role(sample, new_role = "sample ID") %>%
# Roles below from https://wordcounter.net/random-word-generator
add_role(sample, new_role = "jellyfish")
%>%
multi_role summary()
#> # A tibble: 9 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal sample ID original
#> 2 sample nominal jellyfish original
#> 3 dataset nominal dataset split variable original
#> 4 carbon numeric predictor original
#> 5 hydrogen numeric predictor original
#> 6 oxygen numeric predictor original
#> 7 nitrogen numeric predictor original
#> 8 sulfur numeric predictor original
#> 9 HHV numeric outcome original
If a variable has multiple existing roles and you want to update one
of them, the additional old_role
argument to
update_role()
must be used to resolve any ambiguity.
%>%
multi_role update_role(sample, new_role = "flounder", old_role = "jellyfish") %>%
summary()
#> # A tibble: 9 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal sample ID original
#> 2 sample nominal flounder original
#> 3 dataset nominal dataset split variable original
#> 4 carbon numeric predictor original
#> 5 hydrogen numeric predictor original
#> 6 oxygen numeric predictor original
#> 7 nitrogen numeric predictor original
#> 8 sulfur numeric predictor original
#> 9 HHV numeric outcome original
Additional variable roles allow you to use has_role()
in
combination with other selection methods (see ?selections
)
to target specific variables in subsequent processing steps. For
example, in the following recipe, by adding the role
"nocenter"
to the HHV
predictor, you can use
-has_role("nocenter")
to exclude HHV
when
centering all_predictors()
.
%>%
multi_role add_role(HHV, new_role = "nocenter") %>%
step_center(all_predictors(), -has_role("nocenter")) %>%
prep(training = biomass, retain = TRUE) %>%
bake(new_data = NULL) %>%
head()
#> # A tibble: 6 × 8
#> sample dataset carbon hydrogen oxygen nitrogen sulfur HHV
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Akhrot Shell Training 1.52 0.181 4.37 -0.667 -0.234 20.0
#> 2 Alabama Oak Wood Waste Training 1.21 0.241 2.73 -0.877 -0.234 19.2
#> 3 Alder Training -0.475 0.341 7.68 -0.967 -0.214 18.3
#> 4 Alfalfa Training -3.19 -0.489 -2.97 2.22 -0.0736 18.2
#> 5 Alfalfa Seed Straw Training -1.53 -0.0586 2.15 -0.0772 -0.214 18.4
#> 6 Alfalfa Stalks Training -2.89 0.291 1.63 0.963 -0.134 18.5
The selector all_numeric_predictors()
can also be used
in place of the compound specification above.
You can start a recipe without any roles:
recipe(biomass) %>%
summary()
#> # A tibble: 8 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal <NA> original
#> 2 dataset nominal <NA> original
#> 3 carbon numeric <NA> original
#> 4 hydrogen numeric <NA> original
#> 5 oxygen numeric <NA> original
#> 6 nitrogen numeric <NA> original
#> 7 sulfur numeric <NA> original
#> 8 HHV numeric <NA> original
and roles can be added in bulk as needed:
recipe(biomass) %>%
update_role(contains("gen"), new_role = "lunchroom") %>%
update_role(sample, HHV, new_role = "snail") %>%
summary()
#> # A tibble: 8 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal snail original
#> 2 dataset nominal <NA> original
#> 3 carbon numeric <NA> original
#> 4 hydrogen numeric lunchroom original
#> 5 oxygen numeric lunchroom original
#> 6 nitrogen numeric lunchroom original
#> 7 sulfur numeric <NA> original
#> 8 HHV numeric snail original
All recipes steps have a role
argument that lets you set
the role of new columns generated by the step. When a recipe
modifies a column in-place, the role is never modified. For example,
?step_center
has the documentation:
role
: Not used by this step since no new variables are created
In other cases, the roles are defaulted to a relevant value based the
context. For example, ?step_dummy
has
role
: For model terms created by this step, what analysis role should they be assigned?. By default, the function assumes that the binary dummy variable columns created by the original variables will be used as predictors in a model.
So, by default, they are predictors but don’t have to be:
recipe( ~ ., data = iris) %>%
step_dummy(Species) %>%
prep() %>%
bake(new_data = NULL, all_predictors()) %>%
::select(starts_with("Species")) %>%
dplyrnames()
#> [1] "Species_X1" "Species_X2"
# or something else
recipe( ~ ., data = iris) %>%
step_dummy(Species, role = "trousers") %>%
prep() %>%
bake(new_data = NULL, has_role("trousers")) %>%
names()
#> [1] "Species_X1" "Species_X2"