When recipe steps are used, there are several ways to select which variables or features a step should act on.
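Three main characteristics of variables can be queried: the name of the variable, its data type (for example, numeric or nominal), and the role it was assigned in the recipe.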
The manual pages for ?selections and ?has_role have details about the available selection methods.
To illustrate this, the credit data will be used:
library(recipes)
library(modeldata)
data("credit_data")
str(credit_data)
#> 'data.frame': 4454 obs. of 14 variables:
#> $ Status : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
#> $ Seniority: int 9 17 10 0 0 1 29 9 0 0 ...
#> $ Home : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
#> $ Time : int 60 60 36 60 36 60 60 12 60 48 ...
#> $ Age : int 30 58 46 24 26 36 44 27 32 41 ...
#> $ Marital : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
#> $ Records : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
#> $ Job : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
#> $ Expenses : int 73 48 90 63 46 75 75 35 90 90 ...
#> $ Income : int 129 131 200 182 107 214 125 80 107 80 ...
#> $ Assets : int 0 0 3000 2500 0 3500 10000 0 15000 0 ...
#> $ Debt : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Amount : int 800 1000 2000 900 310 650 1600 200 1200 1200 ...
#> $ Price : int 846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)
rec
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 4
Before any steps are used the information on the original variables is:
summary(rec, original = TRUE)
#> # A tibble: 5 × 5
#> variable type role source required_to_bake
#> <chr> <chr> <chr> <chr> <lgl>
#> 1 Seniority numeric predictor original TRUE
#> 2 Time numeric predictor original TRUE
#> 3 Age numeric predictor original TRUE
#> 4 Records nominal predictor original TRUE
#> 5 Status nominal outcome original FALSE
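The role column in this summary is what role-based selectors work from. As a minimal sketch (the helper names `info`, `predictor_names`, and `outcome_names` are illustrative, not part of recipes), the variables that `all_predictors()` and `all_outcomes()` would match can be read straight from the summary:

```r
library(recipes)
library(modeldata)

data("credit_data")
rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)

# summary() returns one row per variable, with its name and declared role.
info <- summary(rec)

# Variables that a role-based selector such as all_predictors() would match.
predictor_names <- info$variable[info$role == "predictor"]

# Variables that all_outcomes() would match.
outcome_names <- info$variable[info$role == "outcome"]

predictor_names
outcome_names
```

Only the `variable` and `role` columns are used here, since the representation of the `type` column may differ between recipes versions.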
We can add a step to compute dummy variables on the non-numeric data after we impute any missing data:
dummied <- rec %>% step_dummy(all_nominal())
This will capture any variables that are either character strings or factors: Status and Records.
However, since Status is our outcome, we might want to keep it as a factor, so we can subtract that variable out either by name or by role:
dummied <- rec %>% step_dummy(Records) # or
dummied <- rec %>% step_dummy(all_nominal(), -Status) # or
dummied <- rec %>% step_dummy(all_nominal(), -all_outcomes())
Using the last definition:
dummied <- prep(dummied, training = credit_data)
with_dummy <- bake(dummied, new_data = credit_data)
with_dummy
#> # A tibble: 4,454 × 5
#> Seniority Time Age Status Records_X1
#> <int> <int> <int> <fct> <dbl>
#> 1 9 60 30 good -1
#> 2 17 60 58 good -1
#> 3 10 36 46 bad 1
#> 4 0 60 24 good -1
#> 5 0 36 26 good -1
#> 6 1 60 36 good -1
#> 7 29 60 44 good -1
#> 8 9 12 27 good -1
#> 9 0 60 32 good -1
#> 10 0 48 41 bad -1
#> # … with 4,444 more rows
Status is unaffected.
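The same result can also be reached with a selector that combines type and role in one call. The sketch below assumes a recipes version that provides `all_nominal_predictors()`; the object names `dummied_alt` and `baked_alt` are illustrative only.

```r
library(recipes)
library(modeldata)

data("credit_data")
rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)

# Select nominal variables that also have the predictor role,
# so the outcome (Status) is never passed to step_dummy().
dummied_alt <- rec %>% step_dummy(all_nominal_predictors())

# Estimate the step on the training data, then apply it.
dummied_alt <- prep(dummied_alt, training = credit_data)
baked_alt <- bake(dummied_alt, new_data = credit_data)

# Status should still be a factor; Records should be replaced by a dummy column.
names(baked_alt)
```

This avoids having to subtract the outcome explicitly, at the cost of depending on roles being declared correctly in the recipe.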
One important aspect of selecting variables in steps is that the variable names and types may change as steps are executed. In the above example, Records is a factor variable before the step is executed. Afterwards, Records is gone and a binary indicator variable is in its place (it appears as Records_X1 in the output above; under the default treatment contrasts it would be named Records_yes). One reason to have general selection routines like all_predictors() or contains() is to be able to select variables that have not been created yet.
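As a hedged illustration of this point, the sketch below adds a second step whose selector matches a column that only exists after the dummy step has run. The choice of `step_center()` and the prefix `"Records_"` are assumptions made for the example; the prefix relies on step_dummy()'s default naming.

```r
library(recipes)
library(modeldata)

data("credit_data")

rec_chain <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data) %>%
  # Creates the indicator column(s) for Records when the recipe is prepped.
  step_dummy(all_nominal_predictors()) %>%
  # No column starting with "Records_" exists in credit_data itself;
  # the selector is resolved only after step_dummy() has been applied.
  step_center(starts_with("Records_"))

prepped <- prep(rec_chain, training = credit_data)

# new_data = NULL returns the processed training set.
baked <- bake(prepped, new_data = NULL)

# Name(s) of the column(s) the second step ended up selecting.
dummy_cols <- grep("^Records_", names(baked), value = TRUE)
dummy_cols
```

Because the selector is evaluated during `prep()`, the recipe can be written before the dummy column names are known.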