This document demonstrates some basic uses of recipes. First, some definitions are required:
- variables are the original (raw) data columns in a data frame or tibble. For example, in a traditional formula Y ~ A + B + A:B, the variables are A, B, and Y.
- roles define how variables will be used in the model. Examples are: predictor (independent variables), response, and case weight. This is meant to be open-ended and extensible.
- terms are columns in a design matrix such as A, B, and A:B. These can be other derived entities that are grouped, such as a set of principal components or a set of columns that define a basis function for a variable. These are synonymous with features in machine learning. Variables that have predictor roles would automatically be main effect terms.

The package contains a data set used to predict whether a person
will pay back a bank loan. It has 13 predictor columns and a factor
variable Status
(the outcome). We will first separate the
data into a training and test set:
library(recipes)
library(rsample)
library(modeldata)
data("credit_data")
set.seed(55)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)
Note that there are some missing values in these data:
vapply(credit_train, function(x) mean(!is.na(x)), numeric(1))
#> Status Seniority Home Time Age Marital Records Job
#> 1.000 1.000 0.998 1.000 1.000 1.000 1.000 0.999
#> Expenses Income Assets Debt Amount Price
#> 1.000 0.910 0.989 0.996 1.000 1.000
Rather than remove these, their values will be imputed.
The idea is that the preprocessing operations will all be created using the training set and then these steps will be applied to both the training and test set.
First, we will create a recipe object from the original data and then specify the processing steps.
Recipes can be created manually by sequentially adding roles to variables in a data set.
If the analysis only requires outcomes and predictors, the easiest way to create the initial recipe is to use the standard formula method:
rec_obj <- recipe(Status ~ ., data = credit_train)
rec_obj
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
The data contained in the data
argument need not be the
training set; this data is only used to catalog the names of the
variables and their types (e.g. numeric, etc.).
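For example, since only the column names and types are consulted here, a small slice of the data would work just as well (a minimal sketch):

## A sketch: only column names and types are cataloged at this stage,
## so a few rows are enough to declare the recipe.
rec_small <- recipe(Status ~ ., data = head(credit_train))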
(Note that the formula method is used here to declare the variables,
their roles and nothing else. If you use inline functions
(e.g. log
) it will complain. These types of operations can
be added later.)
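For example, rather than writing log(Price) in the formula, the transformation can be added afterwards as its own step (a minimal sketch using step_log()):

## A sketch: declare roles with a plain formula, then add the log
## transformation as a separate step instead of inline in the formula.
rec_log <- recipe(Status ~ ., data = credit_train) %>%
  step_log(Price)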
From here, preprocessing steps for some step X can be added sequentially in one of two ways:
rec_obj <- step_{X}(rec_obj, arguments)
## or
rec_obj <- rec_obj %>% step_{X}(arguments)
step_dummy
and the other functions will always return
updated recipes.
One other important facet of the code is the method for specifying
which variables should be used in different steps. The manual page
?selections
has more details, but dplyr-like selector functions can be used:

- use basic variable names (e.g. x1, x2),
- use dplyr functions for selecting variables: contains(), ends_with(), everything(), matches(), num_range(), and starts_with(),
- use functions that subset on the role of the variables that have been specified so far: all_outcomes(), all_predictors(), and has_role(),
- use similar functions for the type of data: all_nominal(), all_numeric(), and has_type(), or
- use compound selectors such as all_nominal_predictors() or all_numeric_predictors().

Note that the methods listed above are the only ones that can be used to select variables inside the steps. Also, minus signs can be used to deselect variables.
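As a quick sketch of how selectors combine, a minus sign removes variables from a broader selection:

## A sketch: center every numeric predictor except those whose
## names contain "Amount".
rec_sel <- recipe(Status ~ ., data = credit_train) %>%
  step_center(all_numeric_predictors(), -contains("Amount"))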
For our data, we can add an operation to impute the predictors. There
are many ways to do this and recipes
includes a few steps
for this purpose:
grep("impute_", ls("package:recipes"), value = TRUE)
#> [1] "step_impute_bag" "step_impute_knn" "step_impute_linear"
#> [4] "step_impute_lower" "step_impute_mean" "step_impute_median"
#> [7] "step_impute_mode" "step_impute_roll"
Here, K-nearest neighbor imputation will be used. This works for both numeric and non-numeric predictors and defaults K to five. To do this, the step selects all predictors:
imputed <- rec_obj %>%
  step_impute_knn(all_predictors())
imputed
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
#>
#> Operations:
#>
#> K-nearest neighbor imputation for all_predictors()
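If the default of five neighbors is not suitable, it can be changed via the neighbors argument (a minimal sketch, not carried forward here):

## A sketch: the same imputation step with ten neighbors
## instead of the default five.
imputed_k10 <- rec_obj %>%
  step_impute_knn(all_predictors(), neighbors = 10)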
It is important to realize that the specific variables have not been declared yet (as shown when the recipe is printed above). In some preprocessing steps, variables will be added or removed from the current list of possible variables.
Since some predictors are categorical in nature (i.e. nominal), it
would make sense to convert these factor predictors into numeric dummy
variables (aka indicator variables) using step_dummy()
. To
do this, the step selects all non-numeric predictors:
ind_vars <- imputed %>%
  step_dummy(all_nominal_predictors())
ind_vars
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
#>
#> Operations:
#>
#> K-nearest neighbor imputation for all_predictors()
#> Dummy variables from all_nominal_predictors()
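By default, step_dummy() creates C - 1 indicator columns for a factor with C levels. A full one-hot encoding can be requested with the one_hot argument instead (a sketch, not carried forward here):

## A sketch: keep one indicator column per factor level
## rather than the default C - 1 contrasts.
ind_vars_onehot <- imputed %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)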
At this point in the recipe, all of the predictors should be encoded as numeric, so we can add more steps to center and scale them:
standardized <- ind_vars %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())
standardized
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
#>
#> Operations:
#>
#> K-nearest neighbor imputation for all_predictors()
#> Dummy variables from all_nominal_predictors()
#> Centering for all_numeric_predictors()
#> Scaling for all_numeric_predictors()
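As an aside, step_normalize() wraps both of these operations into a single step; a sketch of an equivalent recipe:

## A sketch: step_normalize() centers and scales in one step.
standardized_alt <- ind_vars %>%
  step_normalize(all_numeric_predictors())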
If these are the only preprocessing steps for the predictors, we can
now estimate the means and standard deviations from the training set.
The prep
function is used with a recipe and a data set:
trained_rec <- prep(standardized, training = credit_train)
trained_rec
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
#>
#> Training data contained 3340 data points and 322 incomplete rows.
#>
#> Operations:
#>
#> K-nearest neighbor imputation for Seniority, Home, Time, Age, Marital, Records, ... [trained]
#> Dummy variables from Home, Marital, Records, Job [trained]
#> Centering for Seniority, Time, Age, Expenses, Income, Assets,... [trained]
#> Scaling for Seniority, Time, Age, Expenses, Income, Assets,... [trained]
Note that the real variables are listed (e.g. Home
etc.)
instead of the selectors (all_numeric_predictors()
).
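The estimated statistics themselves can be inspected with the tidy() method (a sketch; the number argument selects a step by its position in the recipe):

## A sketch: extract the means learned by the centering step,
## the third operation in this recipe.
tidy(trained_rec, number = 3)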
Now that the statistics have been estimated, the preprocessing can be applied to the training and test set:
train_data <- bake(trained_rec, new_data = credit_train)
test_data <- bake(trained_rec, new_data = credit_test)
bake
returns a tibble that, by default, includes all of
the variables:
class(test_data)
#> [1] "tbl_df" "tbl" "data.frame"
test_data
#> # A tibble: 1,114 × 23
#> Seniority Time Age Expenses Income Assets Debt Amount Price
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.09 0.924 1.88 -0.385 -0.131 -0.488 -0.295 -0.0817 0.297
#> 2 -0.977 0.924 -0.459 1.77 -0.437 0.845 -0.295 0.333 0.760
#> 3 -0.977 0.103 0.349 1.77 -0.783 -0.488 -0.295 0.333 0.00254
#> 4 -0.247 0.103 -0.280 0.231 -0.207 -0.133 -0.295 0.229 0.171
#> 5 -0.125 -0.718 -0.729 0.231 -0.258 -0.222 -0.295 -0.807 -0.854
#> 6 -0.855 0.924 -0.549 -1.05 -0.0539 -0.488 -0.295 0.436 -0.331
#> 7 2.31 0.924 0.349 0.949 -0.0155 -0.488 -0.295 -0.185 0.0475
#> 8 0.848 -0.718 0.529 1.00 1.40 -0.133 -0.295 1.58 1.69
#> 9 -0.977 -0.718 -1.27 -0.538 -0.246 -0.266 -0.295 -1.32 -1.65
#> 10 -0.855 0.514 -0.100 0.744 -0.540 -0.488 -0.295 -0.185 -0.800
#> # … with 1,104 more rows, and 14 more variables: Status <fct>, Home_X1 <dbl>,
#> # Home_X2 <dbl>, Home_X3 <dbl>, Home_X4 <dbl>, Home_X5 <dbl>,
#> # Marital_X1 <dbl>, Marital_X2 <dbl>, Marital_X3 <dbl>, Marital_X4 <dbl>,
#> # Records_X1 <dbl>, Job_X1 <dbl>, Job_X2 <dbl>, Job_X3 <dbl>
vapply(test_data, function(x) mean(!is.na(x)), numeric(1))
#> Seniority Time Age Expenses Income Assets Debt
#> 1 1 1 1 1 1 1
#> Amount Price Status Home_X1 Home_X2 Home_X3 Home_X4
#> 1 1 1 1 1 1 1
#> Home_X5 Marital_X1 Marital_X2 Marital_X3 Marital_X4 Records_X1 Job_X1
#> 1 1 1 1 1 1 1
#> Job_X2 Job_X3
#> 1 1
Selectors can also be used. For example, if only the predictors are
needed, you can use
bake(object, new_data, all_predictors())
.
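For example (a minimal sketch):

## A sketch: return only the predictor columns from the test set.
test_predictors <- bake(trained_rec, new_data = credit_test, all_predictors())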
There are a number of other steps included in the package:
#> [1] "step_BoxCox" "step_YeoJohnson"
#> [3] "step_arrange" "step_bagimpute"
#> [5] "step_bin2factor" "step_bs"
#> [7] "step_center" "step_classdist"
#> [9] "step_corr" "step_count"
#> [11] "step_cut" "step_date"
#> [13] "step_depth" "step_discretize"
#> [15] "step_dummy" "step_dummy_extract"
#> [17] "step_dummy_multi_choice" "step_factor2string"
#> [19] "step_filter" "step_filter_missing"
#> [21] "step_geodist" "step_harmonic"
#> [23] "step_holiday" "step_hyperbolic"
#> [25] "step_ica" "step_impute_bag"
#> [27] "step_impute_knn" "step_impute_linear"
#> [29] "step_impute_lower" "step_impute_mean"
#> [31] "step_impute_median" "step_impute_mode"
#> [33] "step_impute_roll" "step_indicate_na"
#> [35] "step_integer" "step_interact"
#> [37] "step_intercept" "step_inverse"
#> [39] "step_invlogit" "step_isomap"
#> [41] "step_knnimpute" "step_kpca"
#> [43] "step_kpca_poly" "step_kpca_rbf"
#> [45] "step_lag" "step_lincomb"
#> [47] "step_log" "step_logit"
#> [49] "step_lowerimpute" "step_meanimpute"
#> [51] "step_medianimpute" "step_modeimpute"
#> [53] "step_mutate" "step_mutate_at"
#> [55] "step_naomit" "step_nnmf"
#> [57] "step_nnmf_sparse" "step_normalize"
#> [59] "step_novel" "step_ns"
#> [61] "step_num2factor" "step_nzv"
#> [63] "step_ordinalscore" "step_other"
#> [65] "step_pca" "step_percentile"
#> [67] "step_pls" "step_poly"
#> [69] "step_profile" "step_range"
#> [71] "step_ratio" "step_regex"
#> [73] "step_relevel" "step_relu"
#> [75] "step_rename" "step_rename_at"
#> [77] "step_rm" "step_rollimpute"
#> [79] "step_sample" "step_scale"
#> [81] "step_select" "step_shuffle"
#> [83] "step_slice" "step_spatialsign"
#> [85] "step_sqrt" "step_string2factor"
#> [87] "step_time" "step_unknown"
#> [89] "step_unorder" "step_window"
#> [91] "step_zv"
Another type of operation that can be added to a recipe is a check. Checks conduct some sort of data validation and, if no issue is found, return the data as-is; otherwise, an error is thrown.
For example, check_missing
will fail if any of the
variables selected for validation have missing values. This check is
done when the recipe is prepared as well as when any data are baked.
Checks are added in the same way as steps:
trained_rec <- trained_rec %>%
  check_missing(contains("Marital"))
Currently, recipes
includes:
#> [1] "check_class" "check_cols" "check_missing" "check_name"
#> [5] "check_new_data" "check_new_values" "check_range" "check_type"
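The other checks are added the same way. As a final sketch, check_range() (which, per its description, is assumed here to learn a variable's bounds when the recipe is prepared) would fail if a variable strays outside the range seen in the training set:

## A sketch: fail at bake() time if Age falls outside the range
## observed in the training data (bounds assumed learned at prep()).
checked_rec <- trained_rec %>%
  check_range(Age)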