The purpose of this vignette is to provide a closer look at how the user-supplied model training and predict wrapper functions can be modified to give greater control over the model-building process. The goal is to present examples of how the wrapper functions could be flexibly written to keep a linear workflow in forecastML
while modeling across multiple forecast horizons and validation datasets. The alternative would be to train models across a single forecast horizon and/or validation window and customize the wrapper functions for this specific setup.
library(DT)
library(dplyr)
library(ggplot2)
library(forecastML)
library(randomForest)
data("data_seatbelts", package = "forecastML")
data <- data_seatbelts
data <- data[, c("DriversKilled", "kms", "PetrolPrice", "law")]
dates <- seq(as.Date("1969-01-01"), as.Date("1984-12-01"), by = "1 month")
data_train <- forecastML::create_lagged_df(data,
type = "train",
outcome_col = 1,
lookback = 1:12,
horizons = c(3, 12),
dates = dates,
frequency = "1 month")
# View the horizon 3 lagged dataset.
DT::datatable(head((data_train$horizon_3)), options = list("scrollX" = TRUE))
window_length = 0
means that a single validation dataset will span from window_start
to window_stop
.windows <- forecastML::create_windows(data_train, window_length = 0,
window_start = as.Date("1984-01-01"),
window_stop = as.Date("1984-12-01"))
plot(windows, data_train)
The key to customizing training across forecast horizons–here we have 2–is to modify the model training wrapper function based on the horizon-specific dataset in our lagged_df
object data_train
.
Each dataset’s forecast horizon is stored as an attribute.
## [1] 3
## [1] 12
We’ll train a Random Forest model with different settings for the 3-month and 12-month datasets.
The first argument to the user-defined model training function is always the horizon-specific dataset from create_lagged_df(type = "train")
and is passed into the wrapper function internally in train_model()
. Any number of additional parameters can be defined in this wrapper function by either (a) setting arguments here–like below–or (b) setting the arguments in train_model(...)
.
model_function <- function(data, my_outcome_col = 1, n_tree = c(200, 100)) {
outcome_names <- names(data)[my_outcome_col]
model_formula <- formula(paste0(outcome_names, "~ ."))
if (attributes(data)$horizon == 3) { # Model 1
model <- randomForest::randomForest(formula = model_formula,
data = data,
ntree = n_tree[1])
return(list("my_trained_model" = model, "n_tree" = n_tree[1],
"meta_data" = attributes(data)$horizon))
} else if (attributes(data)$horizon == 12) { # Model 2
model <- randomForest::randomForest(formula = model_formula,
data = data,
ntree = n_tree[2])
return(list("my_trained_model" = model, "n_tree" = n_tree[2],
"meta_data" = attributes(data)$horizon))
}
}
return()
values from the user-defined model_function()
. The returned values are stored in my_training_results$horizon_h$window_w$model
.## $my_trained_model
##
## Call:
## randomForest(formula = model_formula, data = data, ntree = n_tree[1])
## Type of random forest: regression
## Number of trees: 200
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 247.162
## % Var explained: 59.72
##
## $n_tree
## [1] 200
##
## $meta_data
## [1] 3
## $my_trained_model
##
## Call:
## randomForest(formula = model_formula, data = data, ntree = n_tree[2])
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 420.6497
## % Var explained: 31.45
##
## $n_tree
## [1] 100
##
## $meta_data
## [1] 12
create_lagged_df()
(type = "train"
or type = "forecast"
).prediction_function <- function(model, data_features) {
if (model$meta_data == 3) { # Perform a transformation specific to model 1.
data_pred <- data.frame("y_pred" = predict(model$my_trained_model, data_features))
}
if (model$meta_data == 12) { # Perform a transformation specific to model 2.
data_pred <- data.frame("y_pred" = predict(model$my_trained_model, data_features))
}
return(data_pred)
}
data_results <- predict(model_results,
prediction_function = list(prediction_function),
data = data_train)