library(hardhat)
library(tibble)
library(modeldata)
data(penguins)
penguins <- na.omit(penguins)
The goal of this vignette is to teach you how to use
mold()
and forge()
in a modeling package. This
is the intended use of these functions, even though they can also be
called interactively. Creating a new modeling package has two main
stages: creating the model fitting function, and implementing a predict
method. The stages break down like this:
Stage 1 - Model Fitting
Create a model constructor.
Create a fitting implementation function.
Create a common bridge to go between high level user facing methods and the lower level constructor and implementation function.
Create a user facing function with methods for data frame, matrix, formula, and recipe inputs.
I imagine that this comes together as a few internal pieces that power the user facing methods for your model.
Stage 2 - Model Prediction
Create one or more prediction implementation functions, varying
by the "type"
of prediction to make.
Create a common bridge between a high level predict method and the lower level prediction implementation functions.
Create a user facing predict method.
As in the first stage, this comes together as a single user facing predict() method. Many models have multiple internal implementation functions that you'll switch between, depending on the "type".
The end result is a single high level modeling function that has methods for multiple different "interfaces", along with a corresponding predict method that makes predictions from one of these models and new data. (By "interfaces", I just mean the different types of inputs: data frame, matrix, formula, and recipe.)
There are obviously other things that you might want your modeling
package to do. For instance, you might implement a plot()
or summary()
method. But the two stages described here are
necessary for almost every model, and they involve the inputs and
outputs that hardhat helps the most with.
We will use the underlying lm()
infrastructure,
lm.fit()
, to create our model. Linear regression should be
recognizable to many, so we can focus on understanding how
mold()
and forge()
fit in the bigger picture,
rather than trying to understand how the model works.
lm.fit()
takes x
and y
directly, rather than using the formula method. It will serve as the
core part of our modeling implementation function. More generally, it is
easiest if the core implementation function of your algorithm takes
x
and y
in this manner, since that is how
mold()
will standardize the inputs.
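To make this concrete, here is a quick sketch of what mold() returns: a named list holding the preprocessed predictors and outcomes (both as data frames), the blueprint, and any extras.

molded <- mold(body_mass_g ~ bill_length_mm, penguins)

# The standardized pieces we will work with throughout the vignette
names(molded)
#> [1] "predictors" "outcomes"   "blueprint"  "extras"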
We will call the model simple_lm()
. It won’t have all
the features of normal linear regression (weights, offsets, etc), but
will serve as a nice dummy model to show off features you get with
hardhat.
The first thing we need is a modeling constructor.
Constructors are very simple functions that create new objects of our
model class. In the arguments to the constructor you supply all of the
individual pieces, and it wraps them up into a model object. The hardhat
function new_model()
can help with creating that.
A model constructor should:
Have the name new_<model_class>()
.
Take the required model elements as named arguments, including a
required blueprint
.
Validate the types of the new elements.
Pass the named elements on to new_model()
along with
setting the class to "<model_class>"
.
If you want to learn more about the details of constructors and creating S3 classes, take a look at the S3 section in Advanced R.
new_simple_lm <- function(coefs, coef_names, blueprint) {
  if (!is.numeric(coefs)) {
    stop("`coefs` should be a numeric vector.", call. = FALSE)
  }

  if (!is.character(coef_names)) {
    stop("`coef_names` should be a character vector.", call. = FALSE)
  }

  if (length(coefs) != length(coef_names)) {
    stop("`coefs` and `coef_names` must have the same length.")
  }

  new_model(
    coefs = coefs,
    coef_names = coef_names,
    blueprint = blueprint,
    class = "simple_lm"
  )
}
A "simple_lm"
object has just enough information to make
numeric predictions on new data, but you can store other things here as
well to enable your model object to work with extra post-fitting
functionality.
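For example, a richer model could also stash its fitted values at fit time so that a residuals() or summary() method could be built on top later. A purely hypothetical sketch:

# Hypothetical: `fitted` is an invented extra element, not something
# hardhat requires. new_model() simply bundles it into the object.
richer_model <- new_model(
  coefs = 1,
  coef_names = "my_coef",
  fitted = numeric(),
  blueprint = default_xy_blueprint(),
  class = "simple_lm"
)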
We can test this by manually generating a model object. Along with
the custom class we provided, this object also has a
"hardhat_model"
class. There is a very simple print method
for objects of this type. Specifically, it prints the name of the class
at the top, and only prints out the custom elements (i.e. not the
blueprint
).
manual_model <- new_simple_lm(1, "my_coef", default_xy_blueprint())

manual_model
#> <simple_lm>
#> $coefs
#> [1] 1
#>
#> $coef_names
#> [1] "my_coef"
names(manual_model)
#> [1] "coefs" "coef_names" "blueprint"
manual_model$blueprint
#> XY blueprint:
#>
#> # Predictors: 0
#> # Outcomes: 0
#> Intercept: FALSE
#> Novel Levels: FALSE
#> Composition: tibble
The implementation function is where the hard work is done. I
generally recommend naming it <model_class>_impl()
.
It should accept predictors and outcomes in whatever form is required
for the algorithm, run the algorithm, and return a named list of the new
elements you added to the model constructor. You might also have
arguments for extra options that can be used to tweak the internal
algorithm.
simple_lm_impl <- function(predictors, outcomes) {
  lm_fit <- lm.fit(predictors, outcomes)

  coefs <- lm_fit$coefficients

  coef_names <- names(coefs)
  coefs <- unname(coefs)

  list(
    coefs = coefs,
    coef_names = coef_names
  )
}
This simple linear regression implementation just calls
lm.fit()
with x = predictors
and
y = outcomes
. lm.fit()
expects a matrix of
predictors and a vector of outcomes (at least for univariate
regression). In a moment we will discuss how to create those.
predictors <- as.matrix(subset(penguins, select = bill_length_mm))
outcomes <- penguins$body_mass_g

simple_lm_impl(predictors, outcomes)
#> $coefs
#> [1] 95.49649
#>
#> $coef_names
#> [1] "bill_length_mm"
Now that we have our constructor and our implementation function, we
can create a common function that will be used in all of our top level
methods (for data frames, matrices, formulas, and recipes). It will call
the implementation function, and then use that information along with
the blueprint to create a new instance of our model. It should have an argument for the output from a call to mold(); here I've called that processed. In that object will be (at a
minimum) the predictors, outcomes, and blueprint. You might also have
arguments for additional options to pass on to the implementation
function.
The bridge function should take the standardized predictors and
outcomes and convert them to the lower level types that the
implementation function requires. The predictors
and
outcomes
that are returned from mold()
will
always be data frames, so in this case we can convert them to
matrices and vectors directly for use in the lower level function.
This is also a good place to use some of hardhat’s validation
functions. In this case, we always expect the outcome to have a single
column since this is a univariate model, so we can use
validate_outcomes_are_univariate()
to enforce that.
simple_lm_bridge <- function(processed) {
  validate_outcomes_are_univariate(processed$outcomes)

  predictors <- as.matrix(processed$predictors)
  outcomes <- processed$outcomes[[1]]

  fit <- simple_lm_impl(predictors, outcomes)

  new_simple_lm(
    coefs = fit$coefs,
    coef_names = fit$coef_names,
    blueprint = processed$blueprint
  )
}
At this point, we can simulate user input and pass it on to our bridge to run a model.
# Simulate formula interface
processed_1 <- mold(bill_length_mm ~ body_mass_g + species, penguins)

# Simulate xy interface
processed_2 <- mold(x = penguins["body_mass_g"], y = penguins$bill_length_mm)

simple_lm_bridge(processed_1)
#> <simple_lm>
#> $coefs
#> [1] 0.003754612 24.908763524 34.817525835 28.447942512
#>
#> $coef_names
#> [1] "body_mass_g" "speciesAdelie" "speciesChinstrap" "speciesGentoo"
simple_lm_bridge(processed_2)
#> <simple_lm>
#> $coefs
#> [1] 0.01022951
#>
#> $coef_names
#> [1] "body_mass_g"
Multiple outcomes are an error:
multi_outcome <- mold(bill_length_mm + bill_depth_mm ~ body_mass_g + species, penguins)

simple_lm_bridge(multi_outcome)
#> Error in `validate_outcomes_are_univariate()`:
#> ! The outcome must be univariate, but 2 columns were found.
With all of the pieces in place, we have everything we need to create
our high level modeling interface. This should be a generic function,
generally with methods for data frames, matrices, formulas, and recipes.
Each method should call mold()
with the method specific
inputs to run the preprocessing, and then pass off to the bridge
function to run the actual model. It is also good practice to provide a
default method with a nice error message for unknown types.
# Generic
simple_lm <- function(x, ...) {
  UseMethod("simple_lm")
}

# Default
simple_lm.default <- function(x, ...) {
  stop(
    "`simple_lm()` is not defined for a '", class(x)[1], "'.",
    call. = FALSE
  )
}

# XY method - data frame
simple_lm.data.frame <- function(x, y, ...) {
  processed <- mold(x, y)
  simple_lm_bridge(processed)
}

# XY method - matrix
simple_lm.matrix <- function(x, y, ...) {
  processed <- mold(x, y)
  simple_lm_bridge(processed)
}

# Formula method
simple_lm.formula <- function(formula, data, ...) {
  processed <- mold(formula, data)
  simple_lm_bridge(processed)
}

# Recipe method
simple_lm.recipe <- function(x, data, ...) {
  processed <- mold(x, data)
  simple_lm_bridge(processed)
}
Let’s give it a try:
predictors <- penguins[c("bill_length_mm", "bill_depth_mm")]
outcomes_vec <- penguins$body_mass_g
outcomes_df <- penguins["body_mass_g"]

# Vector outcome
simple_lm(predictors, outcomes_vec)
#> <simple_lm>
#> $coefs
#> [1] 110.88151 -40.16918
#>
#> $coef_names
#> [1] "bill_length_mm" "bill_depth_mm"
# 1 column data frame outcome
simple_lm(predictors, outcomes_df)
#> <simple_lm>
#> $coefs
#> [1] 110.88151 -40.16918
#>
#> $coef_names
#> [1] "bill_length_mm" "bill_depth_mm"
# Formula interface
simple_lm(body_mass_g ~ bill_length_mm + bill_depth_mm, penguins)
#> <simple_lm>
#> $coefs
#> [1] 110.88151 -40.16918
#>
#> $coef_names
#> [1] "bill_length_mm" "bill_depth_mm"
We can use preprocessing as well, and it is handled by
mold()
.
library(recipes)
# - Log a predictor
# - Generate dummy variables for factors
simple_lm(body_mass_g ~ log(bill_length_mm) + species, penguins)
#> <simple_lm>
#> $coefs
#> [1] 3985.047 -10865.973 -11753.182 -10290.188
#>
#> $coef_names
#> [1] "log(bill_length_mm)" "speciesAdelie" "speciesChinstrap"
#> [4] "speciesGentoo"
# Same, but with a recipe
rec <- recipe(body_mass_g ~ bill_length_mm + species, penguins) %>%
  step_log(bill_length_mm) %>%
  step_dummy(species, one_hot = TRUE)

simple_lm(rec, penguins)
#> <simple_lm>
#> $coefs
#> [1] 3985.047 -10865.973 -11753.182 -10290.188
#>
#> $coef_names
#> [1] "bill_length_mm" "species_Adelie" "species_Chinstrap"
#> [4] "species_Gentoo"
You might have noticed that our linear regression isn’t adding an
intercept. Generally, with linear regression models we will want a
default intercept added on. To accomplish this, we can add an
intercept
argument to our user facing function, and then
use that to tweak the blueprint that would otherwise be created for you
automatically.
simple_lm <- function(x, ...) {
  UseMethod("simple_lm")
}

simple_lm.data.frame <- function(x, y, intercept = TRUE, ...) {
  blueprint <- default_xy_blueprint(intercept = intercept)
  processed <- mold(x, y, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.matrix <- function(x, y, intercept = TRUE, ...) {
  blueprint <- default_xy_blueprint(intercept = intercept)
  processed <- mold(x, y, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.formula <- function(formula, data, intercept = TRUE, ...) {
  blueprint <- default_formula_blueprint(intercept = intercept)
  processed <- mold(formula, data, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.recipe <- function(x, data, intercept = TRUE, ...) {
  blueprint <- default_recipe_blueprint(intercept = intercept)
  processed <- mold(x, data, blueprint = blueprint)
  simple_lm_bridge(processed)
}
# By default an intercept is included
simple_lm(predictors, outcomes_df)
#> <simple_lm>
#> $coefs
#> [1] 3413.45185 74.81263 -145.50718
#>
#> $coef_names
#> [1] "(Intercept)" "bill_length_mm" "bill_depth_mm"
# But the user can turn this off
simple_lm(body_mass_g ~ log(bill_length_mm) + species, penguins, intercept = FALSE)
#> <simple_lm>
#> $coefs
#> [1] 3985.047 -10865.973 -11753.182 -10290.188
#>
#> $coef_names
#> [1] "log(bill_length_mm)" "speciesAdelie" "speciesChinstrap"
#> [4] "speciesGentoo"
Note that even the formula method respects this
intercept
argument. To recap, by default
mold()
will not automatically add an intercept for
any method, including the formula method.
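A quick sketch to check this for yourself: with the default formula blueprint, the molded predictors should contain no intercept column.

# No intercept column unless we ask for one
colnames(mold(body_mass_g ~ bill_length_mm, penguins)$predictors)
#> [1] "bill_length_mm"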
On the prediction side, we need implementation functions, just as we did when fitting our model. These vary based on the "type" argument
to predict()
. The "type"
might be
"numeric"
for numeric predictions as we will use here, or
it could be "class"
for hard class predictions,
"prob"
for class probabilities, and more. A set of
recommended names for "type"
can be found in the Model
Predictions section of the implementation principles.
For our model, we will focus on only returning numeric predictions. I
generally like to name these prediction implementation functions
predict_<model_class>_<type>()
. The arguments
for the implementation functions should include the model object and the
predictors in the form that your prediction algorithm expects (here, a
matrix).
Also used here is another hardhat function for standardizing
prediction output, spruce_numeric()
. This function tidies
up the numeric response output, and automatically standardizes it to
match the recommendations in the principles guide. The output is always
a tibble, and for the "numeric"
type it has 1 column,
.pred
.
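Before we wire it into an implementation function, here is a small standalone sketch of spruce_numeric() on a toy vector:

spruce_numeric(c(1.5, 2.5))
#> # A tibble: 2 × 1
#>   .pred
#>   <dbl>
#> 1   1.5
#> 2   2.5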
predict_simple_lm_numeric <- function(object, predictors) {
  coefs <- object$coefs

  pred <- as.vector(predictors %*% coefs)

  out <- spruce_numeric(pred)

  out
}
To test it, we will have to run a model and call forge()
on the output manually. The higher level user facing function will do
this automatically.
model <- simple_lm(bill_length_mm ~ body_mass_g + species, penguins)

predictors <- forge(penguins, model$blueprint)$predictors
predictors <- as.matrix(predictors)

predict_simple_lm_numeric(model, predictors)
#> # A tibble: 333 × 1
#> .pred
#> <dbl>
#> 1 39.0
#> 2 39.2
#> 3 37.1
#> 4 37.9
#> 5 38.6
#> 6 38.5
#> 7 42.5
#> 8 36.9
#> 9 39.2
#> 10 41.4
#> # … with 323 more rows
A prediction bridge converts the standardized predictors
into the lower level type that the prediction implementation functions
expect. predictors
here is always a data frame, and is part
of the return value of a call to forge()
. Since the
prediction implementation function takes a matrix, we convert it to that
here.
Additionally, it should switch()
on the
type
argument to decide which of the prediction
implementation functions to call. Here, when
type == "numeric"
, predict_simple_lm_numeric()
is called.
I also like using rlang::arg_match()
to validate that
type
is one of the accepted prediction types. This has an
advantage over match.arg()
in that partial matches are not
allowed, and the error messages are a bit nicer.
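To see the difference, here is base R's match.arg() silently accepting the same partial match that rlang::arg_match() will reject later on:

match.arg("numer", c("numeric", "class"))
#> [1] "numeric"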
predict_simple_lm_bridge <- function(type, object, predictors) {
  type <- rlang::arg_match(type, "numeric")

  predictors <- as.matrix(predictors)

  switch(
    type,
    numeric = predict_simple_lm_numeric(object, predictors)
  )
}
Let’s test:
model <- simple_lm(bill_length_mm ~ body_mass_g + species, penguins)

# Pass in the data frame
predictors <- forge(penguins, model$blueprint)$predictors

predict_simple_lm_bridge("numeric", model, predictors)
#> # A tibble: 333 × 1
#> .pred
#> <dbl>
#> 1 39.0
#> 2 39.2
#> 3 37.1
#> 4 37.9
#> 5 38.6
#> 6 38.5
#> 7 42.5
#> 8 36.9
#> 9 39.2
#> 10 41.4
#> # … with 323 more rows
# Partial matches are an error
predict_simple_lm_bridge("numer", model, predictors)
#> Error in `predict_simple_lm_bridge()`:
#> ! `type` must be one of "numeric", not "numer".
#> ℹ Did you mean "numeric"?
Finally, we can create an S3 method for the generic
predict()
function. To match the modeling principles, it
should use new_data
to accept a matrix or data frame of new
predictors.
The first thing that the predict()
method should do is
call forge()
with the new_data
and the
blueprint
that we attached to our simple_lm
model at fit time. This performs the required preprocessing on the new
data, and checks that the type of the data frame supplied to
new_data
matches the type of data frame supplied at fit
time. This is one of the most valuable features of forge()
,
as it adds a large amount of robustness to your predict()ion function. You will see some examples of this
at the end of the vignette.
After calling forge()
, it should pass off to the bridge
function to call the correct prediction function based on the type.
Finally, it is good practice to call the hardhat function,
validate_prediction_size()
, with the return value and the
original new_data
to ensure that the number of rows in the output is the same as the number of rows of the input. If a prediction
cannot be made for a row of new_data
, an NA
value should be placed there instead. Mainly, this validation function
is a check on the model developer to ensure that you always return
output with a sane length.
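As a small sketch of the kind of mistake this catches, here is a mismatched call by hand, one row of predictions against three rows of data:

# 1 row of predictions vs 3 rows of `new_data` is an error
try(validate_prediction_size(spruce_numeric(1), penguins[1:3, ]))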
predict.simple_lm <- function(object, new_data, type = "numeric", ...) {
  # Enforces column order, type, column names, etc
  processed <- forge(new_data, object$blueprint)

  out <- predict_simple_lm_bridge(type, object, processed$predictors)

  validate_prediction_size(out, new_data)

  out
}
Finally, we can test out our top level modeling function along with
its corresponding predict()
method.
model <- simple_lm(bill_length_mm ~ log(body_mass_g) + species, penguins)

predict(model, penguins)
#> # A tibble: 333 × 1
#> .pred
#> <dbl>
#> 1 39.1
#> 2 39.3
#> 3 36.9
#> 4 37.9
#> 5 38.7
#> 6 38.6
#> 7 42.5
#> 8 36.7
#> 9 39.3
#> 10 41.5
#> # … with 323 more rows
By using forge()
, you automatically get powerful type
checking to ensure that the new_data
is in a form that you
expect.
# `new_data` isn't a data frame
predict(model, penguins$species)
#> Error in `forge()`:
#> ! The class of `new_data`, 'factor', is not recognized.
# Missing a required column
predict(model, subset(penguins, select = -body_mass_g))
#> Error in `validate_column_names()`:
#> ! The following required columns are missing: 'body_mass_g'.
# In this case, 'species' is a character,
# but can be losslessly converted to a factor.
# That happens for you automatically and silently.
penguins_chr_species <- transform(penguins, species = as.character(species))

predict(model, penguins_chr_species)
#> # A tibble: 333 × 1
#> .pred
#> <dbl>
#> 1 39.1
#> 2 39.3
#> 3 36.9
#> 4 37.9
#> 5 38.7
#> 6 38.6
#> 7 42.5
#> 8 36.7
#> 9 39.3
#> 10 41.5
#> # … with 323 more rows
# Slightly different from above. Here, 'species' is a character,
# AND has an extra unexpected factor level. It is
# removed with a warning, but you still get a factor
# with the correct levels
penguins_chr_bad_species <- penguins_chr_species
penguins_chr_bad_species$species[1] <- "new_level"

predict(model, penguins_chr_bad_species)
#> Warning: Novel levels found in column 'species': 'new_level'. The levels have
#> been removed, and values have been coerced to 'NA'.
#> # A tibble: 333 × 1
#> .pred
#> <dbl>
#> 1 NA
#> 2 39.3
#> 3 36.9
#> 4 37.9
#> 5 38.7
#> 6 38.6
#> 7 42.5
#> 8 36.7
#> 9 39.3
#> 10 41.5
#> # … with 323 more rows
# This case throws an error.
# Here, 'species' is a double when it should have been a factor.
# You can't cast a double to a factor!
penguins_dbl_species <- transform(penguins, species = 1)

predict(model, penguins_dbl_species)
#> Error in `scream()`:
#> ! Can't convert `data$species` <double> to match type of `species` <factor<b22a0>>.