crtests intends to be an application programming interface (API) for machine learning in R. Its goal is to be a flexible and fast framework for running a regression or classification test on a data set. To achieve this, the package acts as a wrapper around algorithms such as Random Forests, decision trees, or support vector machines, making sure that data is in the right format for these algorithms.
Some algorithms have special requirements, such as a maximum number of predictors, a maximum number of categories for categorical predictors, or an inability to deal with certain column types. The standard implementation of crtests deals with some of these requirements, but leaves room to handle algorithm-specific ones. This vignette describes how these can be implemented, by giving an example implementation for both Breiman’s Random Forests and rpart decision trees.
To know what to implement for a specific algorithm, it is necessary to know what is already implemented. The standard data preparation through the prepare function and its helper prepare_data handles the general requirements: it makes sure that factor levels in the holdout set also occur in the training set, and it removes missing values with na.omit.
R’s implementation of Random Forests through the randomForest package has a few specific requirements. For example, it cannot deal with levels in the dependent variable of the holdout set that were not in the training set, nor can it deal with NAs. These are handled by the default prepare and its helper method prepare_data. randomForest also cannot deal with categorical predictors with more than 32 categories. This needs to be addressed through an implementation of method_prepare.
method_prepare is an S3 generic that gets called with a method: a character of class method, which would be ‘randomForest’ in this case. The function executed depends on the method, so a function method_prepare.randomForest needs to be written.
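A minimal base R sketch of the dispatch this implies; the exact class attribute crtests attaches to the method object is an assumption here, used only to illustrate S3 dispatch:

# Assumed shape of the method object: its name plus the class 'method'
m <- structure("randomForest", class = c("randomForest", "method"))
class(m)
# Calling the generic, e.g. method_prepare(m, test, data), then makes
# UseMethod("method_prepare") select method_prepare.randomForest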
This function should reduce the number of levels of each factor to at most 32. One way is to drop all levels that are not among the 32 most frequent, but this could lead to significant loss of data through the introduction of NAs. The other option is to group the infrequent levels (those not in the top 31) into one level called ‘other’. This does not lead to data loss through NAs, so it is perhaps preferable.
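As a plain base R sketch of the grouping idea (a hypothetical helper, not crtests code):

lump_levels <- function(f, keep = 31) {
  # Keep the 'keep' most frequent levels, merge everything else into 'other'
  top <- names(sort(table(f), decreasing = TRUE))[seq_len(min(keep, nlevels(f)))]
  levels(f)[!(levels(f) %in% top)] <- "other"
  f
}

x <- factor(sample(paste0("level", 1:40), 500, replace = TRUE))
nlevels(lump_levels(x))  # at most 32: the 31 most frequent levels plus 'other'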
crtests provides a function that handles this grouping: group_levels. This is also a generic, and can be called with a data.frame or a factor. It does just what was stated before: any levels outside the maximum - 1 most frequent ones are grouped into one level called ‘other’.
So, the implementation of method_prepare.randomForest could look something like this:
method_prepare.randomForest <- function(method, test, data, ...){
  # Use group_levels to reduce the number of factor levels to at most 32
  group_levels(data = data, maximum_levels = 32)
}
Now, this satisfies the core requirement for an implementation of method_prepare: it returns a list of data containing a training and a holdout set. However, there might now be levels in the holdout set that are not in the training set, so prepare_data needs to be called again to re-level the holdout factors and remove the NAs this introduces. The complete implementation therefore looks like this:
method_prepare.randomForest <- function(method, test, data, ...){
  # Use group_levels to reduce the number of factor levels to at most 32
  data <- group_levels(data = data, maximum_levels = 32)
  # Make sure the factor levels in the holdout set also occur in the training
  # set, and remove the NAs this introduces
  data$holdout <- prepare_data(data$holdout, data$train)
  data
}
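To see why the second step matters, consider a toy split (plain R, not crtests code): a level can survive in the holdout data without ever occurring in the training data, which randomForest cannot handle.

train_dep   <- factor(c("a", "a", "b"))
holdout_dep <- factor(c("a", "c"))
setdiff(levels(holdout_dep), levels(train_dep))  # "c" is unknown to the training set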
After data preparation, crtests also provides a ‘hook’ to deal with method-specific requirements for model training. The classification_model.default function calls the machine learning method with an x, a y, and the training data. This is not always sufficient: the decision tree package rpart, for example, also requires a formula argument.
Implementing this is simple: create a classification_model.rpart function and access the test parameter that is passed to it:
classification_model.rpart <- function(method, test, x, y, training_data, ...){
  # extract_formula creates a formula of the form 'dependent ~ .',
  # which is an attribute of the test object
  f <- extract_formula(test)
  rpart(formula = f, data = training_data, method = "class")
}
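As a standalone illustration of what this hook amounts to, here is rpart used directly on a built-in data set, outside crtests; the formula mirrors the ‘dependent ~ .’ form described above:

library(rpart)

# For a test whose dependent variable is Species, 'dependent ~ .' becomes:
f <- Species ~ .
fit <- rpart(formula = f, data = iris, method = "class")
fit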
The make_predictions function is also generic, and gets called with a model produced by a machine learning algorithm, providing another hook for method-specific implementations. make_predictions is a wrapper around predict, so when a model’s predict method needs specific arguments other than the model and newdata, a new make_predictions method can be implemented.
Again, rpart has specific requirements: it needs a type parameter, either ‘class’ for classification or ‘vector’ for regression problems. The problem type can be extracted from the test parameter passed to make_predictions, so the implementation of make_predictions.rpart could look like this:
make_predictions.rpart <- function(model, data, test, ...){
  # rpart's predict() needs a 'type' argument that depends on the kind of test
  if (inherits(test, "classification")) {
    type <- "class"
  } else if (inherits(test, "regression")) {
    type <- "vector"
  } else {
    stop(paste("Tests of type", class(test)[1],
               "are not supported by make_predictions.rpart"))
  }
  predict(model, newdata = data$holdout, type = type, ...)
}
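For reference, the effect of the type argument can be reproduced with rpart directly, outside crtests:

library(rpart)

# Classification tree: type = "class" returns the predicted classes as a factor
class_fit <- rpart(Species ~ ., data = iris, method = "class")
head(predict(class_fit, newdata = iris, type = "class"))

# Regression tree: type = "vector" returns predicted numeric values
reg_fit <- rpart(Sepal.Length ~ ., data = iris, method = "anova")
head(predict(reg_fit, newdata = iris, type = "vector"))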