swag is a package that implements a meta-learning procedure combining screening and wrapper methods to find sets of extremely low-dimensional attribute combinations. swag works on top of the caret package and proceeds in a forward-step manner: it builds and tests learners starting from very few attributes and increases the number of attributes at each step, up to a user-defined maximum. For each fixed number of attributes, the algorithm tests various (randomly selected) learners and retains those with the best performance in terms of training error. Throughout, the algorithm uses the information coming from the best learners at the previous step to build and test learners in the following step. In the end, it outputs a set of strong low-dimensional learners.
Given the above intuitive description, we now provide a more formal introduction and, to this end, define some basic notation. Let \(\mathbf{y} \in \mathbb{R}^n\) denote the response and \(\mathbf{X} \in \mathbb{R}^{n \times p}\) an attribute matrix with \(n\) instances and \(p\) attributes, the latter indexed by the set \(\mathcal{S} := \{1, \dots, p\}\). In addition, a generic learning mechanism is denoted as \(\mathcal{L} := \mathcal{L}(\mathbf{y}, \mathbf{X})\), with \(l\) denoting a general learner built using (i) the learning mechanism \(\mathcal{L}\) and (ii) a subset of attributes in \(\mathbf{X}\).
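To make the procedure concrete, the following minimal sketch outlines this forward-step logic in plain R. It is an illustration under our reading of the description above, not the package's actual implementation: swag_sketch and cv_error are hypothetical names, with cv_error standing for a function that returns the cross-validation training error of the learner built on a given attribute subset.
# Schematic sketch of the swag forward-step procedure (illustration only)
swag_sketch <- function(p, p_max, alpha, m, cv_error) {
  # dimension 1: screen all p single-attribute learners
  cand <- as.list(seq_len(p))
  err <- vapply(cand, cv_error, numeric(1))
  sel <- cand[err <= quantile(err, alpha)]
  screened <- unique(unlist(sel)) # attributes retained by the screening
  for (d in 2:p_max) {
    # dimension d: grow randomly chosen selected subsets by one screened attribute
    cand <- unique(replicate(m, {
      base <- sel[[sample.int(length(sel), 1)]]
      pool <- setdiff(screened, base)
      sort(c(base, pool[sample.int(length(pool), 1)]))
    }, simplify = FALSE))
    err <- vapply(cand, cv_error, numeric(1))
    sel <- cand[err <= quantile(err, alpha)] # keep the best alpha-quantile
  }
  sel # the set of strong low-dimensional learners
}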
Before getting started, install the remotes package. Then, swag can be installed directly from GitHub with the following code:
remotes::install_github("SMAC-Group/SWAG-R-Package")
library(swag) #load the new package
The purpose of this section is to give a general sense of the package, including its components, what they do and some basic usage. We will briefly go over the main functions, see the basic operations and have a look at the outputs. After this section, you should have a better idea of which functions are available, which one to choose, and where to find more detailed information.
We propose to use a dataset readily available from the mlbench package. The dataset consists of a cohort of \(n = 699\) patients and the objective is to predict whether a new patient has a malignant tumour given a collection of \(p = 9\) attributes (type ?mlbench::BreastCancer for more details). We start by splitting the data into a training and a test set. Alternatively, you can load your own data, or use data saved in your workspace, following exactly the same steps outlined in the next paragraphs.
# After having installed the mlbench package
data(BreastCancer, package = "mlbench")
# Pre-processing of the data
y <- BreastCancer$Class # response variable
x <- as.matrix(BreastCancer[setdiff(names(BreastCancer), c("Id", "Class"))]) # features
# remove missing values and change to 'numeric'
id <- which(apply(x, 1, function(x) sum(is.na(x))) > 0)
y <- y[-id]
x <- x[-id, ]
x <- apply(x, 2, as.numeric)
# Training and test set
set.seed(180) # for replication
ind <- sample(1:dim(x)[1], dim(x)[1]*0.2)
y_test <- y[ind]
y_train <- y[-ind]
x_test <- x[ind, ]
x_train <- x[-ind, ]
Now we are ready to train the swag on the breast cancer dataset. As previously mentioned, we build upon the framework of the caret package, so experienced users of that package will find the whole implementation familiar. In any case, we will explain in detail all the important steps needed for swag, and we refer the interested reader to the detailed caret e-book.
Before getting started, we load the caret library:
## if not installed
## install.packages("caret")
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
The first step is to fix the meta-parameters of the swag procedure: \(p_{max}\), \(\alpha\) and \(m\). As the name suggests, \(p_{max}\) is the maximum dimension of attributes that the user wants to input into a generic \(\mathcal{L}(\mathbf{y}, \mathbf{X})\). Based on this parameter, the swag aims at exploring the space of attributes in order to find sets of learners using \(\hat{p}\) attributes (\(1 \leq \hat{p} \leq p_{\text{max}}\)) with extremely high predictive power. To do so, the algorithm makes use of the step-wise screening procedure described briefly in the introduction. Another key element is \(\alpha\): a performance quantile which represents the percentage of learners selected at each dimension \(1 \leq \hat{p} \leq p_{max}\). Finally, we need to choose \(m\), which represents the maximum number of learners trained at each dimension \(\hat{p} > 1\) (i.e. we train all \(p\) learners of dimension \(1\)). We can fix all these meta-parameters, together with a seed for replicability and verbose = TRUE to get a message as each dimension is completed, via the swagControl() function, which behaves similarly to the trControl = argument of caret.
# Meta-parameters chosen for the breast cancer dataset
swagcon <- swagControl(pmax = 4L,
                       alpha = 0.5,
                       m = 20L,
                       seed = 163L, # for replicability
                       verbose = TRUE # keeps track of completed dimensions
)
# Given the low-dimensional dataset, we can afford a wider search by fixing alpha = 0.5,
# as a smaller alpha may also stop the training procedure earlier than expected.
If you do not specify these values, you get the defaults: \(p_{max} = 3\), \(\alpha = 0.05\) and \(m = 100\). Ideally, with unlimited computing power, \(p_{max}\) and \(m\) should be as large as possible, i.e. \(p_{\text{max}} = p\) and \(m = \binom{p}{\lceil \frac{p}{2}\rceil}\). However, this is typically unrealistic and therefore the choice of these parameters must be based mainly on interpretability/replicability requirements as well as on the available computing power and time constraints. Below are some rules-of-thumb for the choice of these parameters (a short numerical illustration follows the list):
\(\mathbf{p_{\text{max}}}\): Fixing the available computing power and the efficiency of the learning mechanism \(\mathcal{L}\), this parameter will depend on the total dimension of the problem \(p\). Indeed, the goal of swag is to find extremely small-dimensional learners. Therefore, even with very large \(p\), one could always fix this parameter within a range of 5-20 (or smaller) for interpretability and/or replicability purposes. In addition, if an embedded method is computationally efficient on the entire dataset, this parameter could be set to the number of attributes selected by this method (given computational constraints). Another criterion, when working with binary classification problems, is to use the Events Per Variable (EPV) rule. In future work, this parameter could be implicitly determined by the algorithm based on the training error quantile (or another metric), thereby defining \(p_{\text{max}}\) as the attribute dimension where the training error curve stops decreasing significantly, similarly to the scree plot in factor or principal component analysis.
\(\boldsymbol{\alpha}\): this parameter is related to the maximum number of learners \(m\). The larger \(\alpha\), the more the attribute space is explored. Ideally, we want a small \(\alpha\) since we wish to select only strong learners (with extremely low training error), which is possible if \(m\) is large enough. Generally, good values for \(\alpha\) are \(0.01\) or \(0.05\), implying that (roughly) 1% or 5% of the \(m\) learners are selected at each step.
\(\mathbf{m}\): Fixing the available computing power and the efficiency of the learning mechanism \(\mathcal{L}\), this parameter determines the proportion of the attribute space that will be explored by the algorithm. It depends on the size of the problem \(p\), since we necessarily have \(m \geq p\) for the screening step with models of unitary dimension. In addition, this parameter needs to be chosen considering the performance percentile \(\alpha\): if both \(m\) and \(\alpha\) are small, then the number of strong learners being selected could be extremely low (possibly zero). In general, we would want a large \(m\) (so that \(\alpha\) can eventually be chosen to be very small) and, denoting by \(p^\star\) the number of attributes released from the first screening, a rule-of-thumb is to set \(m = \binom{p^\star}{2}\) (or close to it) in order to explore the entire (or most of the) subspace of two-dimensional learners generated by \(p^\star\).
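As a quick numerical illustration of these quantities for our dataset (\(p = 9\)), where the value of \(p^\star\) is hypothetical:
# Combinatorial quantities from the rules-of-thumb above
p <- 9
choose(p, ceiling(p/2)) # "ideal" upper bound for m: 126 learners
# if, say, p_star = 7 attributes survived the first screening (hypothetical)
choose(7, 2)            # rule-of-thumb for m: 21 two-dimensional learners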
Having set up the meta-parameters as explained above, we are now ready to train the swag. We start with the Support Vector Machine learner, both linear and radial, as displayed in the chunk below:
## SVM Linear Learner
## `kernlab` is needed
## if not installed, install.packages("kernlab")
train_swag_svml <- swag(
  # arguments for swag
  x = x_train,
  y = y_train,
  control = swagcon,
  auto_control = FALSE,
  # arguments for caret
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 1, allowParallel = FALSE), # trainControl is from the caret package
  metric = "Accuracy",
  method = "svmLinear", # use method = "svmRadial" to train this specific learner
  preProcess = c("center", "scale")
)
#> [1] "Dimension explored: 1 - CV errors at alpha: 0.115"
#> [1] "Dimension explored: 2 - CV errors at alpha: 0.0549"
#> [1] "Dimension explored: 3 - CV errors at alpha: 0.0403"
#> [1] "Dimension explored: 4 - CV errors at alpha: 0.0394"
The only difference with respect to the classic caret train() function is the specification of the swag arguments explained previously. To give an overview, in the above chunk for the svmLinear learner we have chosen 10-fold cross-validation (repeated once) as our estimator of the out-of-sample accuracy, which is the metric we selected to evaluate the classifier's performance. For this specific case, we have chosen to center and rescale the data, as is usually done for SVMs, and the parameter controlling the margin of the SVM is automatically fixed at its unitary value (i.e. \(C = 1\)). For further explanations, we refer the interested reader to the detailed caret e-book.
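For completeness, the radial kernel variant mentioned in the code comment above would be trained with the same call, only changing the method argument (a sketch, not run here):
## SVM Radial Learner (same settings as above, only the caret method changes)
train_swag_svmr <- swag(
  x = x_train,
  y = y_train,
  control = swagcon,
  auto_control = FALSE,
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 1, allowParallel = FALSE),
  metric = "Accuracy",
  method = "svmRadial",
  preProcess = c("center", "scale")
)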
Let’s have a look at the typical output of a swag training object for the svmLinear learner:
train_swag_svml$CVs
#> [[1]]
#> [1] 0.14094276 0.06959836 0.07499399 0.15157407 0.10811688 0.08592593 0.11502886
#> [8] 0.12070707 0.22122896
#>
#> [[2]]
#> [1] 0.05107744 0.06225950 0.03852213 0.05492304 0.06030544 0.04377104
#> [7] 0.05108225 0.06212121 0.07485570 0.05491582
#>
#> [[3]]
#> [1] 0.04010101 0.04761063 0.03848846 0.04030784 0.04575758 0.04016835 0.03841991
#> [8] 0.04387205 0.05105099
#>
#> [[4]]
#> [1] 0.03464646 0.04572751 0.04030664 0.03852213
# A list which contains the cv training errors of each learner explored in a given dimension
train_swag_svml$VarMat
#> [[1]]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#> [1,] 1 2 3 4 5 6 7 8 9
#>
#> [[2]]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 2 2 2 2 3 3 3 5 5 6
#> [2,] 3 5 6 7 5 6 7 6 7 7
#>
#> [[3]]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#> [1,] 2 2 2 3 2 2 3 3 5
#> [2,] 3 3 6 6 3 5 5 5 6
#> [3,] 6 7 7 7 5 6 6 7 7
#>
#> [[4]]
#> [,1] [,2] [,3] [,4]
#> [1,] 2 2 2 3
#> [2,] 3 3 5 5
#> [3,] 6 5 6 6
#> [4,] 7 6 7 7
# A list which contains, for each dimension, a matrix with the attributes tested at that step
train_swag_svml$cv_alpha
#> [1] 0.11502886 0.05491943 0.04030784 0.03941438
# The cut-off cv training error, at each dimension, determined by the choice of alpha
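These three components are enough to recover, for instance, the overall best learner by hand (a minimal sketch, not a package function):
# locate the learner with the lowest CV error across all explored dimensions
best_dim <- which.min(vapply(train_swag_svml$CVs, min, numeric(1)))
best_col <- which.min(train_swag_svml$CVs[[best_dim]])
train_swag_svml$VarMat[[best_dim]][, best_col] # its attribute subset
For this run, the minimum CV error (0.0346) is attained at dimension 4, so the sketch returns the first column of the last VarMat matrix, i.e. attributes 2, 3, 6 and 7; this is the same subset reported as $models[[1]] by predict() below.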
The other two learners that we have carefully implemented in swag are the lasso (glmnet package required) and the random forest (randomForest package required). The training phase for these learners differs a little with respect to the svm one. We start by looking at the lasso:
## Lasso Learner
## `glmnet` is needed
## if not installed, install.packages("glmnet")
train_swag_lasso <- swag(
  # arguments for swag
  x = x_train,
  y = y_train,
  control = swagcon,
  auto_control = FALSE,
  # arguments for caret
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 1, allowParallel = FALSE), # trainControl is from the caret package
  metric = "Accuracy",
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0, .35, length.out = 10)),
  family = "binomial",
  # dynamically modify arguments for caret
  caret_args_dyn = function(list_arg, iter){
    if(iter == 1){
      list_arg$method = "glm"
      list_arg$tuneGrid = NULL
    }
    list_arg
  }
)
#> [1] "Dimension explored: 1 - CV errors at alpha: 0.1214"
#> [1] "Dimension explored: 2 - CV errors at alpha: 0.0616"
#> [1] "Dimension explored: 3 - CV errors at alpha: 0.0483"
#> [1] "Dimension explored: 4 - CV errors at alpha: 0.041"
The newly introduced argument caret_args_dyn enables the user to modify the hyper-parameters of a given learner dynamically, since they can change as the dimension \(\hat{p}\) grows up to the desired \(p_{max}\). In the case of the lasso, caret_args_dyn specifies that, when training unitary learners \(\mathcal{L}(\mathbf{y}, \mathbf{x}_{i})\) (i.e. learners with a unique attribute \(\mathbf{x}_{i}, \; \forall \, i \in \mathcal{S}\)), we use a logistic regression (i.e. an un-penalized learner). This modification is due to the implementation of the lasso in the glmnet package, which does not accept a single attribute (see this discussion). On the other hand, for the random forest case, we would ideally want to adapt the mtry hyper-parameter as the dimension grows. In the example below, we fix \(mtry = \sqrt{\hat{p}}\), as is usually done in practice.
## Random Forest Learner
## `randomForest` is needed
## if not installed, install.packages("randomForest")
train_swag_rf <- swag(
  # arguments for swag
  x = x_train,
  y = y_train,
  control = swagcon,
  auto_control = FALSE,
  # arguments for caret
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 1, allowParallel = FALSE), # trainControl is from the caret package
  metric = "Accuracy",
  method = "rf",
  # dynamically modify arguments for caret
  caret_args_dyn = function(list_arg, iter){
    list_arg$tuneGrid = expand.grid(.mtry = sqrt(iter))
    list_arg
  }
)
#> [1] "Dimension explored: 1 - CV errors at alpha: 0.0996"
#> [1] "Dimension explored: 2 - CV errors at alpha: 0.0534"
#> [1] "Dimension explored: 3 - CV errors at alpha: 0.0461"
#> [1] "Dimension explored: 4 - CV errors at alpha: 0.0425"
Indeed, a nice feature of swag, inherited from its building block caret, is that you can tailor the learning arguments of swag() as you like, for example by introducing grids for the hyper-parameters of a given learner or by updating these grids as the dimension increases (see the sketch below). This gives the user a wide range of possibilities and a lot of flexibility in the training phase.
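For instance, a user-supplied grid for the cost parameter of the linear SVM could be passed exactly as in caret::train() (a sketch, not run here; the grid values are arbitrary):
train_swag_svml_grid <- swag(
  x = x_train,
  y = y_train,
  control = swagcon,
  auto_control = FALSE,
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 1, allowParallel = FALSE),
  metric = "Accuracy",
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(C = c(0.1, 1, 10)) # grid over the margin parameter instead of the default C = 1
)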
To conclude this opening section, we present the usual predict() function, which can be applied to a trained swag object similarly to many other packages in R.
# best learner predictions
# if `newdata` is not specified, then predict gives predictions based on the training sample
sapply(predict(object = train_swag_svml), function(x) head(x))
#> $predictions
#> [,1]
#> [1,] 2
#> [2,] 1
#> [3,] 2
#> [4,] 1
#> [5,] 2
#> [6,] 1
#>
#> $models
#> $models[[1]]
#> [1] 2 3 6 7
# best learner predictions
best_pred <- predict(object = train_swag_svml,
                     newdata = x_test)
sapply(best_pred, function(x) head(x))
#> $predictions
#> [,1]
#> [1,] 1
#> [2,] 1
#> [3,] 1
#> [4,] 2
#> [5,] 1
#> [6,] 1
#>
#> $models
#> $models[[1]]
#> [1] 2 3 6 7
# predictions for a given dimension
dim_pred <- predict(
  object = train_swag_svml,
newdata = x_test,
type = "attribute",
attribute = 4L)
sapply(dim_pred,function(x) head(x))
#> $predictions
#> [,1] [,2] [,3] [,4]
#> [1,] 1 1 1 1
#> [2,] 1 1 1 1
#> [3,] 1 1 1 1
#> [4,] 2 2 2 2
#> [5,] 1 1 1 1
#> [6,] 1 1 1 1
#>
#> $models
#> $models[[1]]
#> [1] 2 3 6 7
#>
#> $models[[2]]
#> [1] 2 3 5 6
#>
#> $models[[3]]
#> [1] 2 5 6 7
#>
#> $models[[4]]
#> [1] 3 5 6 7
# predictions below a given CV error
cv_pred <- predict(
  object = train_swag_svml,
newdata = x_test,
type = "cv_performance",
cv_performance = 0.04)
sapply(cv_pred,function(x) head(x))
#> $predictions
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1 1 1 1 1
#> [2,] 1 1 1 1 1
#> [3,] 1 1 1 1 1
#> [4,] 2 2 2 2 2
#> [5,] 1 1 1 1 1
#> [6,] 1 1 1 1 1
#>
#> $models
#> $models[[1]]
#> [1] 2 6
#>
#> $models[[2]]
#> [1] 2 6 7
#>
#> $models[[3]]
#> [1] 3 5 6
#>
#> $models[[4]]
#> [1] 2 3 6 7
#>
#> $models[[5]]
#> [1] 3 5 6 7
Now we can, for example, evaluate the performance of the best learner selected by swag thanks to the confusionMatrix() function of caret.
# transform predictions into a factor with the levels of `y_test`
best_learn <- factor(levels(y_test)[best_pred$predictions])
confusionMatrix(best_learn, y_test) # from caret package
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction benign malignant
#> benign 87 4
#> malignant 3 42
#>
#> Accuracy : 0.9485
#> 95% CI : (0.8968, 0.9791)
#> No Information Rate : 0.6618
#> P-Value [Acc > NIR] : 6.123e-16
#>
#> Kappa : 0.8844
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9667
#> Specificity : 0.9130
#> Pos Pred Value : 0.9560
#> Neg Pred Value : 0.9333
#> Prevalence : 0.6618
#> Detection Rate : 0.6397
#> Detection Prevalence : 0.6691
#> Balanced Accuracy : 0.9399
#>
#> 'Positive' Class : benign
#>
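As a final check, since dim_pred stores one column of predictions per selected 4-dimensional learner, their individual test accuracies can be compared directly (a minimal sketch in base R):
# test accuracy of each 4-dimensional learner kept by swag
apply(dim_pred$predictions, 2,
      function(pr) mean(levels(y_test)[pr] == y_test))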