The FSinR package contains functions to perform the feature selection process. More specifically, it implements a large number of filter and wrapper methods widely used in the literature, which are combined with search algorithms in order to obtain an optimal subset of features. FSinR uses the functions for training classification and regression models available in the R caret package to generate the wrapper measures, which gives the package access to a broad range of methods and functionality. In addition, the package has been implemented so that its use is as easy and intuitive as possible, which is why it provides a number of high-level functions from which any search algorithm or filter/wrapper method can be called.
The way to install the package from the CRAN repository is as follows:
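# install the released version from CRAN
install.packages("FSinR")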
As mentioned above, the feature selection process aims to obtain an optimal subset of features. To do this, a search algorithm is combined with a filter/wrapper method. The search algorithm guides the process through the feature space according to the results that the filter/wrapper method returns for the evaluated subsets. The optimal subset of features obtained in this process is very useful for generating simpler models, which require fewer computational resources and often show better final performance.
The feature selection process is carried out with the featureSelection function. This is the main function of the package and its parameters are:
This function mainly returns the best subset of features found and the measure value of that subset. In addition, it returns another group of variables, such as the execution time.
The search functions present in the package are the following:
The package contains a function, searchAlgorithm, which allows you to select the search algorithm to be used in the feature selection process. The function consists of the following parameters:
The result of the call to this function is another function to be used in the main function as a search algorithm.
The filter methods implemented in the package are:
The filterEvaluator function allows you to select a filter method from the above. The function has the following parameters:
The result of the function call is a function that is used as a filter evaluation method in the main function, featureSelection.
The FSinR package makes it possible to use the 238 models available in the caret package as wrapper methods. The complete list of caret models can be found here. In addition, the caret package offers the possibility of setting a group of options to customize the models (e.g. resampling techniques, evaluation measure, grid of tuning parameters, …) using the trainControl and train functions. In FSinR, the wrapperEvaluator function is used to set all these parameters and to generate the wrapper model using the caret functions under the hood. The wrapperEvaluator function has the following parameters:
- learner: the name of the caret model to be used as the wrapper method
- resamplingParams: a list of control parameters, the same arguments as those of the caret trainControl function
- fittingParams: a list of parameters for fitting the model, the same arguments as those of the caret train function (x, y, method and trainControl are not necessary)

The result of this function is another function that is used as a wrapper evaluation measure in the main function.
To demonstrate in a simple manner how the package works, the iris dataset will be used in this example.
It is important to note that the FSinR package does not split the data into training and test sets. Instead, it applies the feature selection process to the entire dataset passed to it as a parameter. Therefore, in a modeling process the user should split the dataset before performing the feature selection. Missing data must also be processed prior to the use of the package. Since the main purpose of this vignette is to illustrate the use of the package and not the entire modeling process, in this case the entire unpartitioned dataset will be used.
The iris dataset consists of 150 instances of 4 variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) that determine the type of iris plant. The target variable, Species, has 3 possible classes (setosa, versicolor, virginica).
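The examples that follow assume that the packages and the data have been loaded, for instance:

library(caret)
library(FSinR)

data(iris)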
Next we will illustrate an example of feature selection composed of a search algorithm and a wrapper method.
First, the wrapper method must be generated. To do this, you have to call the wrapperEvaluator function and determine which model you want to use as the wrapper method. Since the iris problem is a classification problem, in this example we use a knn model.
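For example, keeping the default caret settings for the model:

evaluator <- wrapperEvaluator("knn")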
The next step is to generate the search algorithm. This is done by calling the searchAlgorithm function and specifying the algorithm you want to use. In this case we use a sequential search, sequentialForwardSelection, as the search algorithm.
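For example:

searcher <- searchAlgorithm("sequentialForwardSelection")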
Once we have generated the search algorithm and the wrapper method we call the main function that performs the feature selection process.
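The call takes the dataset, the name of the target variable, the search algorithm and the evaluator, following the same pattern as the other main functions shown later in this vignette:

results <- featureSelection(iris, 'Species', searcher, evaluator)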
The results show the best subset of features found and its evaluation.
results$bestFeatures
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] 0 0 0 1
results$bestValue
#> [1] 0.9575654
The above example illustrates a simple use of the package, but it can also become more complex if you want to set certain parameters. Regarding the wrapper method, it is possible to tune the model parameters. First, you can set the resampling parameters, which are the same arguments that are passed to the trainControl function of the caret package. Secondly, you can set the fitting parameters, which are the same as those of the caret train function.
resamplingParams <- list(method = "cv", number = 10)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))
evaluator <- wrapperEvaluator("knn", resamplingParams, fittingParams)
For more details on the way in which caret trains and tunes a model, see here. Note that the FSinR package automatically detects, depending on the metric, whether the objective of the problem is to maximize or minimize it.
As for the search algorithm, it is also possible to set a certain number of parameters, which are specific to each search algorithm. If we now use a tabu search, we can set, among other things, the number of iterations of the algorithm, the size of the tabu list, and an intensification phase and a diversification phase.
searcher <- searchAlgorithm('tabu', list(tamTabuList = 4, iter = 5, intensification = 2, iterIntensification = 5, diversification = 1, iterDiversification = 5, verbose = FALSE))
Most of the search algorithms in the FSinR package include a parameter called verbose which, if set to TRUE, prints the progress and details of the algorithm's iterations to the console. For the tabu search it is important to note that the number of neighbors considered and evaluated in each iteration of the algorithm, numNeigh, defaults to all possible neighbors, and that a high value of this parameter considerably increases the computation time.
Finally, the feature selection process is run again.
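For example, reusing the knn evaluator and the tabu searcher defined above:

results <- featureSelection(iris, 'Species', searcher, evaluator)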
This example is very similar to the previous one. The main difference is that a filter method is generated as the evaluator instead of a wrapper method. This is done with the filterEvaluator function.
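A sketch of this step, assuming the ReliefFeatureSetMeasure filter (any of the package's set-evaluation filter measures could be chosen here):

filter_evaluator <- filterEvaluator('ReliefFeatureSetMeasure')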
The search algorithm is then generated:
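Any of the package's search algorithms can be used; for instance, a sequential forward search again:

searcher <- searchAlgorithm('sequentialForwardSelection')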
And finally the feature selection process is performed, passing the generated filter method and search algorithm to the featureSelection function.
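For example, with the evaluator and searcher generated above:

results <- featureSelection(iris, 'Species', searcher, filter_evaluator)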
Again, the best subset obtained and the value obtained by the evaluation measure are shown as results.
results$bestFeatures
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] 0 1 0 0
results$bestValue
#> [1] 98.96749
As in the previous example, you can set the parameters of both the search algorithm and the filter method, if the functions accept them.
Advanced use of the package allows the individual evaluation of a set of features using the filter/wrapper methods. That is, the filter/wrapper methods can be used directly, without including them in a search algorithm. This is not a high-level functionality defined within the package, since an individual evaluation is not a feature selection process in itself, but it is functionality that is available and worth taking into account.
To do this, an evaluator must be generated using a filter method or a wrapper method.
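A minimal sketch of this step; the specific measures chosen here ('IEConsistency' for the filter and 'knn' for the wrapper) are only assumptions for illustration:

filter_evaluator <- filterEvaluator('IEConsistency')
wrapper_evaluator <- wrapperEvaluator('knn')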
A measure is then obtained from the evaluator by passing it the dataset, the name of the dependent variable and the set of features to be evaluated.
resultFilter <- filter_evaluator(iris, 'Species', c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
#> Warning in filter_evaluator(iris, "Species", c("Sepal.Length", "Sepal.Width", :
#> The data seems not to be discrete, as it should be
resultFilter
#> [1] 1
resultWrapper <- wrapper_evaluator(iris, 'Species', c("Petal.Length", "Petal.Width"))
resultWrapper
#> [1] 0.9548126
In this case, the feature selection process consists of a direct search algorithm (cut-off method) combined with an evaluator based on filter or wrapper methods.
The direct feature selection process is performed using the directFeatureSelection function. This function has the same parameters as the featureSelection function, with the difference that the searcher parameter is replaced by the directSearcher parameter, which contains a direct search algorithm.
The function mainly returns the selected subset of features, and the value of the individual evaluation of each of these features.
The direct search functions present in the package are the following:
The package contains the directSearchAlgorithm function, which allows you to select the direct search algorithm to be used. Its parameters are the same as those of the searchAlgorithm function, except that the searcher parameter is replaced by the directSearcher parameter, which contains the name of the chosen cut-off method.
The result of the call to this function is another function to be used in the main function as a direct search algorithm.
In the previous examples we showed the functionality of the package on a classification problem. In this example we show a regression example. For this we use the mtcars dataset, composed of 32 instances of 10 predictor variables. The regression problem consists of forecasting the value of the mpg variable.
Compared to the process in the previous example, the main change when performing a direct feature selection process is that a direct search algorithm has to be generated, and the function that performs the feature selection process changes to directFeatureSelection. The generation of the filter or wrapper methods remains the same.
library(caret)
library(FSinR)
data(mtcars)
evaluator <- filterEvaluator('determinationCoefficient')
directSearcher <- directSearchAlgorithm('selectKBest', list(k=3))
results <- directFeatureSelection(mtcars, 'mpg', directSearcher, evaluator)
results$bestFeatures
#> cyl disp hp drat wt qsec vs am gear carb
#> [1,] 1 1 0 0 1 0 0 0 0 0
results$featuresSelected
#> [1] "wt" "cyl" "disp"
results$valuePerFeature
#> [1] 0.7528328 0.7261800 0.7183433
The hybrid feature selection process is composed of a hybrid search algorithm that combines the measures obtained from two evaluators. This process is performed with the hybridFeatureSelection function, whose parameters are as follows:
The function again returns, among other values, the best subset of features found and its measure.
The hybrid search function present in the package is the following:
The package contains the hybridSearchAlgorithm function, which allows you to select this hybrid search function. Its parameters are the same as those of the searchAlgorithm and directSearchAlgorithm functions, except that the searcher (or directSearcher) parameter is replaced by the hybridSearcher parameter, which contains the name of the chosen hybrid method.
The result of the call to this function is another function to be used in the main function as a hybrid search algorithm.
In this example we continue with the previous regression example.
The hybrid feature selection process is slightly different from the above, because in addition to generating a hybrid search algorithm, two evaluators must be generated. The package implements only one hybrid search algorithm, LCC. This algorithm requires a first evaluator to evaluate the features individually, and a second evaluator to evaluate the features jointly as a set.
The package implements different filter/wrapper methods focused either on individual feature evaluation or on set evaluation. Set evaluation methods can also evaluate features individually, but individual evaluation methods (Cramer, Chi squared, F-Score and Relief) cannot evaluate sets of features. Therefore, to perform a hybrid feature selection process, a hybrid search algorithm has to be generated together with an individual feature evaluator (an individual or set evaluation method) and a set evaluator (a set evaluation method).
library(caret)
library(FSinR)
data(mtcars)
evaluator_1 <- filterEvaluator('determinationCoefficient')
evaluator_2 <- filterEvaluator('ReliefFeatureSetMeasure')
hybridSearcher <- hybridSearchAlgorithm('LCC')
results <- hybridFeatureSelection(mtcars, 'mpg', hybridSearcher, evaluator_1, evaluator_2)
results$bestFeatures
#> cyl disp hp drat wt qsec vs am gear carb
#> [1,] 1 1 1 1 1 1 1 1 1 1
results$bestValue
#> [1] 0.0171875