Extracting meta-features by group

To customize the measure extraction, it is necessary to use the specific method for each group of measures. For instance, infotheo and statistical compute the information theoretical and the statistical measures, respectively. The following examples illustrate these cases:

## Extract two information theoretical measures
inf.iris <- infotheo(Species ~ ., iris, 
                     features=c("attrEnt", "jointEnt"))

## Extract three statistical measures
stat.iris <- statistical(Species ~ ., iris, 
                         features=c("canCor", "cor", "iqRange"))

## Extract the histogram for the correlation measure
hist.iris <- statistical(Species ~ ., iris, 
                            features="cor", summary="hist")

Unlike the metafeatures method, these methods receive a parameter called features, which defines the required measures, and return a list instead of a numeric vector. In addition, some groups can be customized using additional arguments.
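Because each group function returns a named list, it can be convenient to flatten the result into a single named numeric vector. The sketch below does this in base R. The list is built by hand from the cor and skewness values reported in the Statistical section, so it only mimics the shape of the output and does not call the mfe package; the names stat.list and stat.vec are illustrative.

```r
# Hand-built stand-in for the list returned by a group function
# (values copied from the statistical example in this document)
stat.list <- list(
  cor      = c(mean = 0.5941160, sd = 0.3375443),
  skewness = c(mean = 0.06273198, sd = 0.29439896)
)

# unlist() joins the list names and the element names with a dot
stat.vec <- unlist(stat.list)
names(stat.vec)
```

The resulting names combine the measure and the summary function (for example cor.mean), which keeps every value identifiable after flattening.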

The measures are organized into groups. The main ones are general information about the dataset, statistical information, information theoretical descriptors, measures that characterize a decision tree (DT) model induced from the data, and landmarks, which represent the performance of simple algorithms applied to the dataset. The following example shows all the available groups:

## Show the available groups
ls.metafeatures()

##  [1] "general"     "statistical" "infotheo"    "model.based" "landmarking"
##  [6] "relative"    "clustering"  "complexity"  "concept"     "itemset"

General

These are the simplest measures, which extract general properties of the datasets. For instance, nrAttr and nrClass are the total number of attributes in the dataset and the number of output values (classes) in the dataset, respectively. To list the measures of this group use ls.general(). The following examples illustrate these measures:

## Show the available general measures
ls.general()

##  [1] "attrToInst" "catToNum"   "freqClass"  "instToAttr" "nrAttr"    
##  [6] "nrBin"      "nrCat"      "nrClass"    "nrInst"     "nrNum"     
## [11] "numToCat"

## Extract all general measures
general.iris <- general(Species ~ ., iris)

## Extract two general measures
general(Species ~ ., iris, features=c("nrAttr", "nrClass"))

## $nrAttr
## [1] 4
## 
## $nrClass
## [1] 3

The general measures return a list named by the requested measures. The post.processing methods are applied only to the freqClass meta-feature. For instance, to extract the minimum, the maximum and the standard deviation of the class proportions use:

## Extract the class frequencies summarized by min, max and sd
general(Species ~ ., iris, features="freqClass", summary=c("min", "max", "sd"))

## $freqClass
##       min       max        sd 
## 0.3333333 0.3333333 0.0000000
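The values above can be checked by hand. The following base R sketch recomputes nrAttr, nrClass and the summarized freqClass for iris without calling mfe; it assumes that nrAttr counts only the predictive attributes and that freqClass is the proportion of instances in each class, which is consistent with the outputs shown but is not taken from the package source.

```r
# Predictive attributes are all columns except the class
nr.attr  <- ncol(iris) - 1
nr.class <- nlevels(iris$Species)

# Proportion of instances in each class, then summarized
class.prop   <- as.numeric(table(iris$Species)) / nrow(iris)
freq.summary <- c(min = min(class.prop),
                  max = max(class.prop),
                  sd  = sd(class.prop))
freq.summary
```

Since iris has 50 instances in each of its three classes, every proportion is 1/3 and the standard deviation is zero.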

Statistical

Statistical meta-features are the standard statistical measures that describe the numerical properties of a data distribution. As they require only numerical attributes, the categorical data are transformed to numerical. For instance, cor and skewness are the absolute correlation between each pair of attributes and the skewness of the numeric attributes in the dataset, respectively. To list the measures of this group use ls.statistical(). The following examples illustrate these measures:

## Show the available statistical measures
ls.statistical()

##  [1] "canCor"      "gravity"     "cor"         "cov"         "nrDisc"     
##  [6] "eigenvalues" "gMean"       "hMean"       "iqRange"     "kurtosis"   
## [11] "mad"         "max"         "mean"        "median"      "min"        
## [16] "nrCorAttr"   "nrNorm"      "nrOutliers"  "range"       "sd"         
## [21] "sdRatio"     "skewness"    "sparsity"    "tMean"       "var"        
## [26] "wLambda"

## Extract all statistical measures
stat.iris <- statistical(Species ~ ., iris)

## Extract two statistical measures
statistical(Species ~ ., iris, features=c("cor", "skewness"))

## $cor
##      mean        sd 
## 0.5941160 0.3375443 
## 
## $skewness
##       mean         sd 
## 0.06273198 0.29439896
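To make the cor measure concrete, the sketch below recomputes it in base R: the absolute Pearson correlation of each pair of numeric attributes, summarized by mean and sd. It assumes the default method of stats::cor and one value per attribute pair (the upper triangle of the correlation matrix); the variable names are illustrative.

```r
# Absolute correlation between each pair of numeric attributes
num.attr <- iris[, 1:4]
cor.mat  <- abs(cor(num.attr))

# Keep one value per pair (upper triangle, diagonal excluded)
cor.vals <- cor.mat[upper.tri(cor.mat)]

# Default summarization: mean and standard deviation
cor.summary <- c(mean = mean(cor.vals), sd = sd(cor.vals))
cor.summary
```

With four attributes there are six pairs, and the summary agrees with the $cor output shown above.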

The statistical group accepts two additional parameters, by.class and transform. For the former, the default is by.class=FALSE, which means that the meta-features are computed without considering the class values; otherwise, the measures are extracted from the instances of each class separately. For the latter, the default is transform=TRUE, which means that categorical attributes are transformed to numeric. The following examples show the use of these two parameters:

## Extract correlation using instances by classes
statistical(Species ~ ., iris, features="cor", by.class=TRUE)

## $cor
##      mean        sd 
## 0.4850530 0.2124471

## Ignore the categorical attributes (no transformation)
aux <- cbind(class=iris$Species, iris)
statistical(Species ~ ., aux, transform=FALSE)

## $canCor
##      mean        sd 
## 0.7280090 0.3631869 
## 
## $gravity
## [1] 3.208281
## 
## $cor
##      mean        sd 
## 0.5941160 0.3375443 
## 
## $cov
##      mean        sd 
## 0.5966542 0.5582672 
## 
## $nrDisc
## [1] 2
## 
## $eigenvalues
##     mean       sd 
## 1.143239 2.058771 
## 
## $gMean
##     mean       sd 
## 3.223073 2.022943 
## 
## $hMean
##     mean       sd 
## 2.978389 2.145948 
## 
## $iqRange
##     mean       sd 
## 1.700000 1.275408 
## 
## $kurtosis
##       mean         sd 
## -0.8105361  0.7326910 
## 
## $mad
##      mean        sd 
## 1.0934175 0.5785782 
## 
## $max
##     mean       sd 
## 5.425000 2.443188 
## 
## $mean
##     mean       sd 
## 3.464500 1.918485 
## 
## $median
##     mean       sd 
## 3.612500 1.919364 
## 
## $min
##     mean       sd 
## 1.850000 1.808314 
## 
## $nrCorAttr
## [1] 0.5
## 
## $nrNorm
## [1] 1
## 
## $nrOutliers
## [1] 1
## 
## $range
##  mean    sd 
## 3.575 1.650 
## 
## $sd
##      mean        sd 
## 0.9478671 0.5712994 
## 
## $sdRatio
## [1] 1.277229
## 
## $skewness
##       mean         sd 
## 0.06273198 0.29439896 
## 
## $sparsity
##       mean         sd 
## 0.08874363 0.13456821 
## 
## $tMean
##     mean       sd 
## 3.470556 1.904802 
## 
## $var
##     mean       sd 
## 1.143239 1.332546 
## 
## $wLambda
## [1] 0.02343863

Note that in the first example the values and the cardinality of the measure are different, since the correlations between the attributes were computed using the instances of each class separately. The post.processing methods are applied to these measures since they return multiple values. To define which of them should be applied, use the summary parameter, as detailed in the section Post Processing Methods.
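The change in cardinality with by.class=TRUE can be sketched in base R. The code below splits the numeric attributes by class, computes the absolute pairwise correlations within each class and pools them. This is an assumption about how the per-class values are combined, not the package implementation, and abs.cor is a helper introduced only for this illustration.

```r
# Absolute pairwise correlations (upper triangle) of a numeric data frame
abs.cor <- function(data) {
  m <- abs(cor(data))
  m[upper.tri(m)]
}

# One set of six correlations per class, pooled into a single vector
by.class.vals <- unlist(lapply(split(iris[, 1:4], iris$Species), abs.cor))

length(by.class.vals)
c(mean = mean(by.class.vals), sd = sd(by.class.vals))
```

Three classes with six attribute pairs each give 18 values instead of 6, which is why both the values and the number of values differ from the by.class=FALSE case.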

Information Theoretical

Information theoretical meta-features are particularly appropriate to describe discrete (categorical) attributes, but they also fit continuous ones by means of a discretization process. These measures are based on information theory. For instance, classEnt and mutInf are the entropy of the class and the mutual information shared between each attribute and the class in the dataset, respectively. To list the measures of this group use ls.infotheo(). The following examples illustrate these measures:

## Show the available information theoretical measures
ls.infotheo()

## [1] "attrConc"  "attrEnt"   "classConc" "classEnt"  "eqNumAttr" "jointEnt" 
## [7] "mutInf"    "nsRatio"

## Extract all information theoretical measures
inf.iris <- infotheo(Species ~ ., iris)

## Extract two information theoretical measures
infotheo(Species ~ ., iris, features=c("classEnt", "mutInf"))

## $classEnt
## [1] 1.584963
## 
## $mutInf
##      mean        sd 
## 0.8439342 0.4222026

The information theoretical group accepts one additional parameter called transform. With the default value transform=TRUE, the continuous attributes are discretized. The following example shows the use of this parameter:

## Ignore the discretization process
aux <- cbind(class=iris$Species, iris)
infotheo(Species ~ ., aux, transform=FALSE)

## $attrConc
## mean   sd 
##   NA   NA 
## 
## $attrEnt
##     mean       sd 
## 1.584963       NA 
## 
## $classConc
## mean   sd 
##    1   NA 
## 
## $classEnt
## [1] 1.584963
## 
## $eqNumAttr
## [1] 1
## 
## $jointEnt
##     mean       sd 
## 1.584963       NA 
## 
## $mutInf
##     mean       sd 
## 1.584963       NA 
## 
## $nsRatio
## [1] 0

The information theoretical measures return a list named by the requested measures. The post.processing methods are applied to some of these measures since they return multiple values. To define which of them should be applied, use the summary parameter, as detailed in the section Post Processing Methods.
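As a rough illustration of the quantities behind this group, the sketch below computes the Shannon entropy of the class and the mutual information between one discretized attribute and the class in base R. The entropy helper and the equal-width discretization with cut(..., breaks = 3) are assumptions made for this example; the discretization used by the package may differ, so the mutual information value is not expected to match the package output.

```r
# Shannon entropy (in bits) of a discrete variable
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]            # ignore empty levels
  -sum(p * log2(p))
}

# Entropy of the class: three balanced classes give log2(3)
class.ent <- entropy(iris$Species)

# Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y),
# using a hypothetical equal-width discretization of one attribute
attr.disc <- cut(iris$Petal.Length, breaks = 3)
joint.ent <- entropy(paste(attr.disc, iris$Species))
mut.inf   <- entropy(attr.disc) + class.ent - joint.ent

c(classEnt = class.ent, mutInf = mut.inf)
```

The class entropy equals log2(3), about 1.584963, which agrees with the classEnt value shown above.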

Model based

These measures describe characteristics of the investigated models. These meta-features can include, for example, the description of the decision tree (DT) induced for a dataset, like its number of leaves (leaves) and the number of nodes (nodes) of the tree. The following examples illustrate these measures:

## Show the available model.based measures
ls.model.based()

##  [1] "leaves"         "leavesBranch"   "leavesCorrob"   "leavesHomo"    
##  [5] "leavesPerClass" "nodes"          "nodesPerAttr"   "nodesPerInst"  
##  [9] "nodesPerLevel"  "nodesRepeated"  "treeDepth"      "treeImbalance" 
## [13] "treeShape"      "varImportance"

## Extract all model.based measures
model.iris <- model.based(Species ~ ., iris)

## Extract two model.based measures
model.based(Species ~ ., iris, features=c("leaves", "nodes"))

## $leaves
## [1] 9
## 
## $nodes
## [1] 8

The DT model based measures return a list named by the requested measures. The post.processing methods are applied to the measures that return multiple values. To define which of them should be applied, use the summary parameter, as detailed in the section Post Processing Methods.

Landmarking

Landmarking measures are simple and fast algorithms, from which performance characteristics can be extracted. These measures include the performance of simple and efficient learning algorithms like Naive Bayes (naiveBayes) and 1-Nearest Neighbor (oneNN). The following examples illustrate these measures:

## Show the available landmarking measures
ls.landmarking()

## [1] "bestNode"    "eliteNN"     "linearDiscr" "naiveBayes"  "oneNN"      
## [6] "randomNode"  "worstNode"

## Extract all landmarking measures
land.iris <- landmarking(Species ~ ., iris)

## Extract two landmarking measures
landmarking(Species ~ ., iris, features=c("naiveBayes", "oneNN"))

## $naiveBayes
##       mean         sd 
## 0.95333333 0.05488484 
## 
## $oneNN
##     mean       sd 
## 0.940000 0.073367

Extracting the performance of these measures without a cross-validation step can cause the models to overfit the data. Therefore, the landmarking function has the parameter folds to define the number of folds of the k-fold cross-validation and the parameter score to select the performance measure. The following examples show how to set these values:

## Extract one landmarking measure with folds=2
landmarking(Species ~ ., iris, features="naiveBayes", folds=2)

## $naiveBayes
## mean   sd 
## 0.96 0.00

## Extract one landmarking measure with score="kappa"
landmarking(Species ~ ., iris, features="naiveBayes", score="kappa")

## $naiveBayes
##       mean         sd 
## 0.93693434 0.07468659

The landmarking measures return a list named by the requested measures. The post.processing methods are applied to these measures since they return multiple values (one per fold). To define which of them should be applied, use the summary parameter, as detailed in the section Post Processing Methods.
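The idea of a landmarker evaluated with k-fold cross-validation can be sketched in base R. The code below implements a small 1-nearest-neighbor classifier and reports the mean and sd of its accuracy over the folds. The helper one.nn, the random fold assignment and the seed are assumptions made for this illustration; the package may split the data differently, so the numbers are not expected to match the oneNN output above.

```r
set.seed(1)  # arbitrary seed, only to make the fold assignment repeatable

x <- as.matrix(iris[, 1:4])
y <- iris$Species
k <- 10

# Randomly assign each instance to one of k folds of (almost) equal size
fold.id <- sample(rep(seq_len(k), length.out = nrow(x)))

# 1-NN: label each test row with the class of its closest training row
one.nn <- function(train.x, train.y, test.x) {
  # columns of d hold squared Euclidean distances for one test row
  d <- apply(test.x, 1, function(row) colSums((t(train.x) - row)^2))
  train.y[apply(d, 2, which.min)]
}

# Accuracy on each held-out fold
fold.acc <- sapply(seq_len(k), function(i) {
  test <- fold.id == i
  pred <- one.nn(x[!test, , drop = FALSE], y[!test], x[test, , drop = FALSE])
  mean(pred == y[test])
})

# One value per fold, summarized by mean and sd
c(mean = mean(fold.acc), sd = sd(fold.acc))
```

Each fold yields one performance value, which is why the landmarking measures need a summarization step.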

Relative

The relative group is the landmarking group combined with sampling and ranking strategies. The sampling strategy decreases the computational cost of the landmarking by selecting a subsample of the original examples. The ranking strategy captures relative information between the performances of the algorithms. The following examples illustrate these measures:

## Show the available relative measures
ls.relative()

## [1] "bestNode"    "eliteNN"     "linearDiscr" "naiveBayes"  "oneNN"      
## [6] "randomNode"  "worstNode"

## Extract all relative measures
rel.iris <- relative(Species ~ ., iris)

## Extract all relative measures with half of the samples
relative(Species ~ ., iris, size=0.5)

## $bestNode
##          mean sd
## bestNode    3  6
## 
## $eliteNN
##         mean  sd
## eliteNN  4.5 4.5
## 
## $linearDiscr
##             mean  sd
## linearDiscr  6.5 1.5
## 
## $naiveBayes
##            mean  sd
## naiveBayes  6.5 1.5
## 
## $oneNN
##       mean  sd
## oneNN  4.5 4.5
## 
## $randomNode
##            mean sd
## randomNode    2  7
## 
## $worstNode
##           mean sd
## worstNode    1  3

## Extract two relative measures
relative(Species ~ ., iris, features=c("naiveBayes", "oneNN"))

## $naiveBayes
##            mean sd
## naiveBayes    2  1
## 
## $oneNN
##       mean sd
## oneNN    1  2
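The ranking strategy can be illustrated with base R's rank(). The sketch below ranks the mean and sd values reported for naiveBayes and oneNN in the Landmarking section. Treating the relative measures as ranks of the summarized landmarking values across the requested algorithms is an assumption; it is consistent with the two-measure output above but is not confirmed by the package source.

```r
# Summarized landmarking values copied from the Landmarking section
land.mean <- c(naiveBayes = 0.95333333, oneNN = 0.940000)
land.sd   <- c(naiveBayes = 0.05488484, oneNN = 0.073367)

# Rank the algorithms within each summary (1 = smallest value)
rel <- cbind(mean = rank(land.mean), sd = rank(land.sd))
rel
```

Under this reading, naiveBayes receives rank 2 for the mean because its accuracy is higher, and rank 1 for the sd because its accuracy varies less across folds.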

Clustering

Clustering measures extract information about the dataset based on external validation indexes. The main idea is to measure the complexity of the dataset using indexes able to check information about the predictive attributes and the label. The following examples illustrate these measures:

## Show the available clustering measures
ls.clustering()

## [1] "vdu" "vdb" "int" "sil" "pb"  "ch"  "nre" "sc"

## Extract all clustering measures
clus.iris <- clustering(Species ~ ., iris)

## Extract two clustering measures
clustering(Species ~ ., iris, features=c("vdu", "vdb"))

## $vdu
## [1] 0.05848053
## 
## $vdb
## [1] 0.7513707

Post Processing Methods

Several meta-features generate multiple values, and mean and sd are the standard methods to summarize them. In order to increase flexibility, the mfe package implements post processing methods to deal with measures that return multiple values. These methods can be descriptive statistics (resulting in a single value) or a distribution (resulting in multiple values).

The post processing methods are set using the parameter summary. It is possible to compute min, max, mean, median, kurtosis, standard deviation, among others. Any R method can be used, as illustrated in the following examples:

## Apply several statistical measures as post processing
statistical(Species ~ ., iris, "cor", 
               summary=c("kurtosis", "max", "mean", "median", "min", "sd", 
                         "skewness", "var"))

## $cor
##   kurtosis        max       mean     median        min         sd   skewness 
## -1.9476130  0.9628654  0.5941160  0.6231906  0.1175698  0.3375443 -0.1814291 
##        var 
##  0.1139362

## Apply quantile as post processing method
statistical(Species ~ ., iris, "cor", summary="quantile")

## $cor
##   quantile.0%  quantile.25%  quantile.50%  quantile.75% quantile.100% 
##     0.1175698     0.3817045     0.6231906     0.8583006     0.9628654

## Get the raw values without summarizing them
statistical(Species ~ ., iris, "cor", summary=c())

## $cor
## non.aggregated1 non.aggregated2 non.aggregated3 non.aggregated4 non.aggregated5 
##       0.1175698       0.8717538       0.4284401       0.8179411       0.3661259 
## non.aggregated6 
##       0.9628654
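The way a vector of function names turns into named summary values can be sketched in base R. The helper summarize below is hypothetical and is not the package's post.processing function; it only shows the idea of looking up each name with match.fun() and applying it to the raw values, here the six absolute correlations of iris.

```r
# Raw values of the cor measure: one absolute correlation per attribute pair
cor.mat  <- abs(cor(iris[, 1:4]))
cor.vals <- cor.mat[upper.tri(cor.mat)]

# Hypothetical helper: apply each function named in 'summary' to the values
summarize <- function(values, summary) {
  names(summary) <- summary
  unlist(lapply(summary, function(f) match.fun(f)(values)))
}

res <- summarize(cor.vals, c("min", "max", "mean", "sd", "quantile"))
res
```

Functions that return a single value keep their own name, while functions that return several values, such as quantile, produce names like quantile.50%, in the same style as the outputs above.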

Beyond these default R methods, two additional post processing methods are available in the mfe package: hist and non.aggregated. The first computes a histogram of the values and returns the frequency of each bin. The extra parameter bins can be used to define the number of values to be returned, and the parameters min and max define the range of the data. The second returns all the values of the measure and has the same effect as using an empty list. The following code illustrates the use of these post processing methods:

## Apply histogram as post processing method
statistical(Species ~ ., iris, "cor", summary="hist")

## $cor
##  hist.breaks1   "0"
##  hist.breaks2   "0.2"
##  hist.breaks3   "0.4"
##  hist.breaks4   "0.6"
##  hist.breaks5   "0.8"
##  hist.breaks6   "1"
##  hist.counts1   "1"
##  hist.counts2   "1"
##  hist.counts3   "1"
##  hist.counts4   "0"
##  hist.counts5   "3"
##  hist.density1  "0.833333333333333"
##  hist.density2  "0.833333333333333"
##  hist.density3  "0.833333333333333"
##  hist.density4  "0"
##  hist.density5  "2.5"
##  hist.mids1     "0.1"
##  hist.mids2     "0.3"
##  hist.mids3     "0.5"
##  hist.mids4     "0.7"
##  hist.mids5     "0.9"
##  hist.xname     "c(0.117569784133002, 0.871753775886583, 0.42844010433054, 0.817941126271576, 0.366125932536439, 0.962865431402796)"
##  hist.equidist  "TRUE"

## Apply histogram as post processing method and customize it
statistical(Species ~ ., iris, "cor", summary="hist", bins=5, min=0, max=1)

## Warning in plot.window(xlim, ylim, "", ...): "bins" is not a graphical parameter

## Warning in plot.window(xlim, ylim, "", ...): "min" is not a graphical parameter

## Warning in plot.window(xlim, ylim, "", ...): "max" is not a graphical parameter

## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "bins"
## is not a graphical parameter

## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "min"
## is not a graphical parameter

## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "max"
## is not a graphical parameter

## Warning in axis(1, ...): "bins" is not a graphical parameter

## Warning in axis(1, ...): "min" is not a graphical parameter

## Warning in axis(1, ...): "max" is not a graphical parameter

## Warning in axis(2, ...): "bins" is not a graphical parameter

## Warning in axis(2, ...): "min" is not a graphical parameter

## Warning in axis(2, ...): "max" is not a graphical parameter

## $cor
##  hist.breaks1   "0"
##  hist.breaks2   "0.2"
##  hist.breaks3   "0.4"
##  hist.breaks4   "0.6"
##  hist.breaks5   "0.8"
##  hist.breaks6   "1"
##  hist.counts1   "1"
##  hist.counts2   "1"
##  hist.counts3   "1"
##  hist.counts4   "0"
##  hist.counts5   "3"
##  hist.density1  "0.833333333333333"
##  hist.density2  "0.833333333333333"
##  hist.density3  "0.833333333333333"
##  hist.density4  "0"
##  hist.density5  "2.5"
##  hist.mids1     "0.1"
##  hist.mids2     "0.3"
##  hist.mids3     "0.5"
##  hist.mids4     "0.7"
##  hist.mids5     "0.9"
##  hist.xname     "c(0.117569784133002, 0.871753775886583, 0.42844010433054, 0.817941126271576, 0.366125932536439, 0.962865431402796)"
##  hist.equidist  "TRUE"

## Extract all correlation values
statistical(Species ~ ., iris, "cor", summary="non.aggregated")

## $cor
## non.aggregated1 non.aggregated2 non.aggregated3 non.aggregated4 non.aggregated5 
##       0.1175698       0.8717538       0.4284401       0.8179411       0.3661259 
## non.aggregated6 
##       0.9628654
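The description of the hist method (frequencies per bin, with bins, min and max controlling the binning) can be sketched in base R. The function hist.freq below is a hypothetical stand-in written for this illustration, not the package's implementation, and its default of 10 bins is an arbitrary choice.

```r
# Raw values of the cor measure
cor.mat  <- abs(cor(iris[, 1:4]))
cor.vals <- cor.mat[upper.tri(cor.mat)]

# Hypothetical helper: relative frequency of the values in equal-width bins
hist.freq <- function(x, bins = 10, min = base::min(x), max = base::max(x)) {
  breaks <- seq(min, max, length.out = bins + 1)
  counts <- graphics::hist(x, breaks = breaks, plot = FALSE)$counts
  counts / length(x)
}

freq <- hist.freq(cor.vals, bins = 5, min = 0, max = 1)
freq
```

With five bins on the interval from 0 to 1, the six correlations fall into the bins with counts 1, 1, 1, 0 and 3, matching the hist.counts entries listed above, so the frequencies are those counts divided by six.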

It is also possible to define a user's own post processing method, like this:

## Compute the absolute difference between the mean and the median 
my.method <- function(x, ...) abs(mean(x) - median(x))

## Using the user defined post processing method
statistical(Species ~ ., iris, "cor", summary="my.method")

## $cor
##  my.method 
## 0.02907459
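A user-defined method can be checked on its own before passing its name to summary. The sketch below applies my.method directly to the six absolute correlations of iris, computed in base R. The second call passes an extra argument to show that the `...` in the signature lets the function ignore parameters it does not use; that such extra parameters are forwarded by the package to every summary function is an assumption.

```r
# The user-defined post processing method from the example above
my.method <- function(x, ...) abs(mean(x) - median(x))

# Raw values of the cor measure
cor.mat  <- abs(cor(iris[, 1:4]))
cor.vals <- cor.mat[upper.tri(cor.mat)]

# Direct call, and a call with an unused extra argument
direct     <- my.method(cor.vals)
with.extra <- my.method(cor.vals, bins = 5)
c(direct = direct, with.extra = with.extra)
```

Both calls return the same value, which agrees with the my.method output shown above.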

mfe: Meta-Feature Extractor

Adriano Rivolli, Luis P. F. Garcia and Andre C. P. L. F. de Carvalho

2020-05-05

Introduction

Extracting meta-features