Built-in CPOs can be listed with listCPO().
listCPO()[, c("name", "category", "subcategory")]
 | name | category | subcategory |
---|---|---|---|
11 | cpoDropConstants | data | cleanup |
12 | cpoDropMostlyConstants | data | cleanup |
37 | cpoFixFactors | data | cleanup |
10 | cpoCollapseFact | data | factor data preprocessing |
4 | cpoAsNumeric | data | feature conversion |
16 | cpoDummyEncode | data | feature conversion |
14 | cpoImpactEncodeClassif | data | feature conversion |
15 | cpoImpactEncodeRegr | data | feature conversion |
13 | cpoProbEncode | data | feature conversion |
56 | cpoQuantileBinNumerics | data | feature conversion |
62 | cpoSelect | data | feature selection |
63 | cpoSelectFreeProperties | data | feature selection |
52 | cpoAddCols | data | features |
51 | cpoMakeCols | data | features |
1 | cpoApplyFun | data | general data preprocessing |
54 | cpoModelMatrix | data | general |
38 | cpoIca | data | numeric data preprocessing |
55 | cpoPca | data | numeric data preprocessing |
59 | cpoScale | data | numeric data preprocessing |
60 | cpoScaleMaxAbs | data | numeric data preprocessing |
61 | cpoScaleRange | data | numeric data preprocessing |
65 | cpoSpatialSign | data | numeric data preprocessing |
17 | cpoFilterFeatures | featurefilter | general |
33 | cpoFilterAnova | featurefilter | specialised |
19 | cpoFilterCarscore | featurefilter | specialised |
29 | cpoFilterChiSquared | featurefilter | specialised |
27 | cpoFilterGainRatio | featurefilter | specialised |
26 | cpoFilterInformationGain | featurefilter | specialised |
34 | cpoFilterKruskal | featurefilter | specialised |
24 | cpoFilterLinearCorrelation | featurefilter | specialised |
18 | cpoFilterMrmr | featurefilter | specialised |
31 | cpoFilterOneR | featurefilter | specialised |
36 | cpoFilterPermutationImportance | featurefilter | specialised |
25 | cpoFilterRankCorrelation | featurefilter | specialised |
30 | cpoFilterRelief | featurefilter | specialised |
22 | cpoFilterRfCImportance | featurefilter | specialised |
23 | cpoFilterRfImportance | featurefilter | specialised |
20 | cpoFilterRfSRCImportance | featurefilter | specialised |
21 | cpoFilterRfSRCMinDepth | featurefilter | specialised |
28 | cpoFilterSymmetricalUncertainty | featurefilter | specialised |
32 | cpoFilterUnivariate | featurefilter | specialised |
35 | cpoFilterVariance | featurefilter | specialised |
39 | cpoImpute | imputation | general |
40 | cpoImputeAll | imputation | general |
41 | cpoImputeConstant | imputation | specialised |
49 | cpoImputeHist | imputation | specialised |
50 | cpoImputeLearner | imputation | specialised |
46 | cpoImputeMax | imputation | specialised |
43 | cpoImputeMean | imputation | specialised |
42 | cpoImputeMedian | imputation | specialised |
45 | cpoImputeMin | imputation | specialised |
44 | cpoImputeMode | imputation | specialised |
48 | cpoImputeNormal | imputation | specialised |
47 | cpoImputeUniform | imputation | specialised |
8 | cpoCache | meta | |
6 | cpoCase | meta | |
9 | cpoCbind | meta | |
5 | cpoMultiplex | meta | |
7 | cpoTransformParams | meta | |
69 | cpoWrap | meta | wrap |
70 | cpoWrapRetrafoless | meta | wrap |
66 | cpoOversample | subsampling | binary classif |
64 | cpoSmote | subsampling | binary classif |
67 | cpoUndersample | subsampling | binary classif |
68 | cpoSample | subsampling | general |
2 | cpoApplyFunRegrTarget | target | general target transformation |
57 | cpoRegrResiduals | target | residual fitting |
3 | cpoLogTrafoRegr | target | target transformation |
53 | cpoMissingIndicators | tools | imputation |
58 | cpoResponseFromSE | tools | predict.type |
NULLCPO is the neutral element of %>>%. It is returned by some functions when no other CPO or Retrafo is present.
NULLCPO
#> NULLCPO
is.nullcpo(NULLCPO)
#> [1] TRUE
NULLCPO %>>% cpoScale()
#> scale(center = TRUE, scale = TRUE)
NULLCPO %>>% NULLCPO
#> NULLCPO
print(as.list(NULLCPO))
#> list()
pipeCPO(list())
#> NULLCPO
A simple CPO with a single parameter, which is itself a CPO and gets applied to the data. Unlike a multiplexer, its parameter is free and can take any value that behaves like a CPO. The downside is that the hyperparameters of the wrapped CPO are not exposed to the outside.
cpa = cpoWrap()
print(cpa, verbose = TRUE)
#> Trafo chain of 1 cpos:
#> wrap()
#> Operating: feature
#> ParamSet:
#> Type len Def Constr Req Tunable Trafo
#> wrap.cpo untyped - - - - TRUE -
head(iris %>>% setHyperPars(cpa, wrap.cpo = cpoScale()))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 -0.8976739 1.01560199 -1.335752 -1.311052 setosa
#> 2 -1.1392005 -0.13153881 -1.335752 -1.311052 setosa
#> 3 -1.3807271 0.32731751 -1.392399 -1.311052 setosa
#> 4 -1.5014904 0.09788935 -1.279104 -1.311052 setosa
#> 5 -1.0184372 1.24503015 -1.335752 -1.311052 setosa
#> 6 -0.5353840 1.93331463 -1.165809 -1.048667 setosa
head(iris %>>% setHyperPars(cpa, wrap.cpo = cpoPca()))
#> Species PC1 PC2 PC3 PC4
#> 1 setosa -2.684126 -0.3193972 0.02791483 0.002262437
#> 2 setosa -2.714142 0.1770012 0.21046427 0.099026550
#> 3 setosa -2.888991 0.1449494 -0.01790026 0.019968390
#> 4 setosa -2.745343 0.3182990 -0.03155937 -0.075575817
#> 5 setosa -2.728717 -0.3267545 -0.09007924 -0.061258593
#> 6 setosa -2.280860 -0.7413304 -0.16867766 -0.024200858
# attaching the cpo applicator to a learner gives this learner a "wrap.cpo" hyperparameter
# that can be set to any CPO.
getParamSet(cpoWrap() %>>% makeLearner("classif.logreg"))
#> Type len Def Constr Req Tunable Trafo
#> wrap.cpo untyped - - - - TRUE -
#> model logical - TRUE - - FALSE -
Combine many CPOs into one, with an extra selected.cpo parameter that chooses between them.
cpm = cpoMultiplex(list(cpoScale, cpoPca))
print(cpm, verbose = TRUE)
#> Trafo chain of 1 cpos:
#> multiplex(selected.cpo = scale, scale.center = TRUE, scale.scale = TRUE, pca.center = TRUE, pca.scale = FALSE)
#> Operating: feature
#> ParamSet:
#> Type len Def Constr Req Tunable Trafo
#> selected.cpo discrete - scale scale,pca - TRUE -
#> scale.center logical - TRUE - Y TRUE -
#> scale.scale logical - TRUE - Y TRUE -
#> pca.center logical - TRUE - Y TRUE -
#> pca.scale logical - FALSE - Y TRUE -
head(iris %>>% setHyperPars(cpm, selected.cpo = "scale"))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 -0.8976739 1.01560199 -1.335752 -1.311052 setosa
#> 2 -1.1392005 -0.13153881 -1.335752 -1.311052 setosa
#> 3 -1.3807271 0.32731751 -1.392399 -1.311052 setosa
#> 4 -1.5014904 0.09788935 -1.279104 -1.311052 setosa
#> 5 -1.0184372 1.24503015 -1.335752 -1.311052 setosa
#> 6 -0.5353840 1.93331463 -1.165809 -1.048667 setosa
# every CPO's Hyperparameters are exported
head(iris %>>% setHyperPars(cpm, selected.cpo = "scale", scale.center = FALSE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 0.8613268 1.1296201 0.3362663 0.140405 setosa
#> 2 0.8275493 0.9682458 0.3362663 0.140405 setosa
#> 3 0.7937718 1.0327956 0.3122473 0.140405 setosa
#> 4 0.7768830 1.0005207 0.3602853 0.140405 setosa
#> 5 0.8444380 1.1618950 0.3362663 0.140405 setosa
#> 6 0.9119931 1.2587196 0.4083234 0.280810 setosa
head(iris %>>% setHyperPars(cpm, selected.cpo = "pca"))
#> Species PC1 PC2 PC3 PC4
#> 1 setosa -2.684126 -0.3193972 0.02791483 0.002262437
#> 2 setosa -2.714142 0.1770012 0.21046427 0.099026550
#> 3 setosa -2.888991 0.1449494 -0.01790026 0.019968390
#> 4 setosa -2.745343 0.3182990 -0.03155937 -0.075575817
#> 5 setosa -2.728717 -0.3267545 -0.09007924 -0.061258593
#> 6 setosa -2.280860 -0.7413304 -0.16867766 -0.024200858
A CPO that builds data-dependent CPO networks. It takes a function that decides (from the data and from user-specified hyperparameters) which CPO operation to perform. Besides the optional arguments, the hyperparameters of the CPOs used are exported as well. This is a generalization of cpoMultiplex; however, the requires of the involved parameters are not adjusted, since this is impossible in principle.
s.and.p = cpoCase(pSS(logical.param: logical),
  export.cpos = list(cpoScale(),
    cpoPca()),
  cpo.build = function(data, target, logical.param, scale, pca) {
    if (logical.param || mean(data[[1]]) > 10) {
      scale %>>% pca
    } else {
      pca %>>% scale
    }
  })
print(s.and.p, verbose = TRUE)
#> Trafo chain of 1 cpos:
#> case(scale.center = TRUE, scale.scale = TRUE, pca.center = TRUE, pca.scale = FALSE)
#> Operating: feature
#> ParamSet:
#> Type len Def Constr Req Tunable Trafo
#> logical.param logical - - - - TRUE -
#> scale.center logical - TRUE - - TRUE -
#> scale.scale logical - TRUE - - TRUE -
#> pca.center logical - TRUE - - TRUE -
#> pca.scale logical - FALSE - - TRUE -
The resulting CPO s.and.p performs scaling and PCA, with the order depending on the parameter logical.param and on whether the mean of the data’s first column exceeds 10. If either of those is true, the data is first scaled and then PCA'd; otherwise the order is reversed. All CPOs listed in export.cpos are passed to cpo.build.
cbind other CPOs as a single operation. The cbinder makes it possible to build DAGs of CPOs that perform different operations on data and paste the results next to each other.
scale = cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca()
cbinder = cpoCbind(scaled = scale, pcad = scale.pca, original = NULLCPO)
# cpoCbind recognises that "scale.scale" happens before "pca.pca" but is also fed to the
# result directly. The summary draws a (crude) ascii-art graph.
print(cbinder, verbose = TRUE)
#> Trafo chain of 1 cpos:
#> cbind(scale.center = TRUE, scale.scale = TRUE, pca.center = TRUE, pca.scale = FALSE)
#> Operating: feature
#> ParamSet:
#> Type len Def Constr Req Tunable Trafo
#> scale.center logical - TRUE - - TRUE -
#> scale.scale logical - TRUE - - TRUE -
#> pca.center logical - TRUE - - TRUE -
#> pca.scale logical - FALSE - - TRUE -
#> O>+ scale(center = TRUE, scale = TRUE)
#> | |
#> +<O pca(center = TRUE, scale = FALSE)[not exp'd: tol = <NULL>, rank = <NULL>]
#> |
#> O CBIND[scaled,pcad,original]
#>
head(iris %>>% cbinder)
#> scaled.Sepal.Length scaled.Sepal.Width scaled.Petal.Length scaled.Petal.Width
#> 1 -0.8976739 1.01560199 -1.335752 -1.311052
#> 2 -1.1392005 -0.13153881 -1.335752 -1.311052
#> 3 -1.3807271 0.32731751 -1.392399 -1.311052
#> 4 -1.5014904 0.09788935 -1.279104 -1.311052
#> 5 -1.0184372 1.24503015 -1.335752 -1.311052
#> 6 -0.5353840 1.93331463 -1.165809 -1.048667
#> scaled.Species pcad.Species pcad.PC1 pcad.PC2 pcad.PC3 pcad.PC4
#> 1 setosa setosa -2.257141 -0.4784238 0.12727962 0.024087508
#> 2 setosa setosa -2.074013 0.6718827 0.23382552 0.102662845
#> 3 setosa setosa -2.356335 0.3407664 -0.04405390 0.028282305
#> 4 setosa setosa -2.291707 0.5953999 -0.09098530 -0.065735340
#> 5 setosa setosa -2.381863 -0.6446757 -0.01568565 -0.035802870
#> 6 setosa setosa -2.068701 -1.4842053 -0.02687825 0.006586116
#> original.Sepal.Length original.Sepal.Width original.Petal.Length
#> 1 5.1 3.5 1.4
#> 2 4.9 3.0 1.4
#> 3 4.7 3.2 1.3
#> 4 4.6 3.1 1.5
#> 5 5.0 3.6 1.4
#> 6 5.4 3.9 1.7
#> original.Petal.Width original.Species
#> 1 0.2 setosa
#> 2 0.2 setosa
#> 3 0.2 setosa
#> 4 0.2 setosa
#> 5 0.2 setosa
#> 6 0.4 setosa
# the unnecessary copies of "Species" are unfortunate. Remove them with cpoSelect:
selector = cpoSelect(type = "numeric")
cbinder.select = cpoCbind(scaled = selector %>>% scale, pcad = selector %>>% scale.pca, original = NULLCPO)
cbinder.select
#> cbind(scale.center = TRUE, scale.scale = TRUE, pca.center = TRUE, pca.scale = FALSE)
head(iris %>>% cbinder)
#> scaled.Sepal.Length scaled.Sepal.Width scaled.Petal.Length scaled.Petal.Width
#> 1 -0.8976739 1.01560199 -1.335752 -1.311052
#> 2 -1.1392005 -0.13153881 -1.335752 -1.311052
#> 3 -1.3807271 0.32731751 -1.392399 -1.311052
#> 4 -1.5014904 0.09788935 -1.279104 -1.311052
#> 5 -1.0184372 1.24503015 -1.335752 -1.311052
#> 6 -0.5353840 1.93331463 -1.165809 -1.048667
#> scaled.Species pcad.Species pcad.PC1 pcad.PC2 pcad.PC3 pcad.PC4
#> 1 setosa setosa -2.257141 -0.4784238 0.12727962 0.024087508
#> 2 setosa setosa -2.074013 0.6718827 0.23382552 0.102662845
#> 3 setosa setosa -2.356335 0.3407664 -0.04405390 0.028282305
#> 4 setosa setosa -2.291707 0.5953999 -0.09098530 -0.065735340
#> 5 setosa setosa -2.381863 -0.6446757 -0.01568565 -0.035802870
#> 6 setosa setosa -2.068701 -1.4842053 -0.02687825 0.006586116
#> original.Sepal.Length original.Sepal.Width original.Petal.Length
#> 1 5.1 3.5 1.4
#> 2 4.9 3.0 1.4
#> 3 4.7 3.2 1.3
#> 4 4.6 3.1 1.5
#> 5 5.0 3.6 1.4
#> 6 5.4 3.9 1.7
#> original.Petal.Width original.Species
#> 1 0.2 setosa
#> 2 0.2 setosa
#> 3 0.2 setosa
#> 4 0.2 setosa
#> 5 0.2 setosa
#> 6 0.4 setosa
# alternatively, we apply the cbinder only to numerical data
head(iris %>>% cpoWrap(cbinder, affect.type = "numeric"))
#> Species scaled.Sepal.Length scaled.Sepal.Width scaled.Petal.Length
#> 1 setosa -0.8976739 1.01560199 -1.335752
#> 2 setosa -1.1392005 -0.13153881 -1.335752
#> 3 setosa -1.3807271 0.32731751 -1.392399
#> 4 setosa -1.5014904 0.09788935 -1.279104
#> 5 setosa -1.0184372 1.24503015 -1.335752
#> 6 setosa -0.5353840 1.93331463 -1.165809
#> scaled.Petal.Width pcad.PC1 pcad.PC2 pcad.PC3 pcad.PC4
#> 1 -1.311052 -2.257141 -0.4784238 0.12727962 0.024087508
#> 2 -1.311052 -2.074013 0.6718827 0.23382552 0.102662845
#> 3 -1.311052 -2.356335 0.3407664 -0.04405390 0.028282305
#> 4 -1.311052 -2.291707 0.5953999 -0.09098530 -0.065735340
#> 5 -1.311052 -2.381863 -0.6446757 -0.01568565 -0.035802870
#> 6 -1.048667 -2.068701 -1.4842053 -0.02687825 0.006586116
#> original.Sepal.Length original.Sepal.Width original.Petal.Length
#> 1 5.1 3.5 1.4
#> 2 4.9 3.0 1.4
#> 3 4.7 3.2 1.3
#> 4 4.6 3.1 1.5
#> 5 5.0 3.6 1.4
#> 6 5.4 3.9 1.7
#> original.Petal.Width
#> 1 0.2
#> 2 0.2
#> 3 0.2
#> 4 0.2
#> 5 0.2
#> 6 0.4
cpoTransformParams wraps another CPO and sets some of its hyperparameters to the value of expressions depending on other hyperparameter values. This can be used to make a transformation of parameters similar to the trafo parameter of a Param in ParamHelpers, but it can also be used to set multiple parameters at the same time, depending on a single new parameter.
cpo = cpoTransformParams(cpoPca(), alist(pca.scale = pca.center))
retr = pid.task %>|% setHyperPars(cpo, pca.center = FALSE)
getCPOTrainedState(retr)$control # both 'center' and 'scale' are FALSE
#> CPO Retrafo chain
#> [RETRAFO pca(center = FALSE, scale = FALSE)]
mplx = cpoMultiplex(list(cpoIca(export = "n.comp"), cpoPca(export = "rank")))
!mplx
#> Trafo chain of 1 cpos:
#> multiplex(selected.cpo = ica, ica.n.comp = <NULL>, pca.rank = <NULL>)
#> Operating: feature
#> ParamSet:
#> Type len Def Constr Req Tunable Trafo
#> selected.cpo discrete - ica ica,pca - TRUE -
#> ica.n.comp integer - <NULL> 1 to Inf Y TRUE -
#> pca.rank integer - <NULL> 1 to Inf Y TRUE -
mtx = cpoTransformParams(mplx, alist(ica.n.comp = comp, pca.rank = comp),
  pSS(comp: integer[1, ]), list(comp = 1))
head(iris %>>% setHyperPars(mtx, selected.cpo = "ica", comp = 2))
#> Species V1 V2
#> 1 setosa 0.5040262 1.372772
#> 2 setosa -0.5026081 1.277214
#> 3 setosa -0.4470063 1.369134
#> 4 setosa -0.7903465 1.261004
#> 5 setosa 0.5165524 1.396033
#> 6 setosa 1.3797295 1.270769
head(iris %>>% setHyperPars(mtx, selected.cpo = "pca", comp = 3))
#> Species PC1 PC2 PC3
#> 1 setosa -2.684126 -0.3193972 0.02791483
#> 2 setosa -2.714142 0.1770012 0.21046427
#> 3 setosa -2.888991 0.1449494 -0.01790026
#> 4 setosa -2.745343 0.3182990 -0.03155937
#> 5 setosa -2.728717 -0.3267545 -0.09007924
#> 6 setosa -2.280860 -0.7413304 -0.16867766
Implements the base::scale function.
df = data.frame(a = 1:3, b = -(1:3) * 10)
df %>>% cpoScale()
#> a b
#> 1 -1 1
#> 2 0 0
#> 3 1 -1
df %>>% cpoScale(scale = FALSE) # center = TRUE
#> a b
#> 1 -1 10
#> 2 0 0
#> 3 1 -10
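As a cross-check, the same numbers can be obtained from base::scale directly. This is a sketch in base R only, assuming the behaviour shown above (centering by the column mean, scaling by the column standard deviation); it does not use mlrCPO, and the names scaled and centered are only illustrative.

```r
# same toy data as above
df = data.frame(a = 1:3, b = -(1:3) * 10)

# center and scale each column, the default of base::scale
scaled = as.data.frame(scale(df))

# center only, corresponding to scale = FALSE
centered = as.data.frame(scale(df, center = TRUE, scale = FALSE))

scaled
centered
```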
Implements stats::prcomp. By default the data is centered but not scaled (center = TRUE, scale = FALSE).
df %>>% cpoPca()
#> PC1 PC2
#> 1 -10.04988 4.163336e-16
#> 2 0.00000 0.000000e+00
#> 3 10.04988 -4.163336e-16
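For comparison, a base R sketch that calls stats::prcomp on the same data with its defaults (center = TRUE, scale. = FALSE). The sign of a principal component is arbitrary, so columns may come out with flipped signs relative to the output above.

```r
# same toy data as above
df = data.frame(a = 1:3, b = -(1:3) * 10)

# prcomp centers by default and does not scale
pc = prcomp(df)

# rotated data; the first component carries all of the variance here
pc$x
```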
Dummy encoding of factor variables. Optionally uses the first factor level as the reference category.
head(iris %>>% cpoDummyEncode())
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa
#> 1 5.1 3.5 1.4 0.2 1
#> 2 4.9 3.0 1.4 0.2 1
#> 3 4.7 3.2 1.3 0.2 1
#> 4 4.6 3.1 1.5 0.2 1
#> 5 5.0 3.6 1.4 0.2 1
#> 6 5.4 3.9 1.7 0.4 1
#> Speciesversicolor Speciesvirginica
#> 1 0 0
#> 2 0 0
#> 3 0 0
#> 4 0 0
#> 5 0 0
#> 6 0 0
head(iris %>>% cpoDummyEncode(reference.cat = TRUE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Speciesversicolor
#> 1 5.1 3.5 1.4 0.2 0
#> 2 4.9 3.0 1.4 0.2 0
#> 3 4.7 3.2 1.3 0.2 0
#> 4 4.6 3.1 1.5 0.2 0
#> 5 5.0 3.6 1.4 0.2 0
#> 6 5.4 3.9 1.7 0.4 0
#> Speciesvirginica
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> 6 0
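The two encodings correspond to what stats::model.matrix produces with and without an intercept. The following base R sketch only illustrates the idea; it is not a claim about how cpoDummyEncode is implemented, and the names full and ref are made up for this example.

```r
# one indicator column per level (no reference category)
full = model.matrix(~ 0 + Species, data = iris)

# treatment contrasts: the first level, "setosa", becomes the reference
# and gets no column of its own; [, -1] drops the intercept column
ref = model.matrix(~ Species, data = iris)[, -1]

colnames(full)
colnames(ref)
```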
Selects only certain columns of a dataset, by column index, name, type, or regex pattern.
head(iris %>>% cpoSelect(pattern = "Width"))
#> Sepal.Width Petal.Width
#> 1 3.5 0.2
#> 2 3.0 0.2
#> 3 3.2 0.2
#> 4 3.1 0.2
#> 5 3.6 0.2
#> 6 3.9 0.4
# selection is additive
head(iris %>>% cpoSelect(pattern = "Width", type = "factor"))
#> Sepal.Width Petal.Width Species
#> 1 3.5 0.2 setosa
#> 2 3.0 0.2 setosa
#> 3 3.2 0.2 setosa
#> 4 3.1 0.2 setosa
#> 5 3.6 0.2 setosa
#> 6 3.9 0.4 setosa
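"Additive" here means that a column is kept if it matches any of the criteria. A base R sketch of the same union, with illustrative variable names:

```r
# regex match on column names
by.pattern = grepl("Width", names(iris))

# type match: factor columns
by.type = vapply(iris, is.factor, logical(1))

# a column is kept if either criterion selects it
selected = names(iris)[by.pattern | by.type]
head(iris[selected])
```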
Drops constant features. For numeric features, the tolerance for counting as constant is adjustable.
head(iris) %>>% cpoDropConstants() # drops 'Species'
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
head(iris) %>>% cpoDropConstants(abs.tol = 0.2) # also drops 'Petal.Width'
#> Sepal.Length Sepal.Width Petal.Length
#> 1 5.1 3.5 1.4
#> 2 4.9 3.0 1.4
#> 3 4.7 3.2 1.3
#> 4 4.6 3.1 1.5
#> 5 5.0 3.6 1.4
#> 6 5.4 3.9 1.7
Drops unused factor levels and makes sure prediction data has the same factor levels as training data.
levels(iris$Species)
#> [1] "setosa" "versicolor" "virginica"
irisfix = head(iris) %>>% cpoFixFactors() # Species only has level 'setosa' in train
levels(irisfix$Species)
#> [1] "setosa"
rf = retrafo(irisfix)
iris[c(1, 100, 140), ]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 100 5.7 2.8 4.1 1.3 versicolor
#> 140 6.9 3.1 5.4 2.1 virginica
iris[c(1, 100, 140), ] %>>% rf
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 100 5.7 2.8 4.1 1.3 <NA>
#> 140 6.9 3.1 5.4 2.1 <NA>
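The effect on prediction data can be mimicked in base R: levels not seen during training have no counterpart and turn into NA. This is only a sketch of the observable behaviour above; the variable names are made up for the example.

```r
# "training" data: droplevels removes the unused levels, only "setosa" remains
train = droplevels(head(iris))

# "prediction" data containing levels that were not seen in training
new = iris[c(1, 100, 140), ]

# re-encode with the training levels; unseen levels become NA
new$Species = factor(as.character(new$Species), levels = levels(train$Species))
new
```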
Creates columns indicating missing data. Most useful in combination with cpoCbind.
impdata = df
impdata[[1]][1] = NA
impdata
#> a b
#> 1 NA -10
#> 2 2 -20
#> 3 3 -30
impdata %>>% cpoMissingIndicators()
#> a
#> 1 TRUE
#> 2 FALSE
#> 3 FALSE
impdata %>>% cpoCbind(NULLCPO, dummy = cpoMissingIndicators())
#> a b dummy.a
#> 1 NA -10 TRUE
#> 2 2 -20 FALSE
#> 3 3 -30 FALSE
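A base R sketch of the same result, assuming (as the output above suggests) that indicator columns are only created for features that actually contain missing values. The data frame is rebuilt here so the snippet runs on its own.

```r
impdata = data.frame(a = c(NA, 2, 3), b = -(1:3) * 10)

# logical indicator for every column
na.flags = as.data.frame(lapply(impdata, is.na))

# keep only indicators of columns that contain at least one NA
na.flags = na.flags[, vapply(na.flags, any, logical(1)), drop = FALSE]

# prefix the names and paste the indicators next to the original data
names(na.flags) = paste0("dummy.", names(na.flags))
combined = cbind(impdata, na.flags)
combined
```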
Apply a univariate function to data columns.
head(iris %>>% cpoApplyFun(function(x) sqrt(x) - 10, affect.type = "numeric"))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 -7.741682 -8.129171 -8.816784 -9.552786 setosa
#> 2 -7.786406 -8.267949 -8.816784 -9.552786 setosa
#> 3 -7.832052 -8.211146 -8.859825 -9.552786 setosa
#> 4 -7.855239 -8.239318 -8.775255 -9.552786 setosa
#> 5 -7.763932 -8.102633 -8.816784 -9.552786 setosa
#> 6 -7.676210 -8.025158 -8.696160 -9.367544 setosa
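Restricting the function to numeric columns, as affect.type = "numeric" does above, looks roughly like this in base R (a sketch with illustrative names, not the package's implementation):

```r
# which columns are numeric?
num = vapply(iris, is.numeric, logical(1))

# apply the function column by column, leaving the factor untouched
out = iris
out[num] = lapply(iris[num], function(x) sqrt(x) - 10)
head(out)
```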
Convert (non-numeric) features to numeric
head(iris[sample(nrow(iris), 10), ] %>>% cpoAsNumeric())
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 14 4.3 3.0 1.1 0.1 1
#> 50 5.0 3.3 1.4 0.2 1
#> 118 7.7 3.8 6.7 2.2 3
#> 43 4.4 3.2 1.3 0.2 1
#> 150 5.9 3.0 5.1 1.8 3
#> 148 6.5 3.0 5.2 2.0 3
Combine low-prevalence factor levels. Set max.collapsed.class.prevalence to control how large the combined factor level may be.
iris2 = iris
iris2$Species = factor(c("a", "b", "c", "b", "b", "c", "b", "c",
  as.character(iris2$Species[-(1:8)])))
head(iris2, 10)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 a
#> 2 4.9 3.0 1.4 0.2 b
#> 3 4.7 3.2 1.3 0.2 c
#> 4 4.6 3.1 1.5 0.2 b
#> 5 5.0 3.6 1.4 0.2 b
#> 6 5.4 3.9 1.7 0.4 c
#> 7 4.6 3.4 1.4 0.3 b
#> 8 5.0 3.4 1.5 0.2 c
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
head(iris2 %>>% cpoCollapseFact(max.collapsed.class.prevalence = 0.2), 10)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 collapsed
#> 2 4.9 3.0 1.4 0.2 collapsed
#> 3 4.7 3.2 1.3 0.2 collapsed
#> 4 4.6 3.1 1.5 0.2 collapsed
#> 5 5.0 3.6 1.4 0.2 collapsed
#> 6 5.4 3.9 1.7 0.4 collapsed
#> 7 4.6 3.4 1.4 0.3 collapsed
#> 8 5.0 3.4 1.5 0.2 collapsed
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
Specify which columns get used, and how they are transformed, using a formula.
head(iris %>>% cpoModelMatrix(~0 + Species:Petal.Width))
#> Speciessetosa:Petal.Width Speciesversicolor:Petal.Width
#> 1 0.2 0
#> 2 0.2 0
#> 3 0.2 0
#> 4 0.2 0
#> 5 0.2 0
#> 6 0.4 0
#> Speciesvirginica:Petal.Width
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> 6 0
# use . + ... to retain originals
head(iris %>>% cpoModelMatrix(~0 + . + Species:Petal.Width))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa
#> 1 5.1 3.5 1.4 0.2 1
#> 2 4.9 3.0 1.4 0.2 1
#> 3 4.7 3.2 1.3 0.2 1
#> 4 4.6 3.1 1.5 0.2 1
#> 5 5.0 3.6 1.4 0.2 1
#> 6 5.4 3.9 1.7 0.4 1
#> Speciesversicolor Speciesvirginica Petal.Width:Speciesversicolor
#> 1 0 0 0
#> 2 0 0 0
#> 3 0 0 0
#> 4 0 0 0
#> 5 0 0 0
#> 6 0 0 0
#> Petal.Width:Speciesvirginica
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> 6 0
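The column names above follow the conventions of stats::model.matrix. Assuming the formula is interpreted the same way, the first example can be reproduced in base R as follows; mm is just an illustrative name.

```r
# interaction of the factor with a numeric column, without intercept:
# one column per Species level, holding Petal.Width for rows of that level
mm = model.matrix(~ 0 + Species:Petal.Width, data = iris)

colnames(mm)
head(mm)
```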
Scale values to a given range.
head(iris %>>% cpoScaleRange(-1, 1))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 -0.5555556 0.25000000 -0.8644068 -0.9166667 setosa
#> 2 -0.6666667 -0.16666667 -0.8644068 -0.9166667 setosa
#> 3 -0.7777778 0.00000000 -0.8983051 -0.9166667 setosa
#> 4 -0.8333333 -0.08333333 -0.8305085 -0.9166667 setosa
#> 5 -0.6111111 0.33333333 -0.8644068 -0.9166667 setosa
#> 6 -0.3888889 0.58333333 -0.7627119 -0.7500000 setosa
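The numbers above are consistent with a linear map of each column's minimum and maximum onto the requested bounds. A base R sketch under that assumption (rescale is a hypothetical helper, not part of mlrCPO):

```r
# map min(x) to lower and max(x) to upper, linearly in between
rescale = function(x, lower = 0, upper = 1) {
  lower + (x - min(x)) / (max(x) - min(x)) * (upper - lower)
}

num = vapply(iris, is.numeric, logical(1))
out = iris
out[num] = lapply(iris[num], rescale, lower = -1, upper = 1)
head(out)
```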
Multiply features to set the maximum absolute value.
head(iris %>>% cpoScaleMaxAbs(0.1))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 0.06455696 0.07954545 0.02028986 0.008 setosa
#> 2 0.06202532 0.06818182 0.02028986 0.008 setosa
#> 3 0.05949367 0.07272727 0.01884058 0.008 setosa
#> 4 0.05822785 0.07045455 0.02173913 0.008 setosa
#> 5 0.06329114 0.08181818 0.02028986 0.008 setosa
#> 6 0.06835443 0.08863636 0.02463768 0.016 setosa
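In base R terms, each column is divided by its largest absolute value and multiplied by the target. The helper scale.maxabs below is hypothetical and only sketches the arithmetic behind the output above.

```r
# multiply x so that its largest absolute value equals maxabs
scale.maxabs = function(x, maxabs = 1) x / max(abs(x)) * maxabs

num = vapply(iris, is.numeric, logical(1))
out = iris
out[num] = lapply(iris[num], scale.maxabs, maxabs = 0.1)
head(out)
```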
Normalize values row-wise, so that each row has unit Euclidean length.
head(iris %>>% cpoSpatialSign())
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 0.8037728 0.5516088 0.2206435 0.03152050 setosa
#> 2 0.8281329 0.5070201 0.2366094 0.03380134 setosa
#> 3 0.8053331 0.5483119 0.2227517 0.03426949 setosa
#> 4 0.8000302 0.5391508 0.2608794 0.03478392 setosa
#> 5 0.7909650 0.5694948 0.2214702 0.03163860 setosa
#> 6 0.7841750 0.5663486 0.2468699 0.05808704 setosa
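The spatial sign of a row is the row divided by its Euclidean norm. A minimal base R sketch on the numeric columns of iris (variable names are illustrative):

```r
num = vapply(iris, is.numeric, logical(1))
x = as.matrix(iris[num])

# rowSums(x^2) has one entry per row; dividing the matrix by it
# recycles down the columns, so row i is divided by the norm of row i
signed = x / sqrt(rowSums(x^2))
head(signed)
```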
There are two general and many specialised imputation CPOs. The general imputation CPOs have parameters that let them use different imputation methods on different columns. They are a thin wrapper around mlr’s impute() and reimpute() functions. The specialised imputation CPOs each implement exactly one imputation method and are closer to the behaviour of typical CPOs.
cpoImpute and cpoImputeAll both have parameters very much like impute(). The latter assumes that all columns of its input are imputed in some way and can therefore be prepended to a learner to give it the ability to work with missing data. It will, however, throw an error if data is still missing after imputation.
impdata %>>% cpoImpute(cols = list(a = imputeMedian()))
#> a b
#> 1 2.5 -10
#> 2 2.0 -20
#> 3 3.0 -30
impdata %>>% cpoImpute(cols = list(b = imputeMedian())) # NAs remain
#> a b
#> 1 NA -10
#> 2 2 -20
#> 3 3 -30
impdata %>>% cpoImputeAll(cols = list(b = imputeMedian())) # error, since NAs remain
#> Error in assertPropertiesOk(present.properties, setdiff(allowed.properties, : Data returned by CPO trafo has property missings that impute declared in .properties.adding.
#> properties in .properties.adding may not be present in trafo output.
missing.task = makeRegrTask("missing.task", impdata, target = "b")
# the following gives an error, since 'cpoImpute' does not make sure all missings are removed
# and hence does not add the 'missings' property.
train(cpoImpute(cols = list(a = imputeMedian())) %>>% makeLearner("regr.lm"), missing.task)
#> Error in checkLearnerBeforeTrain(task, learner, weights): Task 'missing.task' has missing values in 'a', but learner 'regr.lm.impute' does not support that!
# instead, the following works:
train(cpoImputeAll(cols = list(a = imputeMedian())) %>>% makeLearner("regr.lm"), missing.task)
#> Model for learner.id=regr.lm.impute; learner.class=CPOLearner
#> Trained on: task.id = missing.task; obs = 3; features = 1
#> Hyperparameters: impute.target.cols=character(0),impute.classes=list(),impute.cols=a=<ImputeMethod>,impute.dummy.classes=character(0),impute.dummy.cols=character(0),impute.dummy.type=factor,impute.force.dummies=FALSE,impute.impute.new.levels=TRUE,impute.recode.factor.levels=TRUE
There is one specialised imputation CPO for each imputation method.
impdata %>>% cpoImputeConstant(10)
#> a b
#> 1 10 -10
#> 2 2 -20
#> 3 3 -30
getTaskData(missing.task %>>% cpoImputeMedian())
#> a b
#> 1 2.5 -10
#> 2 2.0 -20
#> 3 3.0 -30
# The specialised impute CPOs are:
listCPO()[listCPO()$category == "imputation" & listCPO()$subcategory == "specialised",
c("name", "description")]
#> name
#> 41 cpoImputeConstant
#> 49 cpoImputeHist
#> 50 cpoImputeLearner
#> 46 cpoImputeMax
#> 43 cpoImputeMean
#> 42 cpoImputeMedian
#> ... (#rows: 10, #cols: 1)
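What median imputation does to column a can be reproduced by hand. This base R sketch rebuilds impdata so it runs on its own, and only covers the training-time replacement, not the re-application to new data that a CPO also handles.

```r
impdata = data.frame(a = c(NA, 2, 3), b = -(1:3) * 10)

# replace missing entries of 'a' by the median of the observed values
imputed = impdata
imputed$a[is.na(imputed$a)] = median(imputed$a, na.rm = TRUE)
imputed
```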
There is one general feature filtering CPO and many specialised ones. The general filtering CPO, cpoFilterFeatures, is a thin wrapper around filterFeatures and takes the filtering method as its argument. The specialised CPOs each call a specific filtering method.
Most arguments of filterFeatures are reflected in the CPOs. The exceptions are:
1. For cpoFilterFeatures, the filter method arguments are given in a list filter.args instead of in ...
2. The argument fval was dropped for the specialised filter CPOs.
3. The argument mandatory.feat was dropped. Use affect.* parameters to prevent features from being filtered.
head(getTaskData(iris.task %>>% cpoFilterFeatures(method = "variance", perc = 0.5)))
#> Sepal.Length Petal.Length Species
#> 1 5.1 1.4 setosa
#> 2 4.9 1.4 setosa
#> 3 4.7 1.3 setosa
#> 4 4.6 1.5 setosa
#> 5 5.0 1.4 setosa
#> 6 5.4 1.7 setosa
head(getTaskData(iris.task %>>% cpoFilterVariance(perc = 0.5)))
#> Sepal.Length Petal.Length Species
#> 1 5.1 1.4 setosa
#> 2 4.9 1.4 setosa
#> 3 4.7 1.3 setosa
#> 4 4.6 1.5 setosa
#> 5 5.0 1.4 setosa
#> 6 5.4 1.7 setosa
# The specialised filter CPOs are:
listCPO()[listCPO()$category == "featurefilter" & listCPO()$subcategory == "specialised",
c("name", "description")]
#> name
#> 33 cpoFilterAnova
#> 19 cpoFilterCarscore
#> 29 cpoFilterChiSquared
#> 27 cpoFilterGainRatio
#> 26 cpoFilterInformationGain
#> 34 cpoFilterKruskal
#> ... (#rows: 19, #cols: 1)
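To see why the variance filter with perc = 0.5 keeps Sepal.Length and Petal.Length, the ranking can be done by hand in base R. This sketch assumes that perc keeps the given fraction of features with the highest score; how fractional counts are rounded is not covered here, since 50% of four features is exactly two.

```r
# all features except the target
feats = iris[names(iris) != "Species"]

# score every feature by its variance
v = vapply(feats, var, numeric(1))

# keep the top 50% of features by score
n.keep = ceiling(0.5 * length(v))
keep = names(sort(v, decreasing = TRUE))[seq_len(n.keep)]

# retain the original column order and re-attach the target
kept.cols = c(intersect(names(feats), keep), "Species")
head(iris[kept.cols])
```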