First Steps (No Output)

What is mlrCPO?

“Composable Preprocessing Operators”, “CPO”, are an extension for the mlr (“Machine Learning in R”) project which present preprocessing operations in the form of R objects. These CPO objects can be composed to form complex operations, they can be applied to data sets, and can be attached to mlr Learner objects to generate machine learning pipelines that combine preprocessing and model fitting.

What is Preprocessing

“Preprocessing”, as understood by mlrCPO, is any manipulation of data used in a machine learning process to get it from its form as found in the wild into a form more fitting for the machine learning algorithm (“Learner”) used for model fitting. It is important that the exact method of preprocessing is kept track of, to be able to perform this method when the resulting model is used to make predictions on new data. It is also important, when evaluating preprocessing methods e.g. using resampling, that the parameters of these methods are independent of the validation dataset and only depend on the training data set.

mlrCPO tries to support the user in all these aspects of preprocessing:

By providing a large set of atomic preprocessing CPOs that can perform many different operations. Operations that go beyond the provided toolset can be implemented in custom CPOs.
By using “CPOTrained” objects that represent the preprocessing done on training data that should, in that way, be re-applied to new prediction data.
By making it possible to combine preprocessing objects with mlr “Learner” objects that represent the entinre machine learning pipeline to be tuned and evaluated.

Preprocessing Operations

At the centre of mlrCPO are “CPO” objects. To get a CPO object, it is necessary to call a CPO Constructor. A CPO Constructor sets up the parameters of a CPO and provides further options for its behaviour. Internally, CPO Constructors are functions that have a common interface and a friendly printer method.

cpoScale  # a cpo constructor

cpoAddCols

cpoScale(center = FALSE)  # create a CPO object that scales, but does not center, data

cpoAddCols(Sepal.Area = Sepal.Length * Sepal.Width)  #  this would add a column

CPOs exist first to be applied to data. Every CPO represents a certain data transformation, and this transformation is performed when the CPO is applied. This can be done using the applyCPO function, or the %>>% operator. CPOs can be applied to data.frame objects, and to mlr “Task” objects.

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
tail(iris.demo %>>% cpoQuantileBinNumerics())  # bin the data in below & above median

A useful feature of CPOs is that they can be concatenated to form new operations. Two CPOs can be combined using the composeCPO function or, as before, the %>>% operator. When two CPOs are combined, the product is a new CPO that can itself be composed or applied. The result of a composition represents the operation of first applying the first CPO and then the second CPO. Therefore, data %>>% (cpo1 %>>% cpo2) is the same as (data %>>% cpo1) %>>% cpo2.

# first create three quantile bins, then as.numeric() all columns to
# get 1, 2 or 3 as the bin number
quantilenum = cpoQuantileBinNumerics(numsplits = 3) %>>% cpoAsNumeric()
iris.demo %>>% quantilenum

The last example shows that it is sometimes not a good idea to have a CPO affect the whole dataset. Therefore, when a CPO is created, it is possible to choose what columns the CPO should affect. The CPO Constructor has a variety of parameters, starting with affect., that can be used to choose what columns the CPO operates on. To prevent cpoAsNumeric from influencing the Species column, we can thus do

quantilenum.restricted = cpoQuantileBinNumerics(numsplits = 3) %>>%
  cpoAsNumeric(affect.names = "Species", affect.invert = TRUE)
iris.demo %>>% quantilenum.restricted

A more convenient method in this case, however, is to use an mlr “Task”, which keeps track of the target column. “Feature Operation” CPOs (as all the ones shown) do not influence the target column.

demo.task = makeClassifTask(data = iris.demo, target = "Species")
result = demo.task %>>% quantilenum
getTaskData(result)

Hyperparameters

When performing preprocessing, it is sometimes necessary to change a small aspect of a long preprocessing pipeline. Instead of having to re-construct the whole pipeline, mlrCPO offers the possibility to change hyperparameters of a CPO. This makes it very easy e.g. for tuning of preprocessing in combination with a machine learning algorithm.

Hyperparameters of CPOs can be manipulated in the same way as they are manipulated for Learners in mlr, using getParamSet (to list the parameters), getHyperPars (to list the parameter values), and setHyperPars (to change these values). To get the parameter set of a CPO, it is also possible to use verbose printing using the ! (exclamation mark) operator.

cpo = cpoScale()
cpo

getHyperPars(cpo)  # list of parameter names and values

getParamSet(cpo)  # more detailed view of parameters and their type / range

!cpo  # equivalent to print(cpo, verbose = TRUE)

CPOs use copy semantics, therefore setHyperPars creates a copy of a CPO that has the changed hyperparameters.

cpo2 = setHyperPars(cpo, scale.scale = FALSE)
cpo2

iris.demo %>>% cpo  # scales and centers

iris.demo %>>% cpo2 # only centers

When chaining many CPOs, it is possible for the many hyperparameters to lead to very cluttered ParamSets, or even for hyperparameter names to clash. mlrCPO has two remedies for that.

First, any CPO also has an id that is always prepended to the hyperparameter names. It can be set during construction, using the id parameter, or changed later using setCPOId. The latter one only works on primitive, i.e. not compound, CPOs. Set the id to NULL to use the CPO’s hyperparameters without a prefix.

cpo = cpoScale(id = "a") %>>% cpoScale(id = "b")  # not very useful example
getHyperPars(cpo)

The second remedy against hyperparameter clashes is different “exports” of hyperparameters: The hyperparameters that can be changed using setHyperPars, i.e. that are exported by a CPO, are a subset of the parameters of the CPOConstructor. For each kind of CPO, there is a standard set of parameters that are exported, but during construction, it is possible to influence the parameters that actually get exported via the export parameter. export can be one of a set of standard export settings (among them “export.all” and “export.none”) or a character vector of the parameters to export.

cpo = cpoPca(export = c("center", "rank"))
getParamSet(cpo)

Retrafo

Manipulating data for preprocessing itself is relatively easy. A challenge comes when one wants to integrate preprocessing into a machine-learning pipeline: The same preprocessing steps that are performed on the training data need to be performed on the new prediction data. However, the transformation performed for prediction often needs information from the training step. For example, if training entail performing PCA, then for prediction, the data must not undergo another PCA, instead it needs to be rotated by the rotation matrix found by the training PCA. The process of obtaining the rotation matrix will be called “training” the CPO, and the object that contains the trained information is called CPOTrained. For preprocessing operations that operate only on features of a task (as opposed to the target column), the CPOTrained will always be applied to new incoming data, and hence be of class CPORetrafo and called a “retrafo” object. To obtain this retrafo object, one can use retrafo(). Retrafo objects can be applied to data just as CPOs can, by using the %>>% operator.

transformed = iris.demo %>>% cpoPca(rank = 3)
transformed

ret = retrafo(transformed)
ret

To show that ret actually represents the exact same preprocessing operation, we can feed the first line of iris.demo back to it, to verify that the transformation is the same.

iris.demo[1, ] %>>% ret

We obviously would not have gotten there by feeding the first line to cpoPca directly:

iris.demo[1, ] %>>% cpoPca(rank = 3)

CPOTrained objects associated with an object are automatically chained when another CPO is applied. To prevent this from happening, it is necessary to “clear” the retrafos and inverters associated with the object using clearRI().

t2 = transformed %>>% cpoScale()
retrafo(t2)

t3 = clearRI(transformed) %>>% cpoScale()
retrafo(t3)

Note that clearRI has no influence on the CPO operations themselves, and the resulting data is the same:

all.equal(t2, t3, check.attributes = FALSE)

It is also possible to chain CPOTrained object using composeCPO() or %>>%. This can be useful if the trafo chain loses access to the retrafo attribute for some reason. In general, it is only recommended to compose CPOTrained objects that were created in the same process and in correct order, since they are usually closely associated with the training data in a particular place within the preprocessing chain.

retrafo(transformed) %>>% retrafo(t3)  # is the same as retrafo(t2) above.

Inverter

So far only CPOs were introduced that change the feature columns of a Task. (“Feature Operation CPOs”–FOCPOs). There is another class of CPOs, “Target Operation CPOs” or TOCPOs, that can change a Task’s target columns.

This comes at the cost of some complexity when performing prediction: Since the training data that was ultimately fed into a Learner had a transformed target column, the predictions made by the resulting model will not be directly comparable to the original target values. Consider cpoLogTrafoRegr, a CPO that log-transforms the target variable of a regression Task. The predictions made with a Learner on a log-transformed target variable will be in log-space and need to be exponentiated (or otherwise re-transformed). This inversion operation is represented by an “inverter” object that is attached to a transformation result similarly to a retrafo object, and can be obtained using the inverter() function. It is of class CPOInverter, a subclass of CPOTrained.

iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width")
iris.logd = iris.regr %>>% cpoLogTrafoRegr()

getTaskData(iris.logd)  # log-transformed target 'Petal.Width'

inv = inverter(iris.logd)  # inverter object
inv

The inverter object is used by the invert() function that inverts the prediction made by a model trained on the transformed task, and re-transforms this prediction to fit the space of the original target data. The inverter object caches the “truth” of the data being inverted (iris.logd, in the example), so invert can give information on the truth of the inverted data.

logmodel = train("regr.lm", iris.logd)
pred = predict(logmodel, iris.logd)  # prediction on the task itself
pred

invert(inv, pred)

This procedure can also be done with new incoming data. In general, more than just the cpoLogTrafoRegr operation could be done on the iris.regr task in the example, so to perform the complete preprocessing and inversion, one needs to use the retrafo object as well. When applying the retrafo object, a new inverter object is generated, which is specific to the exact new data that was being retransformed:

newdata = makeRegrTask("newiris", iris[7:9, ], target = "Petal.Width",
  fixup.data = "no", check.data = FALSE)

# the retrafo does the same transformation(s) on newdata that were
# done on the training data of the model, iris.logd. In general, this
# could be more than just the target log transformation.
newdata.transformed = newdata %>>% retrafo(iris.logd)
getTaskData(newdata.transformed)

pred = predict(logmodel, newdata.transformed)
pred

# the inverter of the newly transformed data contains information specific
# to the newly transformed data. In the current case, that is just the
# new "truth" column for the new data.
inv.newdata = inverter(newdata.transformed)
invert(inv.newdata, pred)

Constant Inverters

The cpoLogTrafoRegr is a special case of TOCPO in that its inversion operation is constant: It does not depend on the new incoming data, so in theory it is not necessary to get a new inverter object for every piece of data that is being transformed. Therefore, it is possible to use the retrafo object for inversion in this case. However, the “truth” column will not be available in this case:

invert(retrafo(iris.logd), pred)

Whether a retrafo object is capable of performing inversion can be checked with the getCPOTrainedCapability() function. It returns a vector with named elements "retrafo" and "invert", indicating whether a CPOTrained is capable of performing retrafo or inversion. A 1 indicates that the object can perform the action and has an effect, a 0 indicates that the action would have no effect (but also throws no error), and a -1 means that the object is not capable of performing the action.

getCPOTrainedCapability(retrafo(iris.logd))  # can do both retrafo and inversion

getCPOTrainedCapability(inv)  # a pure inverter, can not be used for retrafo

General Inverters

As an example of a CPO that does not have a constant inverter, consider cpoRegrResiduals, wich fits a regression model on training data and returns the residuals of this fit. When performing prediction, the invert action is to add predictions by the CPO’s model to the incoming predictions made by a model trained on the residuals.

set.seed(123)  # for reproducibility
iris.resid = iris.regr %>>% cpoRegrResiduals("regr.lm")
getTaskData(iris.resid)

model.resid = train("regr.randomForest", iris.resid)

newdata.resid = newdata %>>% retrafo(iris.resid)
getTaskData(newdata.resid)  # Petal.Width are now the residuals of lm model predictions

pred = predict(model.resid, newdata.resid)
pred

# transforming this prediction back to compare
# it to the original 'Petal.Width'
inv.newdata = inverter(newdata.resid)
invert(inv.newdata, pred)

Retrafoless CPOs

Besides FOCPOs and TOCPOs, there are also “Retrafoless” CPOs (ROCPOs). These only perform operation in the training part of a machine learning pipeline, but in turn are the only CPOs that may change the number of rows in a dataset. The goal of ROCPOs is to change the number of data samples, but not to transform the data or target values themselves. Examples of ROCPOs are cpoUndersample, cpoSmote, and cpoSample.

sampled = iris %>>% cpoSample(size = 3)
sampled

There is no retrafo or inverter associated with the result. Instead, both of them are NULLCPO

retrafo(sampled)
inverter(sampled)

CPO Learners

Until now, the CPOs have been invoked explicitly to manipulate data and get retrafo and inverter objects. It is good to be aware of the data flows in a machine learning process involving preprocessing, but mlrCPO makes it very easy to automatize this. It is possible to attach a CPO to a Learner using attachCPO or the %>>%-operator. When a CPO is attached to a Learner, a CPOLearner is created. The CPOLearner performs the preprocessing operation dictated by the CPO before training the underlying model, and stores and uses the retrafo and inverter objects necessary during prediction. It is possible to attach compound CPOs, and it is possible to attach further CPOs to a CPOLearner to extend the preprocessing pipeline. Exported hyperparamters of a CPO are also present in a CPOLearner and can be changed using setHyperPars, as usual with other Learner objects.

Recreating the pipeline from General Inverters with a CPOLearner looks like the following. Note the prediction pred made in the end is identical with the one made above.

set.seed(123)  # for reproducibility
lrn = cpoRegrResiduals("regr.lm") %>>% makeLearner("regr.randomForest")
lrn

model = train(lrn, iris.regr)

pred = predict(model, newdata)
pred

It is possible to get the retrafo object from a model trained with a CPOLearner using the retrafo() function. In this example, it is identical with the retrafo(iris.resid) gotten in the example in General Inverters.

retrafo(model)

CPO Tuning

Since the hyperparameters of a CPO are present in a CPOLearner, is possible to tune hyperparameters of preprocessing operations. It can be done using mlr’s tuneParams() function and works identically to tuning common Learner-parameters.

icalrn = cpoIca() %>>% makeLearner("classif.logreg")

getParamSet(icalrn)

ps = makeParamSet(
    makeIntegerParam("ica.n.comp", lower = 1, upper = 8),
    makeDiscreteParam("ica.alg.typ", values = c("parallel", "deflation")))
# shorter version using pSS:
# ps = pSS(ica.n.comp: integer[1, 8], ica.alg.typ: discrete[parallel, deflation])

tuneParams(icalrn, pid.task, cv5, par.set = ps,
  control = makeTuneControlGrid(),
  show.info = FALSE)

Syntactic Sugar

Besides the %>>% operator, there are a few related operators which are short forms of operations that otherwise take more typing.

%<<% is similar to %>>% but works in the other direction. a %>>% b is the same as b %<<% a.
%<>>% and %<<<% are the %>>% or %<<% operators, combined with assignment. a %<>>% b is the same as a = a %>>% b. These operators perform the operations on their right before they do the assignment, so it is not necessary to use parentheses when writing a = a %>>% b %>>% c as a %<>>% b %>>% c.
%>|% and %|<% feed data in a CPO and gets the retrafo(). data %>|% a is the same as retrafo(data %>>% a). The %>|% operator performs the operation on its right before getting the retrafo, so it is not necessary to use parentheses when writing retrafo(data %>>% a %>>% b) as data %>|% a %>>% b.

Inspecting CPOs

As described before, it is possible to compose CPOs to create relatively complex preprocessing pipelines. It is therefore necessary to have tools to inspect a CPO pipeline or related objects.

The first line of attack when inspecting a CPO is always the print function. print(x, verbose = TRUE) will often print more information about a CPO than the ordinary print function. A shorthand alias for this is the exclamation point “!”. When verbosely printing a CPOConstructor, the transformation functions are shown. When verbosely printing a CPO, the constituent elements are separately printed, each showing their parameter sets.

cpoAsNumeric  # plain print
!cpoAsNumeric  # verbose print

cpoScale() %>>% cpoIca()  # plain print
!cpoScale() %>>% cpoIca()  # verbose print

When working with compound CPOs, it is sometimes necessary to manipulate a CPO inside a compound CPO pipeline. For this purpose, the as.list() generic is implemented for both CPO and CPOTrained for splitting a pipeline into a list of the primitive elements. The inverse is pipeCPO(), which takes a list of CPO or CPOTrained and concatenates them using composeCPO().

as.list(cpoScale() %>>% cpoIca())

pipeCPO(list(cpoScale(), cpoIca()))

CPOTrained objects contain information about the retrafo or inversion to be performed for a CPO. It is possible to access this information using getCPOTrainedState(). The “state” of a CPOTrained object often contains a $data slot with information about the expected input and output format (“ShapeInfo”) of incoming data, a slot for each of its hyperparameters, and a $control slot that is specific to the CPO in question. The cpoPca state, for example, contains the PCA rotation matrix and a vector for scaling and centering. The contents of a state’s $control object are described in a CPO’s help page.

repca = retrafo(iris.demo %>>% cpoPca())
state = getCPOTrainedState(repca)
state

It is even possible to change the “state” of a CPOTrained and construct a new CPOTrained using makeCPOTrainedFromState(). This is fairly advanced usage and only recommended for users familiar with the inner workings of the particular CPO. If we get familiar with the cpoPca CPO using the !-print (i.e. !cpoPca) to look at the retrafo function, we notice that the control$center and control$scale values are given to a call of scale(). If we want to create a new CPOTrained that does not perform centering or scaling during before applying the rotation matrix, we can change these values.

state$control$center = FALSE
state$control$scale = FALSE
nosc.repca = makeCPOTrainedFromState(cpoPca, state)

Comparing this to the original “repca” retrafo shows that the result of applying repca has generally smaller values because of the centering.

iris.demo %>>% repca

iris.demo %>>% nosc.repca

Special CPOs

There is a large and growing variety of CPOs that perform many different operations. It is advisable to browse through CPOs Built Into mlrCPO for an overview. To get a list of all built-in CPOs, use listCPO(). A few important or “meta” CPOs that can be used to influence the behaviour of other CPOs are described here.

NULLCPO

The value associated with “no operation” is the NULLCPO value. It is the neutral element of the %>>% operations, and the value of retrafo() and inverter() when there are otherwise no associated retrafo or inverter values.

NULLCPO

all.equal(iris %>>% NULLCPO, iris)
cpoPca() %>>% NULLCPO

CPO Multiplexer

The multiplexer makes it possible to combine many CPOs into one, with an extra selected.cpo parameter that chooses between them. This makes it possible to tune over many different tuner configurations at once.

cpm = cpoMultiplex(list(cpoIca, cpoPca(export = "export.all")))
!cpm

iris.demo %>>% setHyperPars(cpm, selected.cpo = "ica", ica.n.comp = 3)

iris.demo %>>% setHyperPars(cpm, selected.cpo = "pca", pca.rank = 3)

CPO Wrapper

A simple CPO with one parameter which gets applied to the data as CPO. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, this does not expose the argument’s parameters to the outside.

cpa = cpoWrap()
!cpa

iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoScale())

iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoPca())

Attaching the cpo applicator to a learner gives this learner a “cpo” hyperparameter that can be set to any CPO.

getParamSet(cpoWrap() %>>% makeLearner("classif.logreg"))

CBind CPO

cbind other CPOs as operation. The cbinder makes it possible to build DAGs of CPOs that perform different operations on data and paste the results next to each other. It is often useful to combine cpoCbind with cpoSelect to filter out columns that would otherwise be duplciated.

scale = cpoSelect(pattern = "Sepal", id = "first") %>>% cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca()
cbinder = cpoCbind(scale, scale.pca, cpoSelect(pattern = "Petal", id = "second"))

cpoCbind recognises that "scale" happens before "pca", but is also fed to the result directly. The verbose print draws a (crude) ascii-art graph.

!cbinder

iris.demo %>>% cbinder

Custom CPOs

Even though CPOs are very flexible and can be combined in many ways, it may be necessary to create completely custom CPOs. Custom CPOs can be created using the makeCPO() and related functions. “Building Custom CPOs” is a wide topic which has its own vignette.

Summary

CPOs are built using CPOConstructors by calling them like functions.
The available CPOConstructors can be found by using listCPO() or consulting the relevant vignette.
Verbose printing of CPOs and many related objects is available using the ! (exclamation mark) operator.
CPOs export hyperparameters that are accessible using getParamSet() and getHyperPars(), and mutable using setHyperPars(). Which parameters are exported can be controlled using the export parameter during construction.
They can be composed (composeCPO()), applied to data (applyCPO()) and attached to Learners (attachCPO()) using special functions for each of these operations, or using the general %>>% operator.
There are three fundamental kinds of CPO: FOCPO (Feature Operation CPOs), TOCPO (Target Operation CPOs) and ROCPO (Retrafoless CPOs). The first may only change feature columns, the second only target columns. While the last one may change both feature and target values and even the number of rows of a dataset, it does so with the understanding that new “prediction” data will not be transformed by it and is thus mainly useful for subsampling.
Data that was transformed by a (non-Retrafoless) CPO has a retrafo-CPOTrained object associated with it that can be retrieved using retrafo() and used to transform new prediction data in similar way as the original training data.
CPOTrained objects can themselves be composed using composeCPO or %>>%, although it is only recommended to compose CPOTrained objects in the same order as they were created, and only if they were created in the same preprocessing pipeline.
CPOTrained objects can be inspected using getCPOTrainedState(), and re-built with changed state using makeCPOTrainedFromState().
Data that was transformed by a TOCPO also has an inverter associated with it, which can be retrieved using inverter(). An inverter is also created during application of a retrafo CPOTrained.
While retrafo CPOTrained are created during training and used on every prediction data set, inverter CPOTrained are created anew during each CPO and retrafo-CPOTrained application and are closely associated with the data that they were created with.
CPOTrained objects associated with data are stored in their “attributes” and are automatically chained when more CPOs are applied. clearRI() is used to remove the associated CPOTrained objects and prevent this chaining.
CPOs can be attached to Learners to get CPOLearners which automatically transform training and prediction data and perform prediction inversion.
CPOLearners have the Learner’s and the CPO’s hyperparameters and can thus be manipulated using setHyperPars(), and can be tuned using tuneParams().
Notable CPOs are NULLCPO (the neutral element of %>>%), cpoMultiplex, cpoWrap, and cpoCbind.
It is possible to create custom CPOs using makeCPO and similar functions. These are described in their own vignette.