The CPO
s built into mlrCPO
can be used for many different purposes, and can be combined to form even more powerful transformation operations. However, in some cases, it may be necessary to define new “custom” CPO
s that perform a certain task; either because a preprocessing method is not (yet) defined as a builtin CPO
, or because some operation very specific to the task at hand needs to be performed.
For this purpose, mlrCPO
offers a very powerful interface for the creation of new CPO
s. The functions and methods described here are also the methods used internally to create mlrCPO
’s builtin CPO
s. Therefore, to learn the art of defining CPO
s, it is also possible to look at the mlrCPO
source tree in files starting with “CPO_
” for example CPO
definitions.
There are three types of CPO
: “Feature Operation CPO
s” (FOCPOs) which are only allowed to change feature columns of incoming data, and which are the most common CPO
s; “Target Operation CPO
s” (TOCPOs) that change only target columns, and “Retrafoless CPO
s” (ROCPOs) that may add or delete rows to a data set, but only during training. Conceptually, ROCPOs are the simplest CPO
s, followed by FOCPOs and the even more complicated TOCPOs. The commonalities of all CPO
defining functions will be described first, followed by the different CPO
types in order of growing complexity.
To create a CPOConstructor
that can then be used to create a CPO
, a makeCPO*()
function needs to be called. There are five functions of this kind, differing by what kind of CPO
they create and how much flexibility (at the cost of simplicity) they offer the user:
CPO type |
makeCPO*() functions |
---|---|
FOCPO | makeCPO() , makeCPOExtendedTrafo() |
TOCPO | makeCPOTargetOp() , makeCPOExtendedTargetOp() |
ROCPO | makeCPORetrafoless() |
Each of these functions takes a “name” for the new CPO
, settings for the parameter set to be used, settings for the format in which the data is supposed to be provided, data property settings, the packages to load, CPO
type specific settins, and finally the transformation functions.
Each CPO
has a “name” that is used for representation when printing, and as the default prefix for hyperparameters. cpoPca
, for example, has the name “pca
”:
!cpoPca()
The name is set using the cpo.name
parameter of the make*()
functions.
The ParSet
used by the CPO
are given as the second par.set
parameter. These parameters must be either constructed using makeParamSet()
from the ParamHelpers
package, or using the pSS()
function for a more concise ParSet
definition. The given parameters will then be the function parameters of the CPOConstructor
, and will by default be exported as hyperparameters (prefixed with the cpo.name
).
It is possible to use the default parameter values of the par.set
as defaults, or to give a par.vals
list of default values. If par.vals
is given, the defaults within par.set
are completely ignored. Parameters that have a default value are set to this value upon construction if no value is given by the user.
Not all available parameters of a CPO
need to be exported as hyperparameters. Which parameters are exported can be set during CPO
construction, but the default exported parameters can be set using export.params
. This can either be a character
vector of the names of parameters to export, or TRUE
(default, export all) or FALSE
(no export).
Different CPO
operations may want to operate on the data in different forms: as a Task
, as a data.frame
with or without the target column, etc. The CPO
framework can perform some conversion of data to fit different needs, which is set up by the value of fthe dataformat
parameter, together with dataformat.factor.with.ordered
. While dataformat
has slightly different effects on different CPO
types, typically its values and effects are:
dataformat |
Effect |
---|---|
"task" |
Data is given as a Task ; if the data to be transformed is a data.frame , it is converted to a cluster task before handing it to the transformation functions. |
"df.all" |
Data is given as a data.frame , with the target column included. |
"df.features" |
Data is given as a data.frame , the target is given as a separate data.frame . |
"split" |
Data is given as a named list with slots $numeric , $factor , $ordered , $other , each of which contains a data.frame with the columns of the respective type. If dataformat.factor.with.ordered is TRUE , the $ordered slot is not present, and ordered features are instead given to $factor as well. Features that are not any of these types are given to "other" . The target is given as a separate data.frame . |
"factor" , "ordered" , "numeric" |
Only the data from columns of the named type are given to the transformatin functions as a data.frame . The target columns are given as a separate data.frame . |
Another parameter influencing the data format is the fix.factors
flag which controls whether factor levels of prediction data need to be set to be the same as during training. If it is TRUE
, previously unseen factor levels are set to NA
during prediction.
mlr
and mlrCPO
make it possible to specify what kind of data a CPO
or a Learner
can handle. However, since CPO
s may change data to be more or less fitting for a certain Learner
, a CPO
must announce not only what data it can handle, but also how it changes the capabilities of the machine learning pipeline in which it is envolved. During construction, four parameters related to properties can be given.
The properties.data
parameter defines what properties of feature data the CPO
can handle; it must be a subset of "numerics"
, "factors"
, "ordered"
, and "missings"
. Typically, only the "missings"
part is interesting since CPO
s that only handle a subset of types will usually just ignore columns of other types.
The properties.target
parameter defines what Task
properties related to the task type and the target column a CPO
can handle. It is a subset of "cluster"
, "classif"
, "multilabel"
, "regr"
, "surv"
(so far defining the task type a CPO
can handle), "oneclass"
, "twoclass"
, "multiclass"
(properties specific to classif
Task
s). Most FOCPOs do not care about the task type, while TOCPOs may only support a single task type.
properties.adding
lists the properties that a CPO adds to the capabilities of a machine learning pipeline when it is executed before it, while properties.needed
lists the properties needed from the following pipeline. cpoDummyEncode
, for example, a CPO
that converts factors and ordereds to numerics, has properties.adding == c("factors", "ordered")
and properties.needed == "numerics"
. The many imputation CPO
s have properties.adding == "missings"
. Usually these are only a subset of the possible properties.data
states, but for TOCPOs this may also be any of "oneclass"
, "twoclass"
, "multiclass"
. Note that neither properties.adding
nor properties.needed
may be any task type, even for TOCPOs that perform task conversion.
.sometimes
PropertiesThe CPO
framework will check that a CPO
only adds and removes the kind of data properties that it declared in properties.adding
and properties.needed
. It will also check that composition of CPO
s, and attachment of CPO
s to Learner
s, work out. Sometimes, however, it is necessary to treat a CPO
like it does a certain manipulation (removing missings
, for example) in some cases, while not in others. A CPO
that only imputes missings in numeric columns should be treated as properties.adding == "missings"
when is is attached to a Learner
, and the Learner
should gain the "missings"
property. However, when data that has missings in its factorial columns is given to this CPO
, the CPO
framework will complain that the CPO
that declared "missings"
in properties.adding
returned data that still had missing values in it. The solution to this dilemma is to suffix some properties with “.sometimes
” when declaring them in properties.adding
and properties.needed
. When composing CPO
s, and when checking data returned by a CPO
, the framework will then be as lenient as possible. In the given example, properties.adding == "missings"
will be assumed when attaching the CPO
to a Learner
, while properties.adding == character(0)
is assumed when checking the CPO
’s output (and missing values that were not imputed are therefore forgiven).
The single packages
parameter can be set to a character
vector listing packages necessary for a CPO
to work. This is mostly useful when a CPO
should be defined as part of a package or script to be distributed. The listed package will not automatically be attached, it will only be loaded. This means that a function exported by a package still needs to be called using ::
. The benefit of declaring it in packages
is that it will be loaded upon construction of a CPO
, which means that a user will get immediate feedback about whether the CPO
can be used or needs more packages to be installed.
The different types of CPO
, and the different make*()
functions, need different transformation functions to be defined. The principle behind these functions is alwasy the same, however: The CPO
framework takes input data, transforms it according to dataformat
, checks it according to properties.data
and properties.target
, and then gives it to one or more user-given transformation function. The transformation function must then usually create a control object containing information about the data to be used later, or transform the incoming data and return the transformation result (or both). The CPO
framework then checks the transformed data according to properties.adding
and properties.needed
and gives it back to the CPO
user.
Transformation functions are given to parameters starting with cpo.
. They can either be given as functions, or as “headless” functions missing the function(...)
part. In the latter case, the headless function must be a succession of expressions enclosed in curly braces ({
, }
) and the necessary function head is added by the CPO
framework. The functions often take a subset of data
, target
, control
, or control.invert
parameters, in addition to all parameters as given in par.set
.
The communication between transformation functions, e.g. giving the PCA matrix to its retrafo function, usually happens via “control” objects created by these functions and then given as parameter to other functions. In some cases, however, it may be more elegant to create a new function (e.g. a cpo.retrafo
function) within another function as a “closure” (in the general, not R specific, sense) with access to all the outer functions variables. The CPO
framework makes this possible by allowing a function to be given instead of a “control” object. The function which would usually receive this control object must then be given as NULL
in the makeCPO*()
call.
Retrafoless CPO
s, or ROCPOs, are conceptually the simplest CPO
type, since they do not create CPOTrained
objects and therefore only need one transformation function: cpo.trafo
. The value of the dataformat
parameter may only be either "df.all"
or "task"
, resulting in either a data.frame
(consisting all columns, including the target column) or a Task
being given to the cpo.trafo
function. cpo.trafo
should have the parameters data
(receiving the data as either a Task
or data.frame
), target
(receiving the names of target columns in the data), and any parameter as given to par.set
. The return value of cpo.trafo
must be the transformed data, in the same format (data.frame
or Task
) as given as input.
Since a ROCPO only transforms incoming data during training, it should not do any transformation of target or feature values that would make it necessary to repeat this action during prediction. It may, for example, be used for subsampling a classification task to balance target classes, but it should not change the levels or values of given data rows.
The following is an example of a simplified version of the cpoSample
CPO
, which takes one parameter fraction
and then subsamples a fraction
part of incoming data without replacement:
= makeCPORetrafoless("exsample", # nolint
xmpSample pSS(fraction: numeric[0, 1]),
dataformat = "df.all",
cpo.trafo = function(data, target, fraction) {
= round(nrow(data) * fraction)
newsize = sample(nrow(data), newsize)
row.indices
data[row.indices, ]
})
= xmpSample(0.01) cpo
%>>% cpo iris
It is possible to give the cpo.trafo
as headless transformation function by just leaving out the function header. This can save a lot of boilerplate code when there are many parameters present, or when many transformation functions need to be given. The resulting CPO
is completely equivalent to the one given above.
= makeCPORetrafoless("exsample", # nolint
xmpSampleHeadless pSS(fraction: numeric[0, 1]),
dataformat = "df.all",
cpo.trafo = {
= round(nrow(data) * fraction)
newsize = sample(nrow(data), newsize)
row.indices
data[row.indices, ] })
FOCPOs are created with either the makeCPO()
function, or the makeCPOExtendedTrafo()
function. The former conceptually separates training from transformation, the latter separates transformation of training data from transformation of prediction data.
makeCPO()
In principle, a FOCPO needs a function that “trains” a control object depending on the data (cpo.train
), and another function that uses this control object, and new data, to perform the preprocessing operation (cpo.retrafo
). The cpo.train
-function must return a “control” object which contains all information about how to transform a given dataset. cpo.retrafo
takes a (potentially new!) dataset and the “control” object returned by cpo.trafo
, and transforms the new data according to plan.
In contrast to makeCPORetrafoless()
, the dataformat
parameter of makeCPO()
can take all values described in the section Data Format. The cpo.train
function takes the arguments data
, target
, and any other parameter described in param.set
. The data
value is the incoming data as a Task
, a data.frame
with or without the target column, or a list of data.frames
of different column types, according to dataformat
. The target
value is a character
vector of target names if dataformat
is "task"
or "df.all"
, or a data.frame
of the target columns otherwise.
The cpo.train
function’s return value is treated as a control
object and given to the cpo.retrafo
function. Its parameters are data
, control
, and any parameters in par.set
. The format of the data given to the data
parameter is according to dataformat
, with the exception that if dataformat
is either "task"
or "df.all"
, it will be treated here as if its value were "df.features"
. This is because the cpo.retrafo
function is sometimes called with prediction data which does not have any target column at all.
It follows the simplified definition of a CPO
that removes the numeric columns of smallest variance, returning a dataset of only n.col
numeric columns. The dataformat
variable is set to "numeric"
, so that only numeric columns are given to the CPO
’s transformation functiosn; factorial columns are ignored. In cpo.trafo
, calculates the variance of each of the data’s columns, and in cpo.retrafo
it subsets the data according to these variances. Since cpo.retrafo
may also be called during prediction with new data, the variance must not be calculated in cpo.retrafo
–this could lead to cpo.retrafo
filtering out different columns from cpo.trafo
. This example also prints out which of its functions are being called.
= makeCPO("exemplvar", # nolint
xmpFilterVar pSS(n.col: integer[0, ]),
dataformat = "numeric",
cpo.train = function(data, target, n.col) {
cat("*** cpo.train ***\n")
sapply(data, var, na.rm = TRUE)
},cpo.retrafo = function(data, control, n.col) {
cat("*** cpo.retrafo ***\n")
cat("Control:\n")
print(control)
cat("\n")
= order(-control) # columns, ordered greatest to smallest var
greatest seq_len(n.col)]]
data[greatest[
})
= xmpFilterVar(2) cpo
(Note that the function heads are optional.)
When the CPO
is called with a dataset, the cpo.train
function is called first, creating the control object which is then given to cpo.retrafo
.
trafd = head(iris) %>>% cpo) (
Note that the two columns of the entire iris
dataset with the greatest variance are Petal.Length
and Sepal.Length
:
head(iris %>>% cpo)
However, when applying the retrafo()
of trafd
to the entire dataset, the same columns are filtered out as they were in the first transformation: Sepal.Width
and Sepal.Length
. When the retrafo()
is used, cpo.train
is not called; instead, the control
object saved inside the retrafo is used.
head(iris %>>% retrafo(trafd))
It is also possible to inspect the CPOTrained
object to see that the control
is there:
getCPOTrainedState(retrafo(trafd))
Instead of returning the control
object, cpo.train
may also return the cpo.retrafo
function. This may be more succinct to write if there are many little pieces of information from the cpo.train
run that the cpo.retrafo
function should have access to.
When cpo.retrafo
is given functionally, it should be a function with only one parameter: the newly incoming data. It can access the values of the par.set
parameters from its encapsulating environment in cpo.train
.
Note that the data
and target
values given to cpo.train
are deleted after the cpo.train
call, so cpo.retrafo
does not have access to it. In fact, the CPO
framework will give a warning about this.
= makeCPO("exemplvar.func", # nolint
xmpFilterVarFunc pSS(n.col: integer[0, ]),
dataformat = "numeric",
cpo.retrafo = NULL,
cpo.train = function(data, target, n.col) {
cat("*** cpo.train ***\n")
= sapply(data, var, na.rm = TRUE)
ctrl function(x) { # the data is given to the only present parameter: 'x'
cat("*** cpo.retrafo ***\n")
cat("Control:\n")
print(ctrl)
cat("\ndata:\n")
print(data) # 'data' is deleted: NULL
cat("target:\n")
print(target) # 'target' is deleted: NULL
= order(-ctrl) # columns, ordered greatest to smallest var
greatest seq_len(n.col)]]
x[greatest[
}
})
= xmpFilterVarFunc(2) cpo
(Note that the function heads are optional.)
trafd = head(iris) %>>% cpo) (
The CPOTrained
state for a functional CPO
is the environment of the retrafo function. It contains the “ctrl
” variable defined during training, the parameters given to cpo.train
, and the cpo.retrafo
function itself. Note that data
and target
are deleted and replaced by different values.
getCPOTrainedState(retrafo(trafd))
“Stateless” CPO
s are CPO
s that perform the same action during transformation of training and prediction data, independent from information during training. An example would be a CPO
that converts all its columns to numeric
columns. When a FOCPO does not need a state, the cpo.train
parameter of makeCPO()
can be set to NULL
. The cpo.retrafo
function then has no control
paramter and instead only a data
and any par.set
parameter. The as.numeric
-CPO
could be written as the following:
= makeCPO("asnum", # nolint
xmpAsNum cpo.train = NULL,
cpo.retrafo = function(data) {
data.frame(lapply(data, as.numeric))
})
= xmpAsNum() cpo
(Note that the function head is optional.)
trafd = head(iris) %>>% cpo) (
The “state” of the CPOTrained
object thus created only contains information about the incoming data shape, to make sure that the CPOTrained
object is only used on conforming data (as doing otherwise would indicate a bug).
getCPOTrainedState(retrafo(trafd))
makeCPOExtendedTrafo()
Sometimes it is advantageous to have the training operation return the transformed data right away. PCA, for example, returns the rotation matrix and the transformed data; it would be a waste of time to only return the rotation matrix in a cpo.train
function and apply it on the training data in cpo.retrafo
. The makeCPOExtendedTrafo()
function works very much like makeCPO()
, with the difference that it has a cpo.trafo
instead of a cpo.train
function parameter. The cpo.trafo
takes the same parameters as cpo.train
, but returns the transformed data instead of a control object. The control object needs to be created additionally, as a variable by the cpo.trafo
function. The CPO
framework takes the value of a variable named control
inside the cpo.trafo
function and gives it to the cpo.retrafo
function.
The following is a simplified version of the cpoPca
CPO
, which does not scale or center the data.
= makeCPOExtendedTrafo("simple.pca", # nolint
xmpPca pSS(n.col: integer[0, ]),
dataformat = "numeric",
cpo.trafo = function(data, target, n.col) {
cat("*** cpo.trafo ***\n")
= prcomp(as.matrix(data), center = FALSE, scale. = FALSE, rank = n.col)
pcr # save the rotation matrix as 'control' variable
= pcr$rotation
control $x
pcr
},cpo.retrafo = function(data, control, n.col) {
cat("*** cpo.retrafo ***\n")
# rotate the data by the rotation matrix
as.matrix(data) %*% control
})
= xmpPca(2) cpo
When this CPO
is applied to data, only the cpo.trafo
function is called.
trafd = head(iris) %>>% cpo) (
When the retrafo CPOTrained
is used, the cpo.retrafo
function is called, making use of the rotation matrix.
tail(iris) %>>% retrafo(trafd)
The rotation matrix can be inspected using getCPOTrainedState
.
getCPOTrainedState(retrafo(trafd))
As with makeCPO()
, makeCPOExtendedTrafo()
makes it possible to define functional CPO
s. Instead of returning a cpo.retrafo
function, the cpo.retrafo
function needs to be defined as a variable, instead of a “control
” variable. Like in makeCPO()
, the cpo.retrafo
parameter of makeCPOExtendedTrafo()
must then be NULL
. The PCA example above could thus also be written as
= makeCPOExtendedTrafo("simple.pca.func", # nolint
xmpPcaFunc pSS(n.col: integer[0, ]),
dataformat = "numeric",
cpo.retrafo = NULL,
cpo.trafo = function(data, target, n.col) {
cat("*** cpo.trafo ***\n")
= prcomp(as.matrix(data), center = FALSE, scale. = FALSE, rank = n.col)
pcr # save the rotation matrix as 'control' variable
= function(data) {
cpo.retrafo cat("*** cpo.retrafo ***\n")
# rotate the data by the rotation matrix
as.matrix(data) %*% pcr$rotation
}$x
pcr
})
= xmpPcaFunc(2) cpo
trafd = head(iris) %>>% cpo) (
This also serves as an example of the disadvantages of a functional CPO
: Since the CPO
state contains all the information contained in the cpo.trafo
call (except the data
and target
variables), it may take up more memory than needed. For this CPO
, the state contains the pcr
variable which contains the transformed training data in its $x
slot. If the training data is a very large dataset, this would result in CPO
states that take up a lot of working memory.
getCPOTrainedState(retrafo(trafd))$pcr$x
TOCPOs are more complicated than FOCPOs, since they potentially need to operate on data at three different points: During initial training, during the re-transformation for new prediction data, and during the inversion of predictions made by a model trained on transformed data. Similarly to makeCPO()
, makeCPOTargetOp()
splits these operations up into functions that create “control
” objects, and functions that do the actual transformation. makeCPOExtendedTargetOp()
, on the other hand, gives the user more flexibility at the price of the user having to make sure that transformation and retransformation perform the same operation–similarly to makeCPOExtendedTrafo()
for FOCPOs.
In contrast to FOCPOs, TOCPOs can only operate on one type of Task
. Therefore, the properties.target
parameter of makeCPO*TargetOp()
must contain exactly one Task
type ("cluster"
, "classif"
, "regr"
, "surv"
, "multilabel"
) and possibly some more task properties (currently only "oneclass"
, "twoclass"
, "multiclass"
if the Task
type is "classif"
).
It is possible to write TOCPOs that perform conversion of Task
types. For that, the task.type.out
parameter must be set to the Task
type that the CPO
converts the data to. If conversion happens, the transformation functions need to return target data fit for the task.type.out
Task
type.
properties.adding
and properties.needed
should not be any Task
type, even when conversion happens. Only if one of the task types has additional properties–currently only the "oneclass"
, "twoclass"
, "multiclass"
properties of classification Task
s–should these additional properties be listed in properties.adding
or properties.needed
.
predict.type
mlr
makes it possible for Learner
s to make different kinds of prediction. Usually they can predict a “response”, making their best effort to predict the true value of a task target. Many Learner
types can predict a probability when their predict.type
is set to "prob"
, returning a data.frame
of their estimated probability distribution over possible responses. For regression Learner
s, predict.type
can be "se"
for the Learner
to predict its estimated standard error of their response prediction.
When TOCPOs invert these predictions, they may
predict.type
predictions they can performpredict.type
they require from the underlying Learner
to make this predict.type
prediction.This is done using the predict.type.map
parameter of makeCPO*TargetOp()
. It is a named list
or named character
vector with the names indicating the supported predict.type
s, and the values indicating the required underlying predictions. For example, if a TOCPO can perform "response"
and "se"
prediction, and to predict "response"
the underlying Learner
must also perform "response"
prediction, but for "se"
prediction it must perform "prob"
prediction, the predict.type.map
would have the value
c(response = "response", se = "prob")
makeCPOTargetOp()
makeCPOTargetOp()
has a cpo.train
and cpo.retrafo
function parameter that work similarly to the ones of makeCPO()
. In contrast to makeCPO()
, however, cpo.retrafo
must return the target data instead of the feature data. The data
and target
parameters of cpo.retrafo
get the same data as they get in a FOCPO created with makeCPO()
, with the exception that if dataformat
is "task"
or "df.all"
, the target
parameter will receive the whole input data in form of a Task
or data.frame
(while the data
argument, as in a FOCPO, will receive only the feature data.frame
). The return value of cpo.retrafo
for a TOCPO must always be in the same format as the input target
value: a data.frame
with the manipulated target values when dataformat
is anything besides "task"
or "df.all"
, or a Task
or data.frame
of all data (with non-target columns unmodified) otherwise.
Inversion of predictions is performed using the functions cpo.train.invert
and cpo.invert
. cpo.train.invert
takes a data
and a control
argument, and any arguments declared in the par.set
. It is called whenever new data is fed into the CPO
or its retrafo CPOTrained
, and creates a CPOTrained
state that is used to invert the prediction done on this new data. The control
argument takes the value returned by the cpo.train
function upon initial training, and the data
argument is the new data for which to prepare the CPOTrained
inverter. It has the form dictated by dataformat
, with the exception that "task"
and "df.all"
dataformat
are handled as "df.feature"
; this is necessary since the new data could be a data.frame
of data with unknown target.
The following is an example of a TOCPO that trains a classification Learner
on a binary classification Task
and changes it to a Task
of whether or not the Learner
predicted the truth for a given data line correctly. (Real-world applications would probably need to take some precautions against overfitting.) In its cpo.train
step, the given Learner
is trained on the incoming data and the resulting WrappedModel
object is returned as the “control
” object. This is given to the cpo.retrafo
function, which performs prediction and creates a new classification Task
with the match / mismatch between model prediction and ground truth as target. When an external Learner
is trained on data that was preprocessed like this, its prediction will be whether the CPO
-internal Learner
can be trusted to predict a given data row. To “invert” this, i.e. to get the actual prediction, the cpo.invert
function needs to have the internal Learner
’s prediction as well as the prediction made by the external Learner
. The former is provided by cpo.train.invert
, which uses the WrappedModel
to make a prediction on the new data, and given as control.invert
to cpo.invert
. The latter is the target
data given to cpo.invert
. This example CPO
supports inverting both "response"
and "prob"
predict.type
predictions, as declared in the predict.type.map
argument. The actual predict.type
to invert is given to cpo.invert
as an argument.
= makeCPOTargetOp("xmp.meta", # nolint
xmpMetaLearn pSS(lrn: untyped),
dataformat = "task",
properties.target = c("classif", "twoclass"),
predict.type.map = c(response = "response", prob = "prob"),
cpo.train = function(data, target, lrn) {
cat("*** cpo.train ***\n")
= setPredictType(lrn, "prob")
lrn train(lrn, data)
},cpo.retrafo = function(data, target, control, lrn) {
cat("*** cpo.retrafo ***\n")
= predict(control, target)
prediction = getTaskTargetNames(target)
tname = getTaskData(target)
tdata = factor(prediction$data$response == prediction$data$truth)
tdata[[tname]] makeClassifTask(getTaskId(target), tdata, tname, positive = "TRUE",
fixup.data = "no", check.data = FALSE)
},cpo.train.invert = function(data, control, lrn) {
cat("*** cpo.train.invert ***\n")
predict(control, newdata = data)$data
},cpo.invert = function(target, control.invert, predict.type, lrn) {
cat("*** cpo.invert ***\n")
if (predict.type == "prob") {
= as.matrix(control.invert[grep("^prob\\.", names(control.invert))])
outmat = outmat[, c(2, 1)]
revmat * target[, "prob.TRUE", drop = TRUE] +
outmat * target[, "prob.FALSE", drop = TRUE]
revmat else {
} stopifnot(levels(target) == c("FALSE", "TRUE"))
= as.numeric(control.invert$response)
numeric.prediction = ifelse(target == "TRUE",
numeric.res
numeric.prediction,3 - numeric.prediction)
factor(levels(control.invert$response)[numeric.res],
levels(control.invert$response))
}
})
= xmpMetaLearn(makeLearner("classif.logreg")) cpo
To show the inner workings of this CPO
, the following example data is used.
set.seed(12)
= makeResampleInstance(hout, pid.task)
split = subsetTask(pid.task, split$train.inds[[1]])
train.task = subsetTask(pid.task, split$predict.inds[[1]]) test.task
It can be instructive to watch the cat()
output of this CPO
to see which function gets called at what point in the lifecycle. The cpo.train
function is called first to create the control
object. The Task
is transformed in cpo.retrafo
. Also cpo.train.invert
is called, since an inverter
attribute is attached to the returned trafo.
= train.task %>>% cpo
trafd attributes(trafd)
The values of the target column (“diabetes”) of the result can be compared with the prediction of a "classif.logreg"
Learner
on the same data:
head(getTaskData(trafd))
= train(makeLearner("classif.logreg", predict.type = "prob"), train.task)
model head(predict(model, train.task)$data[c("truth", "response")])
When new data is transformed using the retrafo CPOTrained
, another inverter
attribute is created, and hence cpo.train.invert
is called again. Since the target column of the test.task
in the following example is also transformed, the cpo.retrafo
function is called.
= test.task %>>% retrafo(trafd)
retr attributes(retr)
In a real world application, it would be possible for the new incoming data to have unknown target values. In that case, no target column would need to be changed, and cpo.retrafo
is not called. The resulting data, retr.df
, equals the input data with a retrafo
attribute added.
= getTaskData(test.task, target.extra = TRUE)$data %>>% retrafo(trafd)
retr.df names(attributes(retr.df))
The invert functionality can be demonstrated by making a prediction with an external model.
= train("classif.svm", trafd)
ext.model = predict(ext.model, retr)
ext.pred = invert(inverter(retr), ext.pred)
newpred performance(newpred)
It may also be instructive to attach the xmpMetaLearn
CPO
to a Learner
to see which functions get called during training and prediction of a TOCPO-Learner
. Since the Learner
does not do inversion of the training data, a CPOTrained
for inversion is not created during training, and cpo.train.invert
is hence not called. Only cpo.train
(for control object creation) and cpo.retrafo
(target value change) are called. During prediction, the input data is used to create an (internally used) inversion CPOTrained
which promptly gets used by the prediction made by "classif.svm"
. Hence both cpo.train.invert
and cpo.invert
are called in succession.
= cpo %>>% makeLearner("classif.svm")
cpo.learner = train(cpo.learner, train.task) cpo.model
= predict(cpo.model, test.task)
lrnpred performance(lrnpred)
See Postscriptum for an evaluation of xmpMeatLearn
’s performance.
Just like for FOCPOs, it is possible to create functional TOCPOs. In the case of makeCPOTargetOp()
, it is possible to have cpo.train
create cpo.retrafo
and cpo.train.invert
, instead of giving them to makeCPOTargetOp()
directly. Just as in makeCPO
, these functions can then access the state of their environment in the cpo.train
call and hence have neither a control
argument, nor any arguments for the par.set
parameters. Since cpo.train
must in this case create two functions, these functions only need to be defined within cpo.train
, the return value is ignored.
Note that cpo.retrafo
and cpo.train.invert
must either be both functional or both object based.
It is furthermore possible to return a cpo.invert
function by cpo.train.invert
, instead of giving it to makeCPOTargetOp()
. As above, the returned function should not have any parameters for the ones given in par.set
, and should not have a control.invert
. cpo.invert
can be functional or not, independently of whether cpo.retrafo
and cpo.train.invert
are functional.
As in makeCPO()
, all functions that are given functionally must be explicitly set to NULL
in the makeCPOTargetOp()
call.
The xmpMetaLearn
example above with functional cpo.retrafo
, cpo.train.invert
and cpo.invert
would look like the following:
= makeCPOTargetOp("xmp.meta.fnc", # nolint
xmpMetaLearn pSS(lrn: untyped),
dataformat = "task",
properties.target = c("classif", "twoclass"),
predict.type.map = c(response = "response", prob = "prob"),
# set the cpo.* parameters not needed to NULL:
cpo.retrafo = NULL, cpo.train.invert = NULL, cpo.invert = NULL,
cpo.train = function(data, target, lrn) {
cat("*** cpo.train ***\n")
= setPredictType(lrn, "prob")
lrn = train(lrn, data)
model = function(data, target) {
cpo.retrafo cat("*** cpo.retrafo ***\n")
= predict(model, target)
prediction = getTaskTargetNames(target)
tname = getTaskData(target)
tdata = factor(prediction$data$response == prediction$data$truth)
tdata[[tname]] makeClassifTask(getTaskId(target), tdata, tname, positive = "TRUE",
fixup.data = "no", check.data = FALSE)
}= function(data) {
cpo.train.invert cat("*** cpo.train.invert ***\n")
= predict(model, newdata = data)$data
prediction function(target, predict.type) { # this is returned as cpo.invert
cat("*** cpo.invert ***\n")
if (predict.type == "prob") {
= as.matrix(prediction[grep("^prob\\.", names(prediction))])
outmat = outmat[, c(2, 1)]
revmat * target[, "prob.TRUE", drop = TRUE] +
outmat * target[, "prob.FALSE", drop = TRUE]
revmat else {
} stopifnot(levels(target) == c("FALSE", "TRUE"))
= as.numeric(prediction$response)
numeric.prediction = ifelse(target == "TRUE",
numeric.res
numeric.prediction,3 - numeric.prediction)
factor(levels(prediction$response)[numeric.res],
levels(prediction$response))
}
}
} })
The example given above is a relatively elaborate TOCPO which needs information from the prediction data to perform inversion. Many simpler applications of target transformation do not need this information if their inversion step is independent of this data. It is possible to declare such a TOCPO using the constant.invert
flag in makeCPOTargetOp()
. If constant.invert
is set to TRUE
, the cpo.train.invert
argument must be explicitly set to NULL
. cpo.train
still needs to have a control.invert
argument; it is set to the value returned by cpo.train
.
The following example is a TOCPO for regression Task
s that centers target values during training. After prediction, the data is inverted by adding the original mean of the training data to the predictions. This inversion operation does not need any information about the prediction data going in, so the TOCPO can be declared constant.invert
.
The cpo.retrafo
function is also called when new prediction data with a target column is transformed (as during model validation). In that case, the mean of the training data column is subtracted. Therefore the mean generated by cpo.train
needs to be used in cpo.retrafo
(i.e. the control
value), not the mean of the target
data present.
= makeCPOTargetOp("xmp.center", # nolint
xmpRegCenter constant.invert = TRUE,
cpo.train.invert = NULL, # necessary for constant.invert = TRUE
dataformat = "df.feature",
properties.target = "regr",
cpo.train = function(data, target) {
# control value is just the mean of the target column
mean(target[[1]])
},cpo.retrafo = function(data, target, control) {
# subtract mean from target column in retrafo
1]] = target[[1]] - control
target[[
target
},cpo.invert = function(target, predict.type, control.invert) {
+ control.invert
target
})
= xmpRegCenter() cpo
To illustrate this CPO
, the following data is used:
= subsetTask(bh.task, 150:155)
train.task getTaskTargets(train.task)
= subsetTask(bh.task, 156:160)
predict.task getTaskTargets(predict.task)
The target column of the task after transformation has a mean of 0.
= train.task %>>% cpo
trafd getTaskTargets(trafd)
When applying the retrafo CPOTrained
to a new task, the mean of the training task target column is subtracted.
getTaskTargets(predict.task)
= retrafo(trafd)
retr = predict.task %>>% retr
predict.traf getTaskTargets(predict.traf)
When inverting a regression prediction, the mean of the training data target column is added to the prediction.
= train("regr.lm", trafd)
model = predict(model, predict.traf)
pred pred
invert(inverter(predict.traf), pred)
Since "regr.lm"
is translation invariant and deterministic, the prediction equals the prediction made without centering the target:
= train("regr.lm", train.task)
model predict(model, predict.task)
A special property of constant.invert
TOCPOs is that their retrafo CPOTrained
can also be used for inversion. This is the case since the tight coupling of inversion operation to the data used to create the prediction is not necessary when the inversion is actually independent of this data. This is indicated by getCPOTrainedCapability()
returning a vector with the "invert"
capability set to 1
. However, when using the retrafo CPOTrained
for inversion, the “truth” column is absent from the inverted prediction.
getCPOTrainedCapability(retr)
invert(retr, pred)
Just as above, constant.invert
TOCPOs can be functional. For this, the cpo.train
function must declare both a cpo.retrafo
and a cpo.invert
variable which perform the requested operations. These functions have no control
or control.invert
parameter, and no parameters pertaining to par.set
.
Very simple target column operations that operate on a row-by-row basis without needing information e.g. from training data, can be declared as “stateless”. Similarly to makeCPO()
, when cpo.train
parameter is set to NULL
, no control object is created for a CPOTrained
. Furthermore, a stateless TOCPO must always have constant.invert
set as well. Therefore, only cpo.retrafo
and cpo.invert
are given as functions, both without a control
or control.invert
argument. One example is a TOCPO that log-transforms the target column of a regression task, and exponentiates the predictions made from this during inversion. (A better inversion would take the "se"
prediction into account, see cpoLogTrafoRegr
.)
= makeCPOTargetOp("log.regr", # nolint
xmpLogRegr constant.invert = TRUE,
properties.target = "regr",
cpo.train = NULL, cpo.train.invert = NULL,
cpo.retrafo = function(data, target) {
1]] = log(target[[1]])
target[[
target
},cpo.invert = function(target, predict.type) {
exp(target)
})
= xmpLogRegr() cpo
The CPO
takes the logarithm of the task target column both during training and when using the retrafo CPOTrained
.
= train.task %>>% cpo
trafd getTaskTargets(trafd)
= retrafo(trafd)
retr = predict.task %>>% retr
predict.traf getTaskTargets(predict.traf)
= train("regr.lm", trafd)
model = predict(model, predict.traf)
pred pred
Note that both the inverter and the retrafo CPOTrained
can be used for inversion, since a stateless TOCPO also has constant.invert
set. As above, when using the retrafo CPOTrained
, the truth column is absent from the result.
invert(inverter(predict.traf), pred)
invert(retr, pred)
makeCPOExtendedTargetOp()
Just as for FOCPOs, it is possible to declare a TOCPO while having more direct control over what happens at which stage of training, re-transformation, or inversion. In a TOCPO defined with makeCPOTargetOp()
, the cpo.retrafo
and cpo.train.invert
functions are called automatically when necessary during training and re-transformation. makeCPOExtendedTargetOp()
instead has a cpo.trafo
and a cpo.retrafo
parameter, which get called during the respective operation.
cpo.trafo
must be a function taking the same parameters as cpo.train
in makeCPOTargetOp()
. Instead of returning a control object, it must define a variable named “control
”, and a variable named “control.invert
”. The former is used as the control
argument of cpo.retrafo
, the latter is used as control.invert
for cpo.invert
when using the inverter CPOTrained
created during training. The return value of cpo.trafo
must be similar to the value returned by cpo.retrafo
in makeCPOTargetOp()
: it must be the modified data set or target, depending on dataformat
.
cpo.retrafo
must take the same parameters as in makeCPOTargetOp()
. It must declare a control.invert
variable that will be given to cpo.retrafo
when using the inverter CPOTrained
created during retransformation. Since cpo.retrafo
is always called during retrafo CPOTrained
application, a “target” column may or may not be present. If a target column is not present, the target
parameter of cpo.retrafo
is NULL
and the return value of cpo.retrafo
is ignored; otherwise it must be the transformed target
value (which, as in makeCPOTargetOp()
, can be a Task
or data.frame
of all data if dataformat
is "task"
or "df.all"
).
cpo.invert
works just as in makeCPOTargetOp()
.
The following is a nonsensical, synthetic example that adds 1
to the target column of a regression Task
during initial training, subtracts 1
during retrafo re-application and is a no-op during inversion.
= makeCPOExtendedTargetOp("syn.cpo", # nolint
xmpSynCPO properties.target = "regr",
cpo.trafo = function(data, target) {
cat("*** cpo.trafo ***\n")
1]] = target[[1]] + 1
target[[= "control created in cpo.trafo"
control = "control.invert created in cpo.trafo"
control.invert
target
},cpo.retrafo = function(data, target, control) {
cat("*** cpo.retrafo ***", "control is:", deparse(control), sep = "\n")
= "control.invert created in cpo.retrafo"
control.invert if (!is.null(target)) {
cat("target is non-NULL, performing transformation\n")
1]] = target[[1]] - 1
target[[return(target)
else {
} cat("target is NULL, no transformation (but control.invert was created)\n")
return(NULL) # is ignored.
}
},cpo.invert = function(target, control.invert, predict.type) {
cat("*** invert ***", "control.invert is:", deparse(control.invert),
sep = "\n")
target
})
= xmpSynCPO() cpo
For an “extended” TOCPO, only one of the transformation functions is called in each invocation. Initial transformation calls cpo.trafo
and adds 1
to the targets; using the CPOTrained
for re-transformation calls cpo.retrafo
and subtracts 1
.
= train.task %>>% cpo
trafd getTaskTargets(trafd)
= train.task %>>% retrafo(trafd) retrafd
getTaskTargets(retrafd)
It is also possible to perform re-transformation with a data.frame
that does not include the target column. In that case the target
value given to cpo.retrafo
will be NULL
, as reported by that function in this example:
= getTaskData(train.task, target.extra = TRUE)$data %>>% retrafo(trafd) retrafd
The trafd
object has an inverter CPOTrained
attribute that was created by cpo.trafo
, the retrafd
object has an inverter CPOTrained
attribute created by cpo.retrafo
(necessarily). This is made visible by the given example inverter function:
= invert(inverter(trafd), 1:6) inv
= invert(inverter(retrafd), 1:6) inv
As an aside, the Learner
enhanced by xmpMetaLearn
seems to perform marginally better than either "classif.svm"
or "classif.logreg"
on their own for a large enough subset of pid.task
(here resampled with output suppressed).
= list(
learners logreg = makeLearner("classif.logreg"),
svm = makeLearner("classif.svm"),
cpo = xmpMetaLearn(makeLearner("classif.logreg")) %>>%
makeLearner("classif.svm")
)
# suppress output of '*** cpo.train ***' etc.
configureMlr(show.info = FALSE, show.learner.output = FALSE)
= sapply(learners, function(lrn) {
perfs unname(replicate(20, resample(lrn, pid.task, cv10)$aggr))
})
# reset mlr settings
configureMlr()
boxplot(perfs)
P-Values of comparing the CPOLearner
to both "classif.logreg"
, and "classif.svm"
:
= c(
pvals logreg = t.test(perfs[, "logreg"], perfs[, "cpo"], "greater")$p.value,
svm = t.test(perfs[, "svm"], perfs[, "cpo"], "greater")$p.value
)
round(p.adjust(pvals), 3)