NADIA implements imputation in the form of mlr3pipelines functions, with a naming structure like this:
The package name can be replaced by the name of a simple method, such as Sample. Method names can also carry the label of the approach, for example Sample_B or Mice_A. This simple convention is implemented by all PipeOp functions in NADIA and makes advanced imputation methods even easier to access.
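For example, assuming the exported class names follow this convention (PipeOpMice, PipeOpSample_B, and so on; the package reference lists the exact names), such an operator can be combined with any mlr3 learner in the usual mlr3pipelines way:

```r
library(mlr3)
library(mlr3pipelines)
library(NADIA)

# mice-based imputation (B approach) followed by a simple learner;
# the class name is assumed to follow the PipeOp<Name>[_<approach>] convention
graph <- PipeOpMice$new() %>>% lrn("classif.debug")

# Wrap the graph so it can be trained and resampled like any other mlr3 learner
Learner <- GraphLearner$new(graph)
```

This is the kind of graph learner used in the error-handling examples later in this section.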
As mentioned in NADIA examples and motivation, NADIA mainly uses the B approach, presented in the diagram below.
Because of that, real data can often prove problematic. Imputation can fail because of one of the problems below:

- highly correlated or collinear variables,
- constant columns,
- categorical features containing too many unique classes.

Because of these and other possible problems, we performed tests to check how often errors appear depending on the package.
Tests were performed using data from OpenML because this ensures variety in the data sets. We used methods already implemented in mlr3pipelines to remove constant variables and categorical features containing too many unique classes. Data sets were tested automatically, meaning there was no individual preprocessing. The lack of special preprocessing leads to low success statistics in the case of Amelia, because this package often fails on highly correlated variables. We define success as a situation in which ALL missing data in the data set were imputed; the quality of the imputation was not checked. The results are presented in the table below:
Package_method | Successful tasks | Percent of successful tasks |
---|---|---|
Amelia | 6/25 | 24% |
mice | 16/25 | 64% |
missForest | 21/25 | 84% |
missMDA_MCA_PCA_FMAD | 11/25 | 44% |
missMDA_MFA | 12/25 | 48% |
missRanger | 21/25 | 84% |
softImpute | 20/25 | 80% |
VIM_HD | 22/25 | 88% |
VIM_IRMI | 8/25 | 32% |
VIM_kNN | 22/25 | 88% |
VIM_regrImp | 13/25 | 52% |
These results may look unappealing, but the data were not treated individually. For example, removing highly correlated variables should significantly improve the results of the weaker packages.
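As a rough sketch, the automatic preprocessing used in the tests plus such a correlation filter could look like the following (the 0.9 cutoff, the target_level_count value, and the use of caret::findCorrelation are illustrative assumptions, not part of NADIA):

```r
library(mlr3pipelines)
library(caret)

# Preprocessing operators from mlr3pipelines, as used in the automatic tests:
# drop constant columns and collapse factors with too many unique classes
prep <- po("removeconstants") %>>%
  po("collapsefactors", target_level_count = 10)

# Hypothetical helper: additionally drop highly correlated numeric columns,
# which mainly helps packages such as Amelia
drop_correlated <- function(df, cutoff = 0.9) {
  num_df <- df[sapply(df, is.numeric)]
  to_drop <- colnames(num_df)[caret::findCorrelation(
    cor(num_df, use = "pairwise.complete.obs"), cutoff = cutoff
  )]
  df[, setdiff(names(df), to_drop), drop = FALSE]
}
```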
In the previous section we showed that errors are something that has to be taken into account during imputation. Luckily, the mlr3 ecosystem implements a mechanism for handling them: all types of Learners have a field called encapsulate responsible for this. More about how it works in the examples below:
The evaluate package allows the user to handle errors occurring in the current R session. For example, with cross-validation it can be understood as follows: every fold is run in a separate try-catch. That is not exactly how it works from a technical perspective, but it is a useful mental model.
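A conceptual sketch of that mental model only (train_one_fold() is a hypothetical helper, not part of mlr3, and this is not how mlr3 is implemented internally):

```r
# One fold of resampling, seen through the try-catch mental model
fold_model <- tryCatch(
  train_one_fold(task, fold),   # hypothetical stand-in for training the learner on one fold
  error = function(e) {
    # the error is recorded and resampling simply continues with the next fold
    message("Fold failed: ", conditionMessage(e))
    NULL
  }
)
```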
A quick example of using evaluate:
# Encapsulation with evaluate
Learner$encapsulate = c(train = "evaluate", predict = "evaluate")
# Resampling with errors and presenting the errors
resample(tsk("pima"), Learner, rsmp("cv", folds = 5))$errors
## INFO [19:25:30.372] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 1/5)
## INFO [19:25:30.806] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 5/5)
## INFO [19:25:31.160] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 2/5)
## INFO [19:25:31.500] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 4/5)
## INFO [19:25:31.845] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 3/5)
## iteration msg
## 1: 1 Error from classif.debug->train()
## 2: 2 Error from classif.debug->train()
## 3: 3 Error from classif.debug->train()
## 4: 4 Error from classif.debug->train()
## 5: 5 Error from classif.debug->train()
The callr package allows you to run every fold in a separate R session. It can be used exactly like evaluate but is more powerful in some cases. For large data frames (over 100,000 rows), packages like mice and Amelia can sometimes crash the entire R session. In that situation, tryCatch or evaluate is not enough and you need to use callr. It can be quite tricky to correctly pass seeds to the callr session; for the details, it is best to check the mlr3 book.
This is how callr can be used (in this case we do not simulate the session crash because it would be hard to achieve reliably on every machine):
# Encapsulation with callr
Learner$encapsulate = c(train = "callr", predict = "callr")
# Resampling with errors and presenting the errors
resample(tsk("pima"), Learner, rsmp("cv", folds = 5))$errors
## INFO [19:25:32.991] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 2/5)
## INFO [19:25:36.084] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 3/5)
## INFO [19:25:39.077] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 4/5)
## INFO [19:25:42.075] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 1/5)
## INFO [19:25:44.980] Applying learner 'impute_mice_B.classif.debug' on task 'pima' (iter 5/5)
## iteration msg
## 1: 1 Error from classif.debug->train()
## 2: 2 Error from classif.debug->train()
## 3: 3 Error from classif.debug->train()
## 4: 4 Error from classif.debug->train()
## 5: 5 Error from classif.debug->train()
It is also worth looking at the time difference between the two packages:

The difference may look huge, but in the case of a larger data set (when we mainly want to use callr) the absolute difference stays approximately the same and becomes irrelevant when imputation takes, for example, 20 hours.
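As a rough sketch of how such a comparison could be reproduced (using the Learner and task from the examples above; absolute numbers will differ between machines):

```r
# Time the same resampling under both encapsulation modes
Learner$encapsulate = c(train = "evaluate", predict = "evaluate")
time_evaluate <- system.time(resample(tsk("pima"), Learner, rsmp("cv", folds = 5)))

Learner$encapsulate = c(train = "callr", predict = "callr")
time_callr <- system.time(resample(tsk("pima"), Learner, rsmp("cv", folds = 5)))

# callr starts a fresh R session per iteration, so its elapsed time includes
# a roughly constant per-fold start-up overhead
rbind(evaluate = time_evaluate, callr = time_callr)
```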
Errors will appear when statistical methods are used on real data. NADIA implements the best possible methods to handle them, but in the end the best way to solve any issue is an individual approach to each data set: perhaps removing irrelevant columns or scaling the data. It all depends on the package used and on the data structure.
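As a final hedged illustration of such an individual approach (the column name useless_id is purely hypothetical), standard mlr3pipelines operators can simply be placed in front of the imputation step:

```r
# Drop an irrelevant column and scale numeric features before imputing;
# PipeOpMice is assumed to be the NADIA mice operator used earlier
graph <- po("select", selector = selector_invert(selector_name("useless_id"))) %>>%
  po("scale") %>>%
  PipeOpMice$new() %>>%
  lrn("classif.debug")
Learner <- GraphLearner$new(graph)
```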