Mixgb offers a scalable solution for imputing large datasets using XGBoost, bootstrapping and predictive mean matching. Mixgb is built under Fully Conditional Specification (FCS), where XGBoost imputation models are built for each incomplete variable. Mixgb can automatically handle different types of variables and users do not need to encode categorical variables themselves. Users can also choose different settings regarding bootstrapping and the types predictive mean matching to enhance imputation performance.
mixgb
We first load the mixgb
package and the
nhanes3_newborn
dataset, which contains 16 variables of
various types (integer/numeric/factor/ordinal factor). There are 9
variables with missing values.
library(mixgb)
str(nhanes3_newborn)
#> tibble [2,107 × 16] (S3: tbl_df/tbl/data.frame)
#> $ HSHSIZER: int [1:2107] 4 3 5 4 4 3 5 3 3 3 ...
#> $ HSAGEIR : int [1:2107] 2 5 10 10 8 3 10 7 2 7 ...
#> $ HSSEX : Factor w/ 2 levels "1","2": 2 1 2 2 1 1 2 2 2 1 ...
#> $ DMARACER: Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 2 ...
#> $ DMAETHNR: Factor w/ 3 levels "1","2","3": 3 1 3 3 3 3 3 3 3 3 ...
#> $ DMARETHN: Factor w/ 4 levels "1","2","3","4": 1 3 2 1 1 1 2 1 2 2 ...
#> $ BMPHEAD : num [1:2107] 39.3 45.4 43.9 45.8 44.9 42.2 45.8 NA 40.2 44.5 ...
#> ..- attr(*, "label")= chr "Head circumference (cm)"
#> $ BMPRECUM: num [1:2107] 59.5 69.2 69.8 73.8 69 61.7 74.8 NA 64.5 70.2 ...
#> ..- attr(*, "label")= chr "Recumbent length (cm)"
#> $ BMPSB1 : num [1:2107] 8.2 13 6 8 8.2 9.4 5.2 NA 7 5.9 ...
#> ..- attr(*, "label")= chr "First subscapular skinfold (mm)"
#> $ BMPSB2 : num [1:2107] 8 13 5.6 10 7.8 8.4 5.2 NA 7 5.4 ...
#> ..- attr(*, "label")= chr "Second subscapular skinfold (mm)"
#> $ BMPTR1 : num [1:2107] 9 15.6 7 16.4 9.8 9.6 5.8 NA 11 6.8 ...
#> ..- attr(*, "label")= chr "First triceps skinfold (mm)"
#> $ BMPTR2 : num [1:2107] 9.4 14 8.2 12 8.8 8.2 6.6 NA 10.9 7.6 ...
#> ..- attr(*, "label")= chr "Second triceps skinfold (mm)"
#> $ BMPWT : num [1:2107] 6.35 9.45 7.15 10.7 9.35 7.15 8.35 NA 7.35 8.65 ...
#> ..- attr(*, "label")= chr "Weight (kg)"
#> $ DMPPIR : num [1:2107] 3.186 1.269 0.416 2.063 1.464 ...
#> ..- attr(*, "label")= chr "Poverty income ratio"
#> $ HFF1 : Factor w/ 2 levels "1","2": 2 2 1 1 1 2 2 1 2 1 ...
#> $ HYD1 : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 3 1 1 1 1 1 1 2 1 ...
colSums(is.na(nhanes3_newborn))
#> HSHSIZER HSAGEIR HSSEX DMARACER DMAETHNR DMARETHN BMPHEAD BMPRECUM
#> 0 0 0 0 0 0 124 114
#> BMPSB1 BMPSB2 BMPTR1 BMPTR2 BMPWT DMPPIR HFF1 HYD1
#> 161 169 124 167 117 192 7 0
To impute this dataset, we can use the default settings. The default
number of imputed datasets m = 5
. Note that we do not need
to convert our data into dgCMatrix or one-hot coding format. Our package
will convert it automatically. Variables should be of the following
types: numeric, integer, factor or ordinal factor.
# use mixgb with default settings
<- mixgb(data = nhanes3_newborn, m = 5) imputed.data
We can also customise imputation settings:
The number of imputed datasets m
The number of imputation iterations maxit
Whether to convert ordinal factors to integer (imputation process
will be faster) ordinalAsInteger
Whether to use bootstrapping bootstrap
Predictive mean matching settings pmm.type
,
pmm.k
and pmm.link
.
Initial imputation methods for different types of variables
initial.num
, initial.int
and
initial.fac
.
Whether to save models for imputing newdata
save.models
and save.vars
.
XGBoost hyperparameters and verbose settings.
xgb.params
, nrounds
,
early_stopping_rounds
, print_every_n
and
verbose
.
# Use mixgb with chosen settings
<- list(
params max_depth = 6,
gamma = 0.1,
eta = 0.3,
min_child_weight = 1,
subsample = 1,
colsample_bytree = 1,
colsample_bylevel = 1,
colsample_bynode = 1,
nthread = 4,
tree_method = "auto",
gpu_id = 0,
predictor = "auto"
)
<- mixgb(
imputed.data data = nhanes3_newborn, m = 5, maxit = 1,
ordinalAsInteger = TRUE, bootstrap = TRUE,
pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
initial.num = "normal", initial.int = "mode", initial.fac = "mode",
save.models = FALSE, save.vars = NULL,
xgb.params = params, nrounds = 50, early_stopping_rounds = 10, print_every_n = 10L, verbose = 0
)
Imputation performance can be affected by the hyperparameter
settings. It may seem daunting to tune a large set of hyperparameters,
but often we can narrow down the search as many hyperparameters are
correlated. In our package, we have a function mixgb_cv()
to tune nrounds
. There is no default nrounds
value in XGBoost,
so we need to specify it. The default
nrounds
in mixgb
is 50. However, we recommend
using mixgb_cv()
to find the optimal nrounds
first.
<- mixgb_cv(data = nhanes3_newborn, verbose = FALSE)
cv.results $evaluation.log
cv.results#> iter train_logloss_mean train_logloss_std test_logloss_mean
#> 1: 1 0.6208135 0.001946554 0.6552126
#> 2: 2 0.5767785 0.004676169 0.6374645
#> 3: 3 0.5449469 0.005228210 0.6272369
#> 4: 4 0.5191374 0.008095561 0.6242182
#> 5: 5 0.4941135 0.008357847 0.6243667
#> 6: 6 0.4755729 0.009002235 0.6254604
#> 7: 7 0.4595016 0.007635777 0.6274494
#> 8: 8 0.4438848 0.007674532 0.6277083
#> 9: 9 0.4337404 0.008486617 0.6281046
#> 10: 10 0.4190590 0.008824536 0.6305885
#> 11: 11 0.4102110 0.005878129 0.6333391
#> 12: 12 0.3978915 0.007290803 0.6358570
#> 13: 13 0.3889896 0.009138579 0.6394993
#> 14: 14 0.3781350 0.007533795 0.6391836
#> test_logloss_std
#> 1: 0.005066467
#> 2: 0.001808090
#> 3: 0.005509982
#> 4: 0.005613646
#> 5: 0.007161814
#> 6: 0.008257190
#> 7: 0.011842903
#> 8: 0.014264475
#> 9: 0.014754702
#> 10: 0.013225266
#> 11: 0.010154894
#> 12: 0.010633046
#> 13: 0.009992737
#> 14: 0.009951158
$response
cv.results#> [1] "HFF1"
$best.nrounds
cv.results#> [1] 4
By default, mixgb_cv()
will randomly choose an
incomplete variable as the response and build an XGBoost model with
other variables using the complete cases of the dataset. Therefore, each
run of mixgb_cv()
is likely to return different results.
Users can also specify the response and covariates in the argument
response
and select_features
,
respectively.
<- mixgb_cv(data = nhanes3_newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
cv.results response = "BMPHEAD", select_features = c("HSAGEIR", "HSSEX", "DMARETHN", "BMPRECUM", "BMPSB1", "BMPSB2", "BMPTR1", "BMPTR2", "BMPWT"), verbose = FALSE)
$best.nrounds
cv.results#> [1] 19
Since using mixgb_cv()
with this dataset mostly returns
a number less than 20
, I’ll set nrounds = 20
in mixgb()
to obtain m
imputed datasets.
<- mixgb(data = nhanes3_newborn, m = 5, nrounds = 20) imputed.data
If our dataset has ordinal factors, then the imputation process will
be sped up by converting them to integers. In
nhanes3_newborn
data, there is only one ordinal factor, but
we can still see that the imputation time is smaller with
ordinalAsInteger = TRUE
. If m
and
maxit
are larger, the difference will be even more obvious.
Note that when ordinalAsInteger = TRUE
, the class of these
ordinal factors in the imputed datasets will also be integer. We
recommend to use the default PMM setting pmm.type = "auto"
,
which will impute numeric and integer variables with PMM (type 2) and
impute categorical variables without PMM. If we don’t apply PMM to
integer variables, the missing values may be imputed as numeric.
Sometimes this can be problematic, especially when the integers are
actually converted from ordinal factors. Therefore,
pmm.type = NULL
is disabled when
ordinalAsInteger = TRUE
.
system.time(imputed.data <- mixgb(data = nhanes3_newborn, m = 5, maxit = 1, ordinalAsInteger = FALSE))
#> user system elapsed
#> 10.626 0.235 10.907
system.time(imputed.data <- mixgb(data = nhanes3_newborn, m = 5, maxit = 1, ordinalAsInteger = TRUE))
#> user system elapsed
#> 10.532 0.328 11.148
The mixgb
package provides the following visual
diagnostics functions:
Single variable: plot_hist()
,
plot_box()
, plot_bar()
;
Two variables: plot_2num()
,
plot_2fac()
, plot_1num1fac()
;
Three variables: plot_2num1fac()
,
plot_1num2fac()
.
Each function will return m+1
panels to compare the
observed data with m
sets of actual imputed values.
For more details, please check the vignette on githubVisual diagnostics for multiply imputed values.