Imputing newdata with a saved mixgb imputer

Impute new unseen data using a saved imputer object

First, we split the nhanes3_newborn dataset into training data and test data.

library(mixgb)
set.seed(2022)
n <- nrow(nhanes3_newborn)
idx <- sample(1:n, size = round(0.7 * n), replace = FALSE)
train.data <- nhanes3_newborn[idx, ]
test.data <- nhanes3_newborn[-idx, ]

We can use the training data to obtain m imputed datasets and save their imputation models. To do this, set save.models = TRUE. By default, save.vars = NULL, which saves imputation models only for variables that have missing data in the training data. However, the unseen data may also have missing values in other variables. To be comprehensive, users can save models for all variables by setting save.vars = colnames(train.data). Note that this takes much longer, as a model must be trained and saved for each variable. If users are confident that only certain variables will have missing values in the new data, we recommend specifying the names or indices of these variables in save.vars instead of saving models for all variables.

# obtain m imputed datasets for train.data and save imputation models
mixgb.obj <- mixgb(data = train.data, m = 5, save.models = TRUE, save.vars = NULL)
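When a sample of the new data is already available, one way to choose save.vars is to list every variable that has missing values in either dataset. The following is a minimal sketch in base R. The helper vars_with_missing() and the toy data frames are hypothetical and are not part of mixgb.

```r
# Hypothetical helper (not part of mixgb): names of the columns that contain
# at least one missing value in the training data or in the new data.
vars_with_missing <- function(train, new) {
  # both datasets must have the same columns in the same order
  stopifnot(identical(colnames(train), colnames(new)))
  has.na.train <- colSums(is.na(train)) > 0
  has.na.new <- colSums(is.na(new)) > 0
  colnames(train)[has.na.train | has.na.new]
}

# Toy data for illustration only
toy.train <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6), c = c("x", "y", "z"))
toy.new <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c("x", "y", "z"))

vars_with_missing(toy.train, toy.new)
#> [1] "a" "b"
```

With the objects from this vignette, the call would be vars_with_missing(train.data, test.data), and the resulting names could be passed to save.vars in mixgb().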

When save.models = TRUE, mixgb() will return an object containing the following:

imputed.data: a list of m imputed datasets for the training data
XGB.models: a list of m sets of XGBoost models for variables specified in save.vars.
params: a list of parameters that are required for imputing new data using impute_new() later on.

We can extract the m imputed datasets from the saved imputer object with $imputed.data.

train.imputed <- mixgb.obj$imputed.data
# the 5th imputed dataset
head(train.imputed[[5]])
#>    HSHSIZER HSAGEIR HSSEX DMARACER DMAETHNR DMARETHN BMPHEAD BMPRECUM BMPSB1
#> 1:        7       2     1        1        1        3    43.0     67.1    9.2
#> 2:        4       3     2        2        3        2    42.6     67.1    8.8
#> 3:        3       9     2        2        3        2    46.5     64.3    8.6
#> 4:        3       9     2        1        3        1    46.2     68.5   10.8
#> 5:        5       4     1        1        3        1    44.7     63.0    6.0
#> 6:        5      10     1        1        3        1    45.2     72.0    5.4
#>    BMPSB2 BMPTR1 BMPTR2 BMPWT DMPPIR HFF1 HYD1
#> 1:    8.5    8.8    8.8  7.80  1.701    2    1
#> 2:    8.8   13.3   12.2  8.70  0.102    2    1
#> 3:    8.0   10.4    9.2  8.00  0.359    1    3
#> 4:   10.0   16.6   16.0  8.98  0.561    1    3
#> 5:    5.8    9.0    9.0  7.60  2.379    2    1
#> 6:    5.4    9.2    9.4  9.00  2.173    2    2

To impute new data with this saved imputer object, we use the impute_new() function. Users can also specify whether to use the new data for initial imputation. By default, initial.newdata = FALSE, so information from the training data is used to initially impute the new data. The new data are then imputed with the saved models. This is considerably faster because the imputation models do not need to be rebuilt.

test.imputed <- impute_new(object = mixgb.obj, newdata = test.data)

If PMM is used when we call mixgb(), predicted values of missing entries in the new dataset are matched with donors from the training data. Users can also set the number of donors for PMM when imputing new data. By default, pmm.k = NULL, which means the same setting as the training object will be used.
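The idea of matching predictions with donors can be illustrated with a small base R sketch. This is a simplified illustration of predictive mean matching and is not mixgb's internal code. The function pmm_match() and all numbers below are made up for the example.

```r
# Illustrative sketch of predictive mean matching (PMM); not mixgb's own code.
# yhat.new: predictions for the missing entries in the new data
# yhat.obs: predictions for training rows whose values were observed (donors)
# y.obs:    the observed values of those donors
# k:        number of candidate donors per missing entry
pmm_match <- function(yhat.new, yhat.obs, y.obs, k = 3) {
  stopifnot(length(yhat.obs) == length(y.obs), k >= 1, k <= length(y.obs))
  vapply(yhat.new, function(p) {
    # indices of the k donors whose predictions are closest to p
    donors <- order(abs(yhat.obs - p))[seq_len(k)]
    # draw one of the k donors at random and return its observed value
    y.obs[donors[sample.int(k, 1)]]
  }, numeric(1))
}

set.seed(1)
yhat.obs <- c(42.1, 43.5, 44.0, 45.2, 46.8)
y.obs <- c(42.0, 43.0, 44.5, 45.0, 47.0)
yhat.new <- c(43.8, 46.0)

imputed <- pmm_match(yhat.new, yhat.obs, y.obs, k = 3)
imputed
```

Every imputed value is an observed value taken from a donor, so imputations stay within the range of the training data. A larger k widens the pool of candidate donors for each missing entry.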

Users can also set the number of imputed datasets m. This value must be less than or equal to the m used in mixgb(). If it is not specified, the m value of the saved object is used.

test.imputed <- impute_new(object = mixgb.obj, newdata = test.data, initial.newdata = FALSE, pmm.k = 3, m = 4)
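A quick sanity check after imputing new data is to count the missing values that remain in each imputed dataset. The sketch below assumes the result is a list of data frames, one per imputed dataset. The helper count_missing() and the toy list are hypothetical and are not part of mixgb.

```r
# Hypothetical helper (not part of mixgb): number of missing values that
# remain in each element of a list of imputed datasets.
count_missing <- function(imputed.list) {
  vapply(imputed.list, function(d) sum(is.na(d)), integer(1))
}

# Toy list for illustration only: the second dataset still has one NA
toy.list <- list(
  data.frame(x = c(1, 2), y = c(3, 4)),
  data.frame(x = c(1, NA), y = c(3, 4))
)

count_missing(toy.list)
#> [1] 0 1
```

With the objects from this vignette, the call would be count_missing(test.imputed). A nonzero count may indicate that the new data has missing values in a variable that was not covered by save.vars.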


Yongshi Deng

2022-06-07
