This vignette is intended to provide a first introduction to the R package mitml for generating and analyzing multiple imputations for multilevel missing data. A usual application of the package may consist of the following steps.
1. Imputation
2. Assessment of convergence
3. Completion of the data
4. Analysis
5. Pooling
The mitml package offers a set of tools to facilitate each of these steps. This vignette is intended as a step-by-step illustration of the basic features of mitml. Further information can be found in the other vignettes and the package documentation.
For the purposes of this vignette, we employ a simple example that makes use of the studentratings data set, which is provided with mitml. To use it, the mitml package and the data set must be loaded as follows.
library(mitml)
data(studentratings)
More information about the variables in the data set can be obtained from its summary.
summary(studentratings)
#       ID       FedState        Sex           MathAchiev        MathDis
#  Min.   :1001   B :375   Length:750         Min.   :225.0   Min.   :0.2987
#  1st Qu.:1013   SH:375   Class :character   1st Qu.:440.7   1st Qu.:1.9594
#  Median :1513            Mode  :character   Median :492.7   Median :2.4350
#  Mean   :1513                               Mean   :495.4   Mean   :2.4717
#  3rd Qu.:2013                               3rd Qu.:553.2   3rd Qu.:3.0113
#  Max.   :2025                               Max.   :808.1   Max.   :4.7888
#                                             NA's   :132     NA's   :466
#       SES         ReadAchiev        ReadDis        CognAbility      SchClimate
#  Min.   :-9.00   Min.   :191.1   Min.   :0.7637   Min.   :28.89   Min.   :0.02449
#  1st Qu.:35.00   1st Qu.:427.4   1st Qu.:2.1249   1st Qu.:43.80   1st Qu.:1.15338
#  Median :46.00   Median :490.2   Median :2.5300   Median :48.69   Median :1.65636
#  Mean   :46.55   Mean   :489.9   Mean   :2.5899   Mean   :48.82   Mean   :1.73196
#  3rd Qu.:59.00   3rd Qu.:558.4   3rd Qu.:3.0663   3rd Qu.:53.94   3rd Qu.:2.24018
#  Max.   :93.00   Max.   :818.5   Max.   :4.8554   Max.   :71.29   Max.   :4.19316
#  NA's   :281                     NA's   :153                      NA's   :140
In addition, the correlations between variables (based on pairwise observations) may be useful for identifying possible sources of information that may be used during the treatment of missing data.
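One way to obtain such a matrix is sketched below. It assumes that the first three columns of the data set (ID, FedState, and Sex, as listed in the summary) are the ones to exclude; the object name cors is chosen here purely for illustration.

```r
library(mitml)
data(studentratings)

# Drop the first three columns (ID, FedState, Sex) and correlate the
# remaining numeric variables using all pairwise-complete observations.
cors <- cor(studentratings[, -(1:3)], use = "pairwise.complete.obs")
round(cors, digits = 3)
```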
#             MathAchiev MathDis    SES ReadAchiev ReadDis CognAbility SchClimate
# MathAchiev       1.000  -0.106  0.260      0.497  -0.080       0.569     -0.206
# MathDis         -0.106   1.000 -0.206     -0.189   0.613      -0.203      0.412
# SES              0.260  -0.206  1.000      0.305  -0.153       0.138     -0.176
# ReadAchiev       0.497  -0.189  0.305      1.000  -0.297       0.413     -0.320
# ReadDis         -0.080   0.613 -0.153     -0.297   1.000      -0.162      0.417
# CognAbility      0.569  -0.203  0.138      0.413  -0.162       1.000     -0.266
# SchClimate      -0.206   0.412 -0.176     -0.320   0.417      -0.266      1.000
This illustrates that (a) most variables in the data set are affected by missing data, but also (b) that substantial relations exist between variables. For simplicity, we focus on only a subset of these variables.
For the present example, we focus on the two variables ReadDis (disciplinary problems in reading class) and ReadAchiev (reading achievement).
Assume we are interested in the relation between these variables. Specifically, we may be interested in the following analysis model
\[ \mathit{ReadAchiev}_{ij} = \gamma_{00} + \gamma_{10} \mathit{ReadDis}_{ij} + u_{0j} + e_{ij} \]
On the basis of the syntax used in the R package lme4, this model may be written as follows.
ReadAchiev ~ 1 + ReadDis + (1|ID)
In this model, the relation between ReadDis and ReadAchiev is represented by a single fixed effect of ReadDis, and a random intercept is included to account for the clustered structure of the data and the group-level variance in ReadAchiev that is not explained by ReadDis.
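To make the two variance components of this model explicit, the random intercept and the residual are commonly assumed to be normally distributed. The symbols \(\tau_{00}\) and \(\sigma^2\) are notation introduced here for illustration and are not part of the model statement above.

\[ u_{0j} \sim N(0, \tau_{00}), \qquad e_{ij} \sim N(0, \sigma^2), \qquad \mathit{ICC} = \frac{\tau_{00}}{\tau_{00} + \sigma^2} \]

Under these assumptions, the intraclass correlation (ICC) is the share of the variance in ReadAchiev not explained by ReadDis that is attributable to differences between groups.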
The mitml package includes wrapper functions for the R packages pan (panImpute) and jomo (jomoImpute). Here, we will use the first option. To generate imputations with panImpute, the user must specify (at least):
1. an imputation model
2. the number of iterations and imputations
The easiest way of specifying the imputation model is to use the formula argument of panImpute. Generally speaking, the imputation model should include all variables that are either (a) part of the model of interest, (b) related to the variables in the model, or (c) related to whether the variables in the model are missing.
In this simple example, we include only ReadDis and ReadAchiev as the main target variables and SchClimate as an auxiliary variable.
fml <- ReadAchiev + ReadDis + SchClimate ~ 1 + (1|ID)
Note that, in this specification of the imputation model, all variables are included on the left-hand side of the model, whereas the right-hand side is left “empty”. This model allows for all relations between variables at Level 1 and 2 and is thus suitable for most applications of the multilevel random intercept model (for further discussion, see also Grund, Lüdtke, & Robitzsch, 2016, in press).
The imputation procedure is then run for 5,000 iterations (burn-in), after which 100 imputations are drawn, one every 100 iterations.
imp <- panImpute(studentratings, formula = fml, n.burn = 5000, n.iter = 100, m = 100)
This step may take a few seconds. Once the process is completed, the imputations are saved in the imp object.
In mitml, there are two options for assessing the convergence of the imputation procedure. First, the summary calculates the “potential scale reduction factor” (\(\hat{R}\)) for each parameter in the imputation model. If this value is noticeably larger than 1 for some parameters (say \(>1.05\)), a longer burn-in period may be required.
summary(imp)
#
# Call:
#
# panImpute(data = studentratings, formula = fml, n.burn = 5000,
# n.iter = 100, m = 100)
#
# Cluster variable: ID
# Target variables: ReadAchiev ReadDis SchClimate
# Fixed effect predictors: (Intercept)
# Random effect predictors: (Intercept)
#
# Performed 5000 burn-in iterations, and generated 100 imputed data sets,
# each 100 iterations apart.
#
# Potential scale reduction (Rhat, imputation phase):
#
#          Min   25%  Mean Median   75%   Max
# Beta:  1.000 1.001 1.001  1.001 1.002 1.003
# Psi:   1.000 1.001 1.001  1.001 1.001 1.002
# Sigma: 1.000 1.000 1.000  1.000 1.000 1.001
#
# Largest potential scale reduction:
# Beta: [1,3], Psi: [2,1], Sigma: [2,1]
#
# Missing data per variable:
#     ID ReadAchiev ReadDis SchClimate FedState Sex MathAchiev MathDis  SES CognAbility
# MD%  0          0    20.4       18.7        0   0       17.6    62.1 37.5           0
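For intuition about what \(\hat{R}\) measures, the following base-R sketch computes a Gelman-Rubin style potential scale reduction factor for simulated draws. It is a generic illustration and not the routine that mitml uses internally: the function name rhat and the toy chains are made up here, and the package may split or adjust the chain differently.

```r
set.seed(1)

# Hypothetical draws of one parameter: 4 chain segments of 1000 draws each
chains <- matrix(rnorm(4000), nrow = 1000, ncol = 4)

# Gelman-Rubin style potential scale reduction factor
rhat <- function(x) {
  n <- nrow(x)
  B <- n * var(colMeans(x))           # between-segment variance
  W <- mean(apply(x, 2, var))         # average within-segment variance
  var_hat <- (n - 1) / n * W + B / n  # pooled estimate of the posterior variance
  sqrt(var_hat / W)
}

# Stationary segments: value close to 1
rhat_ok <- rhat(chains)

# Shift one segment to mimic a chain that has not yet settled
drifting <- chains
drifting[, 4] <- drifting[, 4] + 1
rhat_bad <- rhat(drifting)

c(stationary = rhat_ok, drifting = rhat_bad)
```

In this toy setting the shifted segment pushes the value above the 1.05 threshold mentioned above, whereas the stationary segments stay close to 1.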
Second, diagnostic plots can be requested with the plot function. These plots consist of a trace plot, an autocorrelation plot, and some additional information about the posterior distribution. Convergence can be assumed if the trace plot is stationary (i.e., does not “drift”), and the autocorrelation is within reasonable bounds for the chosen number of iterations between imputations.
For this example, we examine only the plot for the parameter Beta[1,2] (i.e., the intercept of ReadDis).
plot(imp, trace = "all", print = "beta", pos = c(1,2))
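To illustrate what an autocorrelation “within reasonable bounds” could look like numerically, the sketch below simulates a hypothetical autocorrelated chain and inspects its sample autocorrelation at lag 100, the spacing between imputations used in this example. The chain is simulated with base R and is not output from panImpute; the AR(1) coefficient of 0.9 is an arbitrary assumption.

```r
set.seed(123)

# Hypothetical AR(1) chain standing in for the draws of a single parameter
chain <- as.numeric(arima.sim(model = list(ar = 0.9), n = 20000))

# Sample autocorrelations for lags 0 to 100 (element 1 is lag 0)
ac <- acf(chain, lag.max = 100, plot = FALSE)$acf[, 1, 1]

lag1   <- ac[2]    # strong dependence between neighbouring draws
lag100 <- ac[101]  # dependence between draws that are 100 iterations apart

round(c(lag1 = lag1, lag100 = lag100), 3)
```

Even though neighbouring draws are strongly correlated in this toy chain, draws that are 100 iterations apart are nearly uncorrelated, which is the property that the spacing between imputations is meant to ensure.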
Taken together, both \(\hat{R}\) and the diagnostic plots indicate that the imputation model converged, setting the basis for the analysis of the imputed data sets.
In order to work with and analyze the imputed data sets, the data sets must be completed with the imputations generated in the previous steps. To do so, mitml provides the function mitmlComplete.
implist <- mitmlComplete(imp, "all")
The resulting object is a list that contains the 100 completed data sets.
In order to obtain estimates for the model of interest, the model must be fit separately to each of the completed data sets, and the results must be pooled into a final set of estimates and inferences. The mitml package offers the with function to fit various statistical models to a list of completed data sets.
In this example, we use the lmer function from the R package lme4 to fit the model of interest.
library(lme4)
fit <- with(implist, lmer(ReadAchiev ~ 1 + ReadDis + (1|ID)))
The resulting object is a list containing the 100 fitted models. To pool the results of these models into a set of final estimates and inferences, mitml offers the testEstimates function.
testEstimates(fit, extra.pars = TRUE)
#
# Call:
#
# testEstimates(model = fit, extra.pars = TRUE)
#
# Final parameter estimates and inferences obtained from 100 imputed data sets.
#
#             Estimate Std.Error t.value       df P(>|t|)   RIV   FMI
# (Intercept)  582.186    14.501  40.147 4335.314    0.000 0.178 0.152
# ReadDis      -35.689     5.231  -6.822 3239.411    0.000 0.212 0.175
#
#                         Estimate
# Intercept~~Intercept|ID  902.868
# Residual~~Residual      6996.303
# ICC|ID                     0.114
#
# Unadjusted hypothesis test as appropriate in larger samples.
The estimates can be interpreted in a manner similar to the estimates from the corresponding complete-data procedure. In addition, the output includes diagnostic quantities such as the fraction of missing information (FMI), which can be helpful for interpreting the results and understanding problems with the imputation procedure.
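As a rough check on how these diagnostic quantities relate to one another, the sketch below recomputes the degrees of freedom and the FMI from the printed RIV values using the standard large-sample formulas from Rubin's rules, and the ICC from the two pooled variance components. This assumes that the package reports these standard quantities; because the inputs are rounded to three decimals, and because the ICC may be pooled across imputations rather than computed from the pooled components, the results should agree with the output only approximately.

```r
# Values copied from the pooled output above (rounded to three decimals)
m   <- 100                                        # number of imputed data sets
riv <- c("(Intercept)" = 0.178, ReadDis = 0.212)  # relative increase in variance

# Large-sample degrees of freedom: (m - 1) * (1 + 1 / RIV)^2
df <- (m - 1) * (1 + 1 / riv)^2

# Fraction of missing information implied by RIV and df
fmi <- (riv + 2 / (df + 3)) / (riv + 1)

# Intraclass correlation implied by the pooled variance components
tau2   <- 902.868    # Intercept~~Intercept|ID
sigma2 <- 6996.303   # Residual~~Residual
icc    <- tau2 / (tau2 + sigma2)

round(df)
round(fmi, 3)
round(icc, 3)
```

For example, an FMI of roughly 0.15 for the intercept suggests that about 15% of the sampling variance of that estimate is attributable to the missing data.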
Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Multiple imputation of multilevel missing data: An introduction to the R package pan. SAGE Open, 6(4), 1–17. doi: 10.1177/2158244016668220
Grund, S., Lüdtke, O., & Robitzsch, A. (in press). Multiple imputation of missing data for multilevel models: Simulations and recommendations. Organizational Research Methods. doi: 10.1177/1094428117703686
# Author: Simon Grund (grund@ipn.uni-kiel.de)
# Date: 2021-10-05