Missing data are a common issue in many fields of empirical research. A popular approach to handling missing data is the method of multiple imputation (MI). Multiple imputation involves replacing each missing value by a number of imputed values, creating multiple completed (imputed) datasets. Each completed dataset is then analysed as usual, and estimates and standard errors are combined across imputations using rules developed by Rubin.
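As a concrete illustration, the pooling step can be sketched in a few lines of R (the mitools package used later in this vignette does this automatically). Here pool_rubin, ests and vars are hypothetical names: ests and vars hold the per-imputation point estimates and squared standard errors for a single parameter.
# Minimal sketch of Rubin's rules pooling for one parameter (illustrative only)
pool_rubin <- function(ests, vars) {
  m <- length(ests)
  qbar <- mean(ests)            # pooled point estimate
  ubar <- mean(vars)            # within-imputation variance
  b <- var(ests)                # between-imputation variance
  total <- ubar + (1 + 1/m) * b # total variance
  c(estimate = qbar, se = sqrt(total))
}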
The most popular approach to imputation uses parametric models for the partially observed variables given the observed data. Multiple imputation gives valid inferences provided that the missing data satisfy the so-called missing at random (MAR) assumption and that the imputation models used are correctly specified.
When multiple variables are affected by missingness, the traditional approach to imputation is to specify a joint (or multivariate) model for the partially observed variables. One of the earliest examples of this was MI using the multivariate normal model. Rather than specifying a joint model directly, a popular alternative is the fully conditional specification (FCS), or chained equations, approach. In FCS MI, a separate conditional model is specified for each partially observed variable. In each of these conditional models, by default all of the other variables serve as predictors. For an overview of FCS MI and an implementation of it in R, see van Buuren and Groothuis-Oudshoorn.
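For example, standard FCS MI can be carried out along the following lines. This is a minimal sketch using the mice package, where df is a hypothetical data frame with missing values in several columns and y ~ x + z a hypothetical analysis model.
library(mice)
imp <- mice(df, m = 5, maxit = 10, printFlag = FALSE)  # impute using mice's default conditional models
fits <- with(imp, lm(y ~ x + z))                       # fit the analysis model in each imputed dataset
summary(pool(fits))                                    # combine estimates using Rubin's rules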
When missing values are imputed from a misspecified model, invalid inferences will in general result. One way in which misspecification can occur is when the imputation model and the substantive (analysis) model of interest are incompatible. Loosely speaking, this means there exists no joint model which contains the imputation model and the substantive model as its corresponding conditionals. As described by Bartlett et al (2015), if the substantive model is correctly specified, such incompatibility implies that the imputation model is misspecified, unless the two models can be made compatible by imposing a restriction on the imputation model.
Such incompatibility between the imputation model used to impute a partially observed covariate and the substantive/outcome model can arise, for example, when the latter includes interactions or non-linear effects of variables. A further example is when the substantive model is a Cox proportional hazards model for a censored time-to-event outcome. In these cases, it may be difficult or impossible to specify an imputation model for a covariate which is compatible with the model for the outcome (the substantive model) using the standard imputation models available in existing packages.
The substantive model compatible modification of FCS MI (SMC-FCS),
proposed by Bartlett
et al (2015), ensures that each partially observed variable
is imputed from an imputation model which is compatible with a user-specified model for the outcome (which is typically the substantive
model of interest, although see below regarding auxiliary variables). As
described in further detail in Bartlett et al (2015), for each partially
observed variable, e.g. x1, in SMC-FCS a model is specified for the conditional distribution of x1 given the other partially observed variables x2, x3, ..., xp and the fully observed covariates z. This, together with the specified substantive model (a model for the outcome y), defines an imputation model for x1 which is guaranteed to be compatible with this specified substantive model.
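In symbols (a sketch of the key relation underlying SMC-FCS, using the notation above), the imputation distribution used for x1 is proportional to the product of the substantive model density and the specified covariate model:

f(x_1 \mid x_2, \ldots, x_p, z, y) \;\propto\; f(y \mid x_1, \ldots, x_p, z) \, f(x_1 \mid x_2, \ldots, x_p, z)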
Unfortunately, the resulting imputation model for each partially
observed variable generally does not belong to a standard parametric
family, complicating the imputation of missing values. To overcome this,
smcfcs
uses the method of rejection sampling, which is more
computationally intensive than direct sampling methods.
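As a generic illustration of the idea (this is not the package's internal code), a rejection sampler draws candidates from a proposal distribution and accepts each one with probability target(candidate)/(M * dproposal(candidate)), where M bounds the ratio of the target density to the proposal density. All names below (rejection_sample, target, rproposal, dproposal, M) are hypothetical.
# Generic rejection sampler: draw from a target density known only up to a
# normalising constant, given a proposal we can sample from and an envelope
# constant M with target(x) <= M * dproposal(x) for all x. Purely illustrative.
rejection_sample <- function(target, rproposal, dproposal, M, maxtries = 1000) {
  for (i in seq_len(maxtries)) {
    candidate <- rproposal(1)                                  # propose a candidate value
    if (runif(1) <= target(candidate) / (M * dproposal(candidate))) {
      return(candidate)                                        # accept
    }
  }
  warning("Rejection sampling failed within maxtries attempts")
  NA
}
When no candidate is accepted within the allowed number of attempts, the sampler fails; this is the situation the rejection sampling warnings discussed later in this vignette refer to.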
SMC-FCS ensures compatibility between each partially observed covariate's imputation model and the substantive model. However, when there is more than one partially observed variable, it does not guarantee that the corresponding imputation models are mutually compatible. Consequently, as described further by Bartlett et al (2015), only in special cases does SMC-FCS generate imputations from a well-defined Bayesian joint model. Nonetheless, by ensuring compatibility between each partially observed variable's imputation model and the substantive model, it arguably overcomes (compared to standard FCS MI) the type of model incompatibility which is most likely to adversely affect inferences.
In certain situations it may be advantageous to use SMC-FCS rather than traditional FCS MI. Important examples, as mentioned previously, include situations where the substantive (outcome) model includes interactions or non-linear effects of some of the covariates, or where the outcome model is itself non-linear, such as a Cox proportional hazards model. See Bartlett et al (2015) for simulation results comparing the two approaches in these situations.
The smcfcs package

The smcfcs function in the smcfcs package implements the SMC-FCS procedure. Currently linear, logistic and Cox proportional hazards substantive models are supported. Competing risks outcome data
can also be accommodated, with a Cox proportional hazards model used to
model each cause specific hazard function. Partially observed variables
can be imputed using normal linear regression, logistic regression (for
binary variables), proportional odds regression (sometimes known as
ordinal logistic regression, suitable for ordered categorical
variables), multinomial logistic regression (for unordered categorical
variables), and Poisson regression (for count variables). In the
following we describe some of the important aspects of using
smcfcs
by way of an example data frame.
To illustrate the package, we use the simple example data frame
ex_linquad, which is included with the package. This data frame was simulated with n=1000 independent rows. For each row, variables y, x, z, v were intended to be collected, but there are missing values in x. The values have been made artificially missing, with the probability of missingness dependent on the (fully observed) y variable. Below the first 10 rows of the data frame are shown:
library(smcfcs)
ex_linquad[1:10,]
## y z x xsq v
## 1 -0.3404639 -1.2053334 -1.2070657 1.45700772 -2.18088437
## 2 2.1699185 0.3014667 0.2774292 0.07696698 0.17779805
## 3 2.0293128 -1.5391452 1.0844412 1.17601267 0.97370618
## 4 6.6311247 0.6353707 -2.3456977 5.50229771 -1.15350311
## 5 3.9096291 0.7029518 0.4291247 0.18414800 -1.22676124
## 6 -0.5019313 -1.9058829 NA NA -0.53958740
## 7 0.5816303 0.9389214 NA NA -2.31497909
## 8 1.0236009 -0.2244921 NA NA -0.03351108
## 9 -1.2942170 -0.6738168 NA NA -1.01040885
## 10 1.9041271 0.4457874 -0.8900378 0.79216734 -2.72923160
As shown, the xsq
variable is equal to the square of the
x
variable. Since the latter has missing values, so does
the former. We now impute the missing values in x and xsq, compatibly with a substantive model for the outcome y which is specified as a linear regression, with z, x and xsq as covariates:
set.seed(123)
#impute missing values in x, compatibly with quadratic substantive model
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq", method=c("","","norm","x^2",""))
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation 1"
## [1] "Imputing: x using z plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
As demonstrated here, the minimal arguments to pass to
smcfcs
are the data frame to be used, the substantive model
type, the substantive model formula, and a method vector. The
substantive model type specifies whether the model for the outcome is
linear, logistic or Cox regression, or a competing risks analysis (see
documentation). The smformula
specifies the linear
predictor of the substantive/outcome model. Here we specified that the
outcome y is assumed to follow a linear regression model, with z, x and xsq as predictors.
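As an aside, a hypothetical (not run) sketch of the corresponding call for a Cox proportional hazards substantive model might look as follows, assuming a data frame survdata with columns t (survival time), d (event indicator), a partially observed covariate x and a fully observed covariate z; survdata and its column names are assumptions for illustration only.
# hypothetical Cox substantive model call; columns of survdata assumed to be t, d, x, z
imps_cox <- smcfcs(survdata, smtype="coxph", smformula="Surv(t,d)~x+z",
                   method=c("","","norm",""))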
Lastly, we passed a vector of strings as the method
argument. This specifies, for each column in the data frame, the method
to use for imputation. As in the example, empty strings should be passed
for those columns which are fully observed and thus are not to be
imputed. For x we specify norm, in order to impute using a normal linear regression model. See the help for
smcfcs
for the syntax for other imputation model types. For
xsq
we specify "x^2"
as the imputation method.
This instructs smcfcs
to impute xsq by simply squaring the imputed values of x. Such a specification could also be used with the mice package, which implements standard FCS MI. Note however that here, by specifying the substantive model as including an effect of xsq, smcfcs imputes the missing values in x in a way which allows for a quadratic effect on y.
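For reference, a sketch of the analogous passive imputation setup in mice follows (assuming the mice package is available); note that, unlike smcfcs, the default imputation model used here for x is not guaranteed to be compatible with a substantive model that includes xsq.
library(mice)
meth <- c("", "", "norm", "~I(x^2)", "")   # columns: y, z, x, xsq, v; xsq imputed passively
pred <- make.predictorMatrix(ex_linquad)
pred["x", "xsq"] <- 0                      # do not feed the passive variable back into the model for x
imps_mice <- mice(ex_linquad, m=5, method=meth, predictorMatrix=pred, printFlag=FALSE)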
Having generated the imputed datasets, we can now fit our substantive
model of interest. Here we make use of the mitools
package
to fit our substantive model to each imputed dataset, collect the
results, and combine them using Rubin’s rules:
# fit substantive model
library(mitools)
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y~z+x+xsq))
summary(MIcombine(models))
## Multiple imputation results:
## with(impobj, lm(y ~ z + x + xsq))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 0.9334715 0.04630014 0.8403438 1.026599 32 %
## z 1.0052498 0.03733622 0.9308187 1.079681 26 %
## x 0.9849376 0.03375213 0.9183668 1.051508 15 %
## xsq 1.0387792 0.02430037 0.9902690 1.087289 27 %
Here the data were simulated such that the coefficients of z, x and xsq are all 1. The estimates we have obtained are (reassuringly) close to these true parameter values. To illustrate the dangers of imputing a covariate using an imputation model which is not compatible with the substantive model, we now re-impute x, but this time imputing compatibly with a model for y which does not allow for the quadratic effect:
#impute missing values in x, compatibly with model for y which omits the quadratic effect
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x", method=c("","","norm","x^2",""))
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation 1"
## [1] "Imputing: x using z plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 3 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
smcfcs
has issued some warnings about rejection
sampling. We discuss this later in this vignette, but here we will
continue and proceed to fit a model for y which includes both x and xsq (plus z) as covariates:
# fit substantive model
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y~z+x+xsq))
summary(MIcombine(models))
## Multiple imputation results:
## with(impobj, lm(y ~ z + x + xsq))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 1.2518297 0.07721520 1.0926282 1.4110313 45 %
## z 1.0923872 0.05635147 0.9799551 1.2048193 26 %
## x 0.9277074 0.06518336 0.7859114 1.0695035 63 %
## xsq 0.6093138 0.05299860 0.4845629 0.7340648 80 %
Now we have an estimate of the coefficient of xsq of 0.60, which is considerably smaller than the true value 1 used to simulate the data. This bias is due to the imputation model we have just used for x being misspecified. In particular, it was misspecified because it wrongly assumed a linear dependence of y on x, rather than allowing a quadratic dependence.
Auxiliary variables
One of the strengths of multiple imputation in general is the possibility to use variables in imputation models which are subsequently not involved in the substantive model. This may be useful in order to condition or adjust for variables which are predictive of missingness, but which are not used in the substantive model of interest. Moreover, adjusting for auxiliary variables which are strongly correlated with one or more variables which are being imputed improves efficiency.
When using smcfcs
to impute missing covariates,
an auxiliary variable v can be included by adding it as an additional covariate in the substantive model, as passed using the smformula argument. Here we impute x compatibly with this particular specification of the model for the outcome; our substantive model of interest is then a simpler model which omits v. For example, in the quadratic example dataset, we can
add the auxiliary variable v
using:
#impute, including v as a covariate in the substantive/outcome model
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq+v", method=c("","","norm","x^2",""))
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z,v"
## [1] "Imputation 1"
## [1] "Imputing: x using z,v plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 1 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
# fit substantive model, which omits v
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y~z+x+xsq))
summary(MIcombine(models))
## Multiple imputation results:
## with(impobj, lm(y ~ z + x + xsq))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 0.9295283 0.04234163 0.8457342 1.013322 19 %
## z 1.0209851 0.03536731 0.9511068 1.090863 17 %
## x 0.9762357 0.03556863 0.9054434 1.047028 24 %
## xsq 1.0403799 0.02367377 0.9933561 1.087404 23 %
For outcome models other than linear regression, this approach is not
entirely justifiable due to the lack of collapsibility of non-linear
models. For example, if a Cox model is assumed for a failure time given
variables x and v, the hazard function given only x (i.e. omitting v from the model) is no
longer a Cox model. Further research is warranted to explore how this
might affect the resulting inferences.
It is also possible to include the auxiliary variable v
without adding it to the outcome model (as given in the
smformula
argument), through specification of the
predictorMatrix
argument. Doing so conditions on v, but assumes that the outcome is independent of v, conditional on whatever covariates are specified in smformula. This should thus only be used when the latter
assumption is justified. When it is, inferences will in general be more
efficient. To make this assumption when imputing x
in the
ex_linquad
data, we define a predictorMatrix
which will specify that x be imputed using both z and v, but we omit v from the smformula argument:
predMatrix <- array(0, dim=c(ncol(ex_linquad), ncol(ex_linquad)))
predMatrix[3,] <- c(0,1,0,0,1)
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq", method=c("","","norm","x^2",""), predictorMatrix=predMatrix)
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation 1"
## [1] "Imputing: x using z,v plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y~z+x+xsq))
summary(MIcombine(models))
## Multiple imputation results:
## with(impobj, lm(y ~ z + x + xsq))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 0.9335536 0.04695924 0.8384797 1.028628 36 %
## z 1.0140131 0.03476990 0.9455146 1.082512 14 %
## x 0.9866993 0.03979143 0.9051913 1.068207 42 %
## xsq 1.0393844 0.02260472 0.9949177 1.083851 12 %
Sometimes when running smcfcs
you may receive warnings
that the rejection sampling that smcfcs
uses has failed to
draw from the required distribution on a couple of occasions. Upon
receiving this warning, it is generally a good idea to re-run smcfcs, specifying a value for rjlimit which is larger than the default, until the warning is no longer issued.
Having said that, when only a small number of warnings are issued, it
may be fine to ignore the warnings, especially when the dataset is
large.
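For example (the value used here is arbitrary, chosen simply to be larger than the default):
# re-run with a larger rejection sampling limit
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq",
               method=c("","","norm","x^2",""), rjlimit=5000)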
Like standard chained equations or FCS imputation, the SMC-FCS
algorithm must be run for a sufficient number of iterations for the
process to converge to its stationary distribution. The default number
of iterations used is 10, but this may not be sufficient for a given dataset and model specification. To assess convergence, the object returned by smcfcs includes an object called smCoefIter. This matrix contains the parameter estimates of the substantive model, and is indexed by imputation number, parameter number, and iteration number. One can call smcfcs with m=1 and numit suitably chosen (e.g. numit=100), and then plot the values in the resulting smCoefIter matrix to check convergence. To illustrate, we
re-run the imputation model used previously with the example data, but
asking for only m=1
imputation to be generated, and with
100 iterations.
# impute once with a larger number of iterations than the default 10
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq", method=c("","","norm","x^2",""), predictorMatrix=predMatrix, m=1, numit=100)
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation 1"
## [1] "Imputing: x using z,v plus outcome"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 1 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
# plot estimates of the fourth parameter of the substantive model against iteration number
plot(imps$smCoefIter[1,4,])
The plot shows that the process appears to converge rapidly, such
that the default choice of numit=10
is probably fine here.
Of course we should also examine the corresponding plots for the other
parameters of the substantive model, since convergence may require more
than 10 iterations for some of these.
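For example, the following sketch plots each parameter of the substantive model against iteration number for the single imputation generated above, using only the smCoefIter indexing described earlier:
# plot every substantive model parameter against iteration number
npar <- dim(imps$smCoefIter)[2]          # number of substantive model parameters
par(mfrow=c(ceiling(npar/2), 2))
for (j in seq_len(npar)) {
  plot(imps$smCoefIter[1, j, ], type="l", xlab="Iteration",
       ylab=paste("Parameter", j))
}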
Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research, 2015; 24(4):462-487
van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 2011; 45(3)