If you work with data, you have almost certainly encountered issues regarding missing data. The most commonly used statistical methods often only work with complete sets of observations and it can be a huge pain when you realize that you can only utilize a fraction of your data because of missing data points here and there. Indeed, just 5% random missing data in a huge dataset can dramatically reduce the number of complete observations to as low as 25% of the original number of observations! Scary! Thankfully, there are numerous algorithms that can help you tackle this issue, but often, the most commonly used methods, such as mean and median imputation, offer suboptimal solutions and sophisticated algorithms often need a lot of consideration and background knowledge. The authors wrote the missCompare package to help you through your missing data problem. The initial hypothesis of the authors was that datasets with different parameters (size, distributions, correlations, etc.) will benefit differently from various missing data imputation algorithms.
This tutorial will navigate you through the missCompare package and its functions. So… let’s jump right into it!
First, let’s begin by installing and loading the missCompare package!
To walk you through the missCompare missing data imputation pipeline, the authors created the missCompare::clindata_miss
dataframe. As the name suggest, clindata_miss contains missing data. You can load this dataframe into your workspace by:
The data is created by the authors with the purpose of generating a dataset that resembles real-life data with variables generally available in a simple, realistic clinical epidemiology dataset. This R dataframe have realistic variable means, distributions, correlations and missing data patterns. You can find further information on the dataframe using ?clindata_miss
.
…Some words on alternative imputation methods. Before you get yourself acquainted with missCompare and its functions, it is always a good idea to carefully observe, and do some thinking about your data, your variables/features and your missing data. Before trying fancy methods, there are some cases when deductive imputation can be applied. This is a form of imputation based on logical reasoning rather than statistical methods. Let’s consider four separate cases when this can be applied:
Constant variables: In this scenario, in a prospective dataset, sex information is available at a participant’s first visit in 1998 and third visit in 2005, but missing in 2000. With logical deduction, it can be concluded that biological sex is constant, therefore the missing data point is also Female
ID | Date | Sex | BMI |
---|---|---|---|
ID13 | 01-01-1998 | Female | 25.8 |
ID13 | 01-01-2000 | NA | 27.7 |
ID13 | 01-01-2005 | Female | 27.0 |
Prospective changes: In this scenario, in a prospective dataset, age information is available at a participant’s first visit in 1998, but missing in 2000 and 2005. With logical deduction, it can be concluded that missing age information is 25 and 30 for the missing two years
ID | Date | Age | BMI |
---|---|---|---|
ID13 | 01-01-1998 | 23 | 25.8 |
ID13 | 01-01-2000 | NA | 27.7 |
ID13 | 01-01-2005 | NA | 27.0 |
Cumulative variables: In this scenario, in a prospective dataset, there is information available on how many times the participant has been pregnant. This information is available at a participant’s last visit in 2005, but missing in 1998 and 2000. With logical deduction, it can be concluded that if the participant has never been pregnant in 2005, than she has never been pregnant earlier either, therefore the missing numbers are 0 and 0
ID | Date | Number of pregnancies | BMI |
---|---|---|---|
ID13 | 01-01-1998 | NA | 25.8 |
ID13 | 01-01-2000 | NA | 27.7 |
ID13 | 01-01-2005 | 0 | 27.0 |
Dependent variables / known formula: In some cases, there could be a known relationship between variables. Such case is when the variables height, weight and BMI are present in the data, and some values are randomly missing. As BMI is calculated as weight (kg) /height2 (m2), knowing the formula it can be deducted that the values below are 70 kg and 31.2 kg/m2. A similar scenario in clinical datasets would be missing laboratory lipid values, where LDL cholesterol values are calculated by the Friedewald formula from triglycerides, total cholesterol and HDL cholesterol values. Another classic scenario is linearly dependent variables, where the sum of two variables produce a third variable.
ID | Weight_kg | Height_meters | BMI |
---|---|---|---|
ID13 | NA | 1.65 | 25.8 |
ID15 | 74 | 1.78 | 23.4 |
ID16 | 89 | 1.69 | NA |
All of the above scenarios need prior knowledge of the data, which is “easy” for humans, but is very hard to compute into any single missing data imputation algorithm. Therefore, it is always suggested that the variables and their relationships are thoroughly understood before jumping into hardcore data imputation using a fancy algorithm.
Also, it is highly recommended to check data ranges. By running a simple summary()
on your data you can check minimum and maximum values for each numeric variables and browse through all categories for factors. Oftentimes it happens that a variable - a blood biomarker, for instance - has a minus value as its minimum. While this can be OK in some cases (e.g. the variable is already normalized), but sometimes this can be a data input error or missing data points can be coded as minus values. In these cases, it is important that the minus values are set to NA.
Another important aspect that should be considered are outliers. This tutorial is not meant to delve too deep into this, but it is suggested that outliers are visualized for each variables and are excluded if they are too extreme or represent non-sense values. E.g. an age value of 109 is definitely an outlier, but physiologically possible, but 309 is probably due to a data input error, therefore it should be set to NA.
The missCompare pipeline is comprised of the following steps:
missCompare::clean()
missCompare::get_data()
missCompare::simulated()
missCompare::all_patterns()
. These patterns are:
missCompare::MCAR()
- missing data occurrence randommissCompare::MAR()
- missing data occurrence correlates with other variables’ valuesmissCompare::MNAR()
- missing data occurrence correlates with variables’ own valuesmissCompare::MAP()
- a combination of the previous three, where the user can define a pattern per variablemissCompare::impute_simulated()
missCompare::impute_data()
missCompare::post_imp_diag()
.Now let’s look into these, step by step!
The first step before starting anything imputation related is getting to know your data and cleaning it if necessary. This is not an “automated” process, nor should it be. You always need to explore your data before missing data imputation - it is entirely up to you what to keep and what to drop from your dataset before the imputation process. After the cleaning steps, missCompare will essentially reconstruct your dataset with no missing data, and in order to achieve this, your dataframe cannot have variables of type character or factor (in a later stage you will be able to use your original factor variables, but to extract metadata at this step, a conversion is necessary).
Let’s take clindata_miss and run missCompare::clean()
. In one step, you can achieve multiple cleaning steps:
cleaned <- missCompare::clean(clindata_miss,
var_removal_threshold = 0.5,
ind_removal_threshold = 0.8,
missingness_coding = -9)
#> Variable(s) sex, education converted to numeric.
#> Variable(s) PPG removed due to exceeding the pre-defined removal threshold (>50%) for missingness.
#> 1 individual(s) removed due to exceeding the pre-defined removal threshold (>80%) for missingness.
The cleaning function outputs a dataframe called cleaned
. In the next step, where you extract important metadata from your dataframe, you will continue working with this object.
You will get to know important details of your cleaned dataframe using the following line:
If there are non-numeric variables in the data, a warning will appear: Warning! Variable(s) sex, education is/are not numeric. Convert this/these variables to numeric using missCompare::clean() and repeat the missCompare::get_data() function until no warnings are shown.
As we cleaned clindata_miss in the previous step, the function should run without any warnings and errors. The output object, metadata
is a list of the following elements.
metadata$Complete_cases
- the number of complete observations with no missing data. In this case, there are 1349 observations from the original 2500 with no missing data. Note, that with only ~7% missing data, the number of complete observations is dramatically decreased to around half of the original data. As many methods (e.g. regression, principal component analysis, etc.) require complete sets of observations, this decrease in complete observations presents a very strong case for missing data imputation.metadata$Rows
and metadata$Columns
give you the dimensions of the dataframe, 2499 rows and 11 columns, in our case.metadata$Corr_matrix
- gives you the correlation matrix of all variables in the dataframe. The function assesses pairwise Pearson correlation coefficients using complete sets of pairs between any two variables. For example there are 2312 complete BMI-waist pairs (sum(!is.na(cleaned$BMI) & !is.na(cleaned$waist))
) - in this case correlation will be calculated based on these 2312 value pairs, even though the number of complete observations (no NAs for any variables) is 1349. This way all of the data will be utilized to obtain the most accurate correlation coefficients.metadata$Fraction_missingness
and metadata$Fraction_missingness_per_variable
return the the total fraction of missingness in the dataframe and the fraction of missingness per variable, respectively. In our case, the total fraction of missingness is 6.9% after removing one variable, PPG with >50% missing. Inspecting the fraction of missingness per variable it is also apparent that the true missing data fraction of TG and HDL-C after converting -9s to NAs are 10.6% and 13.7%, respectively.metadata$Total_NA
and metadata$NA_per_variable
return the total number of missing values in the dataframe and per variable.metadata$MD_Pattern
- The missing data pattern is calculated using mice::md.pattern (?mice::md.pattern
).metadata$NA_Correlations
- Pairwise correlation point-biserial correlation coefficients between variable pairs. Namely, for each calculation, variables are converted to TRUE/FALSE based on missing data status (missing or not missing) and these boolean vectors and correlated to the native variables. The resulting correlation matrix has rows with variable names + _is.na - the rows represent the converted variables, while the columns are the original variables.metadata$NA_Correlation_plot
- A correlation plot based on the previous object, metadata$NA_Correlations
. metadata$min_PDM_thresholds
- This object is a small dataframe that informs about the various min_PDM (minimum number of observations per distinct missing data pattern) values. min_PDM will need to be defined in further functions, so it is informative to get a sense on what is the proportion of missing data patterns that should be used in the simulation framework. In the table below we see that if this number is too high (only those missing data patterns are considered that describe 100 or more observations) than more than 80% of observations with other missing data patterns will be ignored in the simulation framework. On the other hand, if this number is set too low, e.g. below 5, than this may result in retaining almost all missing data patterns that are applicable to only a few observations resulting in a very complicated pattern that the algorithm might fail to apply to your dataset. Therefore, this number will need to be set carefully later. This table shows approximate percentages of observations with missing data retained with setting min_PDM in later steps.#> min_PDM_threshold perc_obs_retained
#> 1 5 82
#> 2 10 78
#> 3 20 75
#> 4 50 53
#> 5 100 19
#> 6 200 0
#> 7 500 0
#> 8 1000 0
metadata$Vars_above_half
- In case there are variables exceeding 50% missingness, this object will list these variables and the function will output a warning: Warning! Missingness exceeds 50% for variable(s) PPG. Although the pipeline will function with variables with high missingness, consider excluding these variables using missCompare::clean() and repeating function until no warnings are shown.metadata$Matrix_plot
- Matrix plot (ggplot2 object) visualizing missing data distribution. Missing data are represented by gray color, while data is colored by value. There are options to sort the dataframe by missingness or leave the dataframe unsorted. This plot (hopefully) gives you a visual description of how data is missing together and what fraction of the data is missing.metadata$Cluster_plot
- Dendrogram (ggplot2 object) visualizing the hierarchical clustering of missing data. On the plot, the clades (bifurcations) closer to Height 0 (the bottom of the plot) represent closer relationships in terms of missing data - i.e. those variables in one group are more likely to be missing together compared to the rest. In our case, blood pressures SBP and DBP are missing together often and so as the three lipid traits, TC, TG and HDL-C. With respect to the other variables, waist and BMI have a closer relationship compared to the rest.Having learned the characteristics of your data, the next step is to assemble a framework in which you can test the performance of various missing data algorithms on simulated data. This can be achieved in one step using missCompare::impute_simulated()
, but let’s look under the hood and see how this function is operating in the background. You can access these functions in case you would like to play with parameters.
First, missCompare::simulate()
will create a dataset that resembles your original dataframe, but with no missing data. The dimensions of the simulated dataframe and all correlations between variables will match the original data - essentially, the only major difference between the simulated dataset and your original data is that here, all variables are normally distributed with a custom set mean and standard deviation. Let’s observe:
simulated <- missCompare::simulate(rownum = metadata$Rows,
colnum = metadata$Columns,
cormat = metadata$Corr_matrix,
meanval = 0,
sdval = 1)
Note how the first three arguments are elements of the output list (metadata
) from the previous function missCompare::get_data()
.
The output simulated
is a list, where the first item is the simulated matrix and the next two objects are samples from the correlation matrix of the original and the simulated data.
With this done, missing data can be spiked in in the simulated data using various patterns (MCAR, MAR, MNAR and MAP). This will be done with respect to the missing data fraction for each variable in the original dataframe.
MCAR missingness is simple, NAs are spiked in randomly. Please note that in all functions below, min_PDM is set to 10. As it was shown before when looking at outputs from missCompare::get_data()
, this will result in the retention of the most important missing data patterns in which approximately 78% of all observations will be accounted for.
missCompare::MCAR(simulated$Simulated_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10)
MAR and MNAR missingness follow similar logic. With MNAR, NAs are spiked in based on the variable’s own values, while with MAR, NAs are spiked in based on all other variables’ values.
missCompare::MAR(simulated$Simulated_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10)
missCompare::MNAR(simulated$Simulated_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10)
MAP missingness is a possible combination of any of the previous three missing data pattern. You can define what pattern you assume per variable. In our case, let’s say that all variables are missing randomly, except for education (the last variable), which is missing not at random. This can be defined as:
missCompare::MAP(simulated$Simulated_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10,
assumed_pattern = c(rep("MCAR", 10), "MNAR"))
All patterns can be run with a single line using missCompare::all_patterns()
.
Again, missCompare::impute_simulated()
does all these in one step, and more:
1. it simulates the dataset with no missing values;
2. spikes in NAs in MCAR, MAR, MNAR and - if you define a custom pattern - MAP patterns;
3. imputes missing data using a curated list of 16 algorithms;
4. compares computation time, evaluates imputation accuracy by calculating RMSE, MAE and KS values between the imputed and the original data points.
This all can be achieved by a single command! This means you DON’T have to run missCompare::simulate()
, missCompare::MCAR()
, missCompare::MAR()
, missCompare::MNAR()
, missCompare::MAP()
or missCompare::all_patterns()
. The command goes as (note that we defined assumed_pattern = NA
- the default - so MAP pattern will be skipped):
missCompare::impute_simulated(rownum = metadata$Rows,
colnum = metadata$Columns,
cormat = metadata$Corr_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10,
n.iter = 50,
assumed_pattern = NA)
The n.iter
argument controls how many times the process will repeat itself (spiking in missing data, imputation and assessment of imputation accuracy, that is).
When this is completed (depending on the size of you data, it can take some time!), you should obtain graphs that give you visual aid on which algorithms perform in the most desired fashion. The less the error, and the more the imputed values’ distributions match the original values’ distributions, the better - i.e. lower RMSE, MAE and KS D statistic values are better! Computation time of course can also play a role in your decision on what to use eventually!
Let’s look at some plots:
Please note that RMSE and MAE measures correlate strongly. It is interesting to consider that there are two different main imputation strategies - one group of methods focus more on prediction, while the others perform imputation per se. To give an intuitive explanation, while predictive algorithms set up models that try to predict missing data based on all the other data available and thereby often ignore the uncertainty of missing values, multiple imputation algorithms focus more on variables distributions.
Hopefully, by now you have a good idea on which method you would like to use to impute missing data for your dataset - you will be able to do this using missCompare::impute_data()
. Please note that while some algorithms can safely handle factors (random replacement, mice, mi, missForest, Hmisc, VIM), some do not (mean and median imputation, missMDA algorithms; pcaMethods algorithms; Amelia II). If you have factors and would like to use one of the methods that are able to handle them, you can directly use your original dataset here (without string variables). In case your choice is another method or you don’t have factors in your original data, you can safely used the dataframe output by missCompare::clean()
.
Let’s consider a scenario where you choose missForest to impute your data. You would like to consider the original factor variables, you do not want to scale your data and you would like to create 10 imputed dataframes in order to set up a multiple imputation framework. Note, that setting up a multiple imputation framework is only a viable option with probabilistic models (e.g. you can impute you data with Mean imputation a billion times, but the NAs will always be replaced with exactly the same mean values!).
imputed <- missCompare::impute_data(clindata_miss,
scale = F,
n.iter = 10,
sel_method = c(14)) # 14 is the code for missForest
The 10 imputed dataframes will be available under imputed$missForest_imputation[[1]]
, imputed$missForest_imputation[[2]]
, imputed$missForest_imputation[[3]]
, etc…, while all the other elements in the list will be empty (e.g. nothing in imputed$median_imputation
).
Let’s see what happens when you similarly define 10 copies, but choose a non-probabilistic method - Mean imputation in this case.
imputed <- missCompare::impute_data(cleaned,
scale = T,
n.iter = 10,
sel_method = c(3)) # 3 is the code for mean imputation
#> [1] "Mean imputation - in progress"
You will see that the algorithm runs only 1 iteration for Mean imputation, and there is only one object, under imputed$mean_imputation[[1]]
. There are no imputed$mean_imputation[[2]]
, imputed$mean_imputation[[3]]
, etc… created at all.
Well done! You have imputed your data. Almost there, only one step remains - to assess how the imputation affected your dataset.
The function missCompare::post_imp_diag()
will help you assess how the imputation affected important parameters of your data.
Let’s run this with the results of the Mean imputation algorithm (imputed$mean_imputation[[1]]
):
The diag
list contains several interesting assessments. The Histograms
and Boxplots
lists should give an idea on the distributions of the original and the imputed values. Perhaps the limitations of mean imputation are the most apparent on these plots. For example, let’s inspect SBP levels (scaled to mean 0 in the data!):
This, of course will look different for almost any other models, e.g. Amelia II:
The Barcharts
list only contains elements if there were factors in the original data.
The Statistics
list shows the mean and SD values for the original and imputed values per variables and calculates Welch’s t test P values (testing for mean difference) and Kolmogorov-Smirnov test statistic D (testing for unequal distributions) on the two groups of values. Low Welch’s t test P values and high Kolmogorov-Smirnov test statistics reflect worse results.
#> $age
#> Mean_original SD_original Mean_imputed SD_imputed Welch_ttest_P
#> 2.543198e-16 1.000000e+00 1.755020e-01 1.083384e+00 1.817019e-01
#> KS_test.D
#> 1.396733e-01
#>
#> $sex
#> Mean_original SD_original Mean_imputed SD_imputed Welch_ttest_P
#> 1.368959e-16 1.000000e+00 -7.847479e-02 1.057142e+00 5.360356e-01
#> KS_test.D
#> 4.071899e-01
The Variable_clusters_orig
and Variable_clusters_imp
plots visualize the clusters of variables before and after the imputation. The clusters should not change too much!
Great, these two do look very similar with only very small differences.
Finally, the Correlation_plot
shows bootstrapped correlation coefficients from the original data and the imputed data. You can specify how many iterations you would like to run to obtain bootstrapped correlation coefficients and 95% confidence intervals using the n.boot
argument in the function. In this vignette this is set to 5, which is far too low. The blue line has intercept 0 and slope 1 and correlation coefficients (represented by the dots and the red line) should align this line as much as possible. With mean and median imputations, the line is often “turned clockwise”, with a lower slope than 1 (regression to the mean).
#> `geom_smooth()` using formula 'y ~ x'
Well done! You completed the tutorial and by now, you should ideally have one or more imputed datasets of your data. Awesome job! We hope that you achieved what you were set out to do and you are ready for the next steps in your statistical analysis! Please let us know if you have any issues here - missCompare GitHub - or contact the authors if you have any feedback!
– authors of missCompare