If the package OTrecod is not installed in their current R versions, users can install it by following the standard instruction:
install.packages("OTrecod")
Each time an R session is opened, the OTrecod library must be loaded with:
library(OTrecod)
Moreover, the development version of OTrecod can be installed actually from GitHub with:
# Install development version from GitHub
::install_github("otrecoding/OTrecod") devtools
This vignette illustrates how to use the tools contained in the OTrecod package to solve a variable recoding problem frequently encountered in the context of data fusion. For more details about the theory of the algorithms used in the functions OT_outcome and OT_joint of the package, user can consult (1),(2) and the documentation linked to each function.
For this example, we have transformed an available database called samp.A from the StatMatch package (see (3) and the help of samp.A for more details). The samp.A database provides a limited number of variables observed at persons levels among those usually collected in the European Union Statistics on Income and Living Conditions Survey (the EU–SILC survey). In this database, the variable c.neti is a factor that corresponds to the persons net income of thousand of Euros categorized in 7 ordered classes. The raw distribution of c.neti is done in the following results:
library(StatMatch)
data(samp.A)
dim(samp.A)
#> [1] 3009 13
head(samp.A)
#> HH.P.id area5 urb hsize hsize5 age c.age sex marital edu7 n.income
#> 21384 10149.01 NE 1 1 1 85 (64,104] 2 3 3 1677
#> 35973 17154.02 NE 1 2 2 78 (64,104] 1 2 3 13520
#> 11774 5628.01 NO 2 1 1 48 (44,54] 1 3 3 20000
#> 32127 15319.01 S 1 2 2 78 (64,104] 1 2 1 12428
#> 6301 2973.05 S 2 5 >=5 17 [16,34] 1 1 1 0
#> 12990 6206.02 C 2 2 2 28 [16,34] 2 2 5 0
#> c.neti ww
#> 21384 (0,10] 3591.8939
#> 35973 (10,15] 415.1592
#> 11774 (15,20] 2735.4029
#> 32127 (10,15] 1239.5086
#> 6301 (-Inf,0] 5362.7588
#> 12990 (-Inf,0] 2077.7137
table(samp.A$c.neti) # Repartition of c.neti in the sample
#>
#> (-Inf,0] (0,10] (10,15] (15,20] (20,25] (25,35] (35, Inf]
#> 564 671 541 500 307 272 154
To construct a standard recoding problem in data fusion, the samp.A database has been transformed as follows:
= as.numeric(samp.A$c.neti)
c.neti $c.neti.bis = as.factor(ifelse(c.neti %in% c(1,2),1,
samp.Aifelse(c.neti %in% c(3,4),2,
ifelse(c.neti %in% c(5,6),3,4))))
= samp.A[1:200,c(2:3,5,7:9,12:13)]
data1 colnames(data1)[4] = "age"
head(data1)
#> area5 urb hsize5 age sex marital c.neti ww
#> 21384 NE 1 1 (64,104] 2 3 (0,10] 3591.8939
#> 35973 NE 1 2 (64,104] 1 2 (10,15] 415.1592
#> 11774 NO 2 1 (44,54] 1 3 (15,20] 2735.4029
#> 32127 S 1 2 (64,104] 1 2 (10,15] 1239.5086
#> 6301 S 2 >=5 [16,34] 1 1 (-Inf,0] 5362.7588
#> 12990 C 2 2 [16,34] 2 2 (-Inf,0] 2077.7137
= samp.A[201:350,c(3,5:6,8:11,13:14)]
data2 head(data2)
#> urb hsize5 age sex marital edu7 n.income ww c.neti.bis
#> 39565 3 2 81 2 2 1 6448 1129.274 1
#> 36490 1 >=5 48 1 3 2 16423 3082.331 2
#> 27529 2 2 66 1 2 3 15600 2433.020 2
#> 201 2 4 26 1 1 3 12876 1869.286 2
#> 31375 2 2 53 1 2 3 25633 2125.361 3
#> 3226 3 1 45 2 3 3 2611 1855.230 1
In conclusion, after transformation, we had:
table(data1$c.neti) # 7 levels in data1
#>
#> (-Inf,0] (0,10] (10,15] (15,20] (20,25] (25,35] (35, Inf]
#> 37 40 34 49 18 13 9
table(data2$c.neti.bis) # 4 levels in data2
#>
#> 1 2 3 4
#> 60 57 26 7
colnames(data1)
#> [1] "area5" "urb" "hsize5" "age" "sex" "marital" "c.neti"
#> [8] "ww"
colnames(data2)
#> [1] "urb" "hsize5" "age" "sex" "marital"
#> [6] "edu7" "n.income" "ww" "c.neti.bis"
intersect(colnames(data1), colnames(data2)) # the susbet of a priori shared variables
#> [1] "urb" "hsize5" "age" "sex" "marital" "ww"
Assuming that the encoding of c.neti is unknown for the subjects of data2 and that the encoding of c.neti.bis is unknown for the subjects of data1, the functions of the OTrecod package solve this recoding problem by predicting the missing scale of the persons net income in one or the two databases. This solution allows the user to fusion his two databases and finally works with a bigger, unique and synthetic dataset.
For the rest of the study, c.neti and c.neti.bis are called the target variables of data1 and data2 respectively. we deliberately limit this example to the prediction of the variable c.neti.bis in data1 but note that the proposed approach would be the same for the prediction of c.neti in data1.
Knowing the objective of the study, we first prepare the 2 databases to data fusion. The two functions dedicated to this data fusion in the OTrecod package expect a specific structure of database as argument. The merge_dbs function assists user in this task by:
= merge_dbs(data1, data2, NAME_Y = "c.neti", NAME_Z = "c.neti.bis",
db_test ordinal_DB1 = c(2,3,4,7), ordinal_DB2 = c(1:2,6,9))
#> DBS MERGING in progress. Please wait ...
#> DBS MERGING OK
#> -----------------------
#>
#> SUMMARY OF DBS MERGING:
#> Nb of removed subjects due to NA on targets: 0(0%)
#> Nb of removed covariates due to differences between the 2 bases: 1
#> Nb of remained covariates: 5
#> Imputation on incomplete covariates: NO
summary(db_test)
#> Length Class Mode
#> DB_READY 8 data.frame list
#> ID1_drop 0 -none- character
#> ID2_drop 0 -none- character
#> Y_LEVELS 7 -none- character
#> Z_LEVELS 4 -none- character
#> REMOVE1 1 -none- character
#> REMOVE2 0 -none- NULL
#> REMAINING_VAR 5 -none- character
#> IMPUTE_TYPE 1 -none- character
#> DB1_raw 8 data.frame list
#> DB2_raw 9 data.frame list
#> SEED 1 -none- numeric
$REMAINING_VAR
db_test#> [1] "hsize5" "marital" "sex" "urb" "ww"
$REMOVE1
db_test#> [1] "age"
$REMOVE2
db_test#> NULL
$ID1_drop; db_test$ID2_drop
db_test#> character(0)
#> character(0)
$DB_READY[c(1:5,201:205),] # The 5 1st subjects of the two databases
db_test#> DB Y Z hsize5 marital sex urb ww
#> 21384 1 (0,10] <NA> 1 3 2 1 3591.8939
#> 35973 1 (10,15] <NA> 2 2 1 1 415.1592
#> 11774 1 (15,20] <NA> 1 3 1 2 2735.4029
#> 32127 1 (10,15] <NA> 2 2 1 1 1239.5086
#> 6301 1 (-Inf,0] <NA> >=5 1 1 2 5362.7588
#> 39565 2 <NA> 1 2 2 2 3 1129.2739
#> 36490 2 <NA> 2 >=5 3 1 1 3082.3314
#> 27529 2 <NA> 2 2 2 1 2 2433.0201
#> 201 2 <NA> 2 4 1 1 2 1869.2859
#> 31375 2 <NA> 3 2 2 1 2 2125.3614
In output:
Among the set of shared variables kept in the output DB.READY database of the merge_dbs, it is important to discard those that never appear not good predictors of the persons net income whatever the considered database. The subset of remaining variables will be the matching variables. This selection of matching variables can be done using the select_pred function of the package.
This function proposes two levels of study to conclude:
The selection of the best predictors of the persons net income must be done in the two databases separately but the selec_pred functions allowed overlayed databases as arguments using the following code:
# for data1
= select_pred(db_test$DB_READY,Y = "Y", Z = "Z", ID = 1, OUT = "Y",
test_DB1 quanti = 8, nominal = c(1,5:6,7), ordinal = c(2:4),
convert_num = 8, convert_clss = 4,
thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.10,
RF = TRUE, RF_SEED = 3017)
#> The select_pred function is running. Please wait ...
#> Risk of collinearity between predictors detected: Some predictors will be dropped during RF process
# for data2
= select_pred(db_test$DB_READY,Y = "Y", Z = "Z", ID = 1, OUT = "Z",
test_DB2 quanti = 8, nominal = c(1,5:6,7), ordinal = c(2:4),
convert_num = 8, convert_clss = 4,
thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.10,
RF = TRUE, RF_SEED = 3017)
#> The select_pred function is running. Please wait ...
#>
#> Risk of collinearity between predictors detected: Some predictors will be dropped during RF process
As input:
Notice that it is important to keep the same arguments in the two selections (test_DB1 and test_DB2) for an optimal comparability. Here the standard RF process is used, and the ww variable is converted in categorical type before the selection using convert_num.
Let see the result for data1:
summary(test_DB1)
#> Length Class Mode
#> seed 1 -none- numeric
#> outc 1 -none- character
#> thresh 3 -none- numeric
#> convert_num 1 -none- character
#> DB_USED 8 data.frame list
#> vcrm_OUTC_cat 5 data.frame list
#> cor_OUTC_num 5 data.frame list
#> vcrm_X_cat 5 data.frame list
#> cor_X_num 5 data.frame list
#> FG_test 3 -none- numeric
#> collinear_PB 2 -none- list
#> drop_var 1 -none- character
#> RF_PRED 4 -none- numeric
#> RF_best 2 -none- character
$vcrm_OUTC_cat
test_DB1#> name1 name2 V_Cramer CorrV_Cramer N
#> 4 Y sex 0.4384 0.4036 200
#> 3 Y marital 0.2452 0.1740 200
#> 5 Y urb 0.1703 0.0000 200
#> 2 Y hsize5 0.1634 0.0000 200
#> 6 Y ww 0.1465 0.0000 200
$collinear_PB
test_DB1#> $VCRAM
#> name1 name2 V_Cramer CorrV_Cramer N
#> 21 hsize5 marital 0.5038 0.4858 200
#>
#> $SPEARM
#> [1] name1 name2 RANK_COR pv_COR_test N
#> <0 rows> (or 0-length row.names)
# Results from RF
$drop_var
test_DB1#> [1] "marital"
$RF_PRED
test_DB1#> sex hsize5 urb ww
#> 86.8886 7.9677 5.1437 0.0000
In the set of shared variables, all variables are now categorical (factors or ordered factors). According to the vcrm_OUTC_cat output object, the best predictors of c.neti are: sex, marital, urb, hsize5, and ww in that order. Nevertheless, a risk of collinearity is detected between marital and hsize5. The RF process finally suggests to drop marital and to keep only sex, hsize5 and eventually urb as matching variables.
Let see the result for data2:
summary(test_DB2)
#> Length Class Mode
#> seed 1 -none- numeric
#> outc 1 -none- character
#> thresh 3 -none- numeric
#> convert_num 1 -none- character
#> DB_USED 8 data.frame list
#> vcrm_OUTC_cat 5 data.frame list
#> cor_OUTC_num 5 data.frame list
#> vcrm_X_cat 5 data.frame list
#> cor_X_num 5 data.frame list
#> FG_test 3 -none- numeric
#> collinear_PB 2 -none- list
#> drop_var 1 -none- character
#> RF_PRED 4 -none- numeric
#> RF_best 2 -none- character
$vcrm_OUTC_cat
test_DB2#> name1 name2 V_Cramer CorrV_Cramer N
#> 4 Z sex 0.4783 0.4583 150
#> 2 Z hsize5 0.2343 0.1693 150
#> 3 Z marital 0.1828 0.1160 150
#> 5 Z urb 0.1386 0.0000 150
#> 6 Z ww 0.0949 0.0000 150
$collinear_PB
test_DB2#> $VCRAM
#> name1 name2 V_Cramer CorrV_Cramer N
#> 21 hsize5 marital 0.3555 0.3177 150
#>
#> $SPEARM
#> [1] name1 name2 RANK_COR pv_COR_test N
#> <0 rows> (or 0-length row.names)
# Results from RF
$drop_var
test_DB2#> [1] "marital"
$RF_PRED
test_DB2#> sex hsize5 ww urb
#> 86.1852 13.1555 0.6593 0.0000
According to the vcrm_OUTC_cat output object, the best predictors of c.neti.bis are: sex, hsize5, marital, urb, and ww in that order. A risk of collinearity is also detected between marital and hsize5. The RF process here suggests to drop marital and to keep only sex, hsize5 as matching variables.
In conclusion:
The matching variable groups can be:
We finally keep the last group and dropped the other ones for the rest of the example.
= db_test$DB_READY[,-c(5,8)]
match_var c(1:5,201:205),]
match_var[#> DB Y Z hsize5 sex urb
#> 21384 1 (0,10] <NA> 1 2 1
#> 35973 1 (10,15] <NA> 2 1 1
#> 11774 1 (15,20] <NA> 1 1 2
#> 32127 1 (10,15] <NA> 2 1 1
#> 6301 1 (-Inf,0] <NA> >=5 1 2
#> 39565 2 <NA> 1 2 2 3
#> 36490 2 <NA> 2 >=5 1 1
#> 27529 2 <NA> 2 2 1 2
#> 201 2 <NA> 2 4 1 2
#> 31375 2 <NA> 3 2 1 2
This overlayed database is now ready for recoding.
The OTrecod package proposes actually two algorithms using optimal transportation theory (see (3) for details) to solve the recoding problem previously introduced. Each algorithm is stored in a unique function called OT_outcome and OT_joint. These two functions can predict the missing values of c.neti in data2, the missing values of c.neti.bis in data1 or the both using a same argument called which.DB.
As with the select_pred function, it is possible to transform directly the continuous matching variables if necessary using the convert_num and convert_clss arguments.
Let see the R approach for the prediction of c.neti.bis in data1.
The algorithm from OT_outcome function solves an optimization problem to transfer the distributions of the target variables (or outcome) from one database to another. Using this result, the initial version of the algorithm (see (2)) transfers the distribution of c.neti.bis in data2 to the distribution of c.neti.bis in data1, and (inversely for c.neti if necessary). The algorithm executes in another distinct step, a nearest neighbor procedure to affect the indidividual predictions of c.neti.bis in data1. This version of the algorithm is actually available by writing sequential as argument method of the OT_outcome function.
The algorithm assumes the strong hypothesis that the variable c.neti.bis has the same distribution in data1 and data2 (and so on for the variable c.neti). If in this example, by construction, this hypothesis is obviously verified, there are also several contexts where this latter no longer holds.
Enrichments have been thus proposed (see (2)) to relax this hypothesis via the maxrelax argument of OT_outcome and directly provides the individual predictions without using the nearest neighbor process. This algorithm is actually available by assigning the method argument to optimal in input of the OT_outcome function. For our example, the corresponding R code for the prediction of c.neti.bis in data1 is:
# sequential algorithm
= OT_outcome(match_var, nominal = c(1,5:6), ordinal = 2:4, dist.choice = "E",
mod1_seq maxrelax = 0 , indiv.method = "sequential", which.DB = "A")
#> ---------------------------------------
#> OT PROCEDURE in progress ...
#> ---------------------------------------
#> Type = OUTCOME
#> Distance = Euclidean
#> Percent closest knn = 100%
#> Relaxation parameter = NO
#> Relaxation value = 0
#> Individual pred process = Sequential
#> DB imputed = A
#> ---------------------------------------
summary(mod1_seq)
#> Length Class Mode
#> time_exe 1 difftime numeric
#> gamma_A 28 -none- numeric
#> gamma_B 28 -none- numeric
#> profile 5 data.frame list
#> res_prox 16 -none- list
#> estimatorZA 812 -none- numeric
#> estimatorYB 0 -none- NULL
#> DATA1_OT 8 data.frame list
#> DATA2_OT 7 data.frame list
# optimal algorithm with no relaxation term
= OT_outcome(match_var, nominal = c(1,5:6), ordinal = 2:4, dist.choice = "E",
mod2_opt maxrelax = 0, indiv.method = "optimal", which.DB = "A")
#> ---------------------------------------
#> OT PROCEDURE in progress ...
#> ---------------------------------------
#> Type = R-OUTCOME
#> Distance = Euclidean
#> Percent closest knn = 100%
#> Relaxation parameter = NO
#> Relaxation value = 0
#> Individual pred process = Optimal
#> DB imputed = A
#> ---------------------------------------
head(mod2_opt$profile)
#> ID sex_2 urb_2 urb_3 hsize5
#> 21384 P_1 1 0 0 1
#> 35973 P_2 0 0 0 2
#> 11774 P_3 0 1 0 1
#> 6301 P_4 0 1 0 5
#> 12990 P_5 1 1 0 2
#> 39835 P_6 1 1 0 4
dim(mod2_opt$profile)
#> [1] 29 5
$gamma_A
mod2_opt#> [,1] [,2] [,3] [,4]
#> [1,] 0.185 0.000 0.00000000 0.000000000
#> [2,] 0.200 0.000 0.00000000 0.000000000
#> [3,] 0.015 0.155 0.00000000 0.000000000
#> [4,] 0.000 0.225 0.02000000 0.000000000
#> [5,] 0.000 0.000 0.09000000 0.000000000
#> [6,] 0.000 0.000 0.06333333 0.001666667
#> [7,] 0.000 0.000 0.00000000 0.045000000
head(mod2_opt$DATA1_OT)
#> DB Y Z sex_2 urb_2 urb_3 hsize5 OTpred
#> 21384 1 (0,10] <NA> 1 0 0 1 1
#> 35973 1 (10,15] <NA> 0 0 0 2 2
#> 11774 1 (15,20] <NA> 0 1 0 1 2
#> 32127 1 (10,15] <NA> 0 0 0 2 2
#> 6301 1 (-Inf,0] <NA> 0 1 0 5 1
#> 12990 1 (-Inf,0] <NA> 1 1 0 2 1
As input:
In output, these two models provides lists of same structure:
The verif_OT function gives access to different tools for assessing the reliability of the individual predictions proposed by the OT_outcome and OT_joint functions.
# Validation of the mod1_seq model
= verif_OT(mod1_seq, stab.prob = TRUE, min.neigb = 3)
verif_out1 $conf.mat
verif_out1#> predZ
#> Y 1 2 3 4 Sum
#> (-Inf,0] 37 0 0 0 37
#> (0,10] 40 0 0 0 40
#> (10,15] 3 31 0 0 34
#> (15,20] 1 45 3 0 49
#> (20,25] 1 1 16 0 18
#> (25,35] 1 1 11 0 13
#> (35, Inf] 1 1 1 6 9
#> Sum 84 79 31 6 200
$res.prox
verif_out1#> N V_cram rank_cor
#> 200.000 0.860 0.871
$res.stab
verif_out1#> N min.N mean sd
#> 1st DB 112 3 0.976 0.107
# Validation of the mod2_seq model
= verif_OT(mod2_opt, stab.prob = TRUE, min.neigb = 3)
verif_out2 $conf.mat
verif_out2#> predZ
#> Y 1 2 3 4 Sum
#> (-Inf,0] 37 0 0 0 37
#> (0,10] 40 0 0 0 40
#> (10,15] 3 31 0 0 34
#> (15,20] 0 45 4 0 49
#> (20,25] 0 0 18 0 18
#> (25,35] 0 0 13 0 13
#> (35, Inf] 0 0 0 9 9
#> Sum 80 76 35 9 200
= (37+40+31+45+18+13+9)/200
rate_good_pred
rate_good_pred#> [1] 0.965
$res.prox
verif_out2#> N V_cram rank_cor
#> 200.000 0.960 0.939
$res.stab
verif_out2#> N min.N mean sd
#> 1st DB 112 3 0.986 0.084
# Validation of the mod3_opt model
= verif_OT(mod3_joint, stab.prob = TRUE, min.neigb = 3)
verif_jt $conf.mat
verif_jt#> predZ
#> Y 1 2 3 4 Sum
#> (-Inf,0] 33 4 0 0 37
#> (0,10] 32 8 0 0 40
#> (10,15] 4 29 1 0 34
#> (15,20] 11 19 19 0 49
#> (20,25] 0 10 5 3 18
#> (25,35] 1 2 10 0 13
#> (35, Inf] 6 1 0 2 9
#> Sum 87 73 35 5 200
$res.prox
verif_jt#> N V_cram rank_cor
#> 200.000 0.550 0.613
$res.stab
verif_jt#> N min.N mean sd
#> 1st DB 112 3 0.869 0.175
As input:
As output, we have selected the most interesting one to make the comparison between the three tested models:
The two first models are here more adapted when the strong distributional hypothesis previously introduced holds. Nevertheless, the mod3_jt model could be significantly improved by adding appropriate relaxation and regularization terms, so now, R user, test it by yourself !
Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Korosok MR, savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics.Volume 16, Issue 1, 20180106, eISSN 1557-4679.
Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association.
D’Orazio, M (2015). Integration and imputation of survey data in R: the StatMatch package. Romanian Statistical Review, vol. 63(2), pages 57-68
Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307 - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307