In this vignette, we illustrate our package with the two real data sets used in the article https://hal.archives-ouvertes.fr/hal-02936779v3.
install.packages("ZIprop")
install.packages("knitr")
install.packages("kableExtra")
install.packages("ggplot2")
install.packages("ggrepel")
install.packages("ggthemes")
install.packages("stringr")
Note that knitr and kableExtra are used to visualize tables, ggplot2, ggrepel and ggthemes to plot graphics and stringr to manipulate char.
suppressPackageStartupMessages(library(ZIprop))
library(knitr)
library(kableExtra)
library(ggplot2)
library(ggrepel)
library(ggthemes)
library(stringr)
We consider an Equine Influenza outbreak in 2003 in race horses from different training yards in Newmarket. In what follows, we use four discrete factors and one continuous factor computed from the observed variables age, sex and training yard:
The weight is the probability of transmission between horses.
Some of these factors are missing for some target-contributor pairs. Hence, the tests for assessing the effect of a given factor on the transmission probability are applied on the subset of complete data for this factor. The data set used for this study is available in the package or at this link https://doi.org/10.5281/zenodo.4837560.
data(equineDiffFactors)
summary(equineDiffFactors)
#> ID.source ID.recep weight SameYard
#> Length:2256 Length:2256 Min. :0.0000000 Length:2256
#> Class :character Class :character 1st Qu.:0.0000000 Class :character
#> Mode :character Mode :character Median :0.0000000 Mode :character
#> Mean :0.0208333
#> 3rd Qu.:0.0004431
#> Max. :1.0000000
#> SameSex DiffAge DistYard TransSex
#> Length:2256 Length:2256 Min. : 0.000 Length:2256
#> Class :character Class :character 1st Qu.: 1.107 Class :character
#> Mode :character Mode :character Median : 2.860 Mode :character
#> Mean : 23.244
#> 3rd Qu.: 4.419
#> Max. :222.182
In this example, only \(B=1000\) permutations are used. It is recommended to study the stability of the permutations tests results to choose the best number of permutations.
res_equine = permDT (
equineDiffFactors,
ColNameFactor = colnames(equineDiffFactors)[-c(1:3)],
B = 1000,
nclust = 1,
ColNameWeight = "weight",
ColNameRecep = "ID.recep",
ColNameSource = "ID.source",
num_class = "DistYard",
other_class = colnames(equineDiffFactors)[-c(1:3,7)],
multiple_test = TRUE
)
Note that all non-numeric factor are specified in the other_class option.
kbl(res_equine$dt[,c(1,2)], caption = "ALL") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
stat | pv_diff | |
---|---|---|
SameYard | 0.0500361 | 0.000 |
SameSex | 0.0066463 | 0.042 |
DiffAge | 0.0009605 | 0.789 |
DistYard | -0.2230616 | 0.000 |
TransSex | 0.0419205 | 0.000 |
Factors SameYard, SameSex, DistYard and TransSe are significantly correlated to the transmission probability whereas DiffAg is not.
kbl(res_equine$dt_multi$SameYard[,-c(3:4)], caption = "SameYard") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
stat | pv_diff | |
---|---|---|
0 - 1 | -444.0667 | 0 |
The test statistics of SameYard and DistYard being negative, horses trained in the same yard or in nearby yards have a higher chance to be linked by a transmission. This is a clearly intuitive result certainly due to higher contact rate in shared training areas.
kbl(res_equine$dt_multi$SameSex[,-c(3:4)], caption = "SameSex") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
stat | pv_diff | |
---|---|---|
0 - 1 | -24.7739 | 0.041 |
The statistics of post-hoc univariate tests for factor SameSex is also negative, which means that the virus better circulates between horses with the same sex.
kbl(res_equine$dt_multi$TransSex[,-c(3:4)], caption = "TransSex") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
stat | pv_diff | |
---|---|---|
F->F - F->M | 64.95997 | 0.002 |
F->F - M->F | 111.12337 | 0.000 |
F->F - M->M | 80.01511 | 0.000 |
F->M - M->F | 46.16340 | 0.002 |
F->M - M->M | 15.05515 | 0.319 |
M->F - M->M | -31.10825 | 0.054 |
Moreover, the post-hoc tests on the TransSex modalities show that only the difference between F \(\to\) M - M \(\to\) M (and M \(\to\) F - M \(\to\) M when one considers the corrected p-values) are not significant. The results on the p-values and the sign of the statistics show that transmissions between females are favored compared to all other possible combinations (F \(\to\) F transmissions have positive probabilities 1.8 times more than expected under complete randomness.
First of all, only the factor that are significant is chosen to compute the performance indicator and all the NA has to be removed from the data set.
DT_omitna = na.omit(equineDiffFactors[,c(1:3,4,5,7,8),with=F])
summary(DT_omitna )
#> ID.source ID.recep weight SameYard
#> Length:650 Length:650 Min. :0.000000 Length:650
#> Class :character Class :character 1st Qu.:0.000000 Class :character
#> Mode :character Mode :character Median :0.000000 Mode :character
#> Mean :0.026929
#> 3rd Qu.:0.001019
#> Max. :0.982702
#> SameSex DistYard TransSex
#> Length:650 Min. :0.0000 Length:650
#> Class :character 1st Qu.:0.3463 Class :character
#> Mode :character Median :3.1471 Mode :character
#> Mean :2.6424
#> 3rd Qu.:4.6508
#> Max. :7.3520
max_generations and wait_generations are set to low values because the number of factors is low. It is recommended to increase these values id there is no convergence due for example to a higher number of factor.
system.time({I_max = indicator_max(
DT_omitna,
ColNameFactor = colnames(DT_omitna)[-c(1:3)],
ColNameWeight = "weight",
bounds = c(-10, 10),
max_generations = 50,
hard_limit = TRUE,
wait_generations = 10,
other_class = colnames(DT_omitna)[-c(1:3,6)]
)})
#> user system elapsed
#> 1.670 0.051 1.721
I_max$indicator
#> [,1]
#> [1,] 0.2123575
The performance indicator takes the value \(I_{\hat{{\beta}}}(\mathbb{X},Z)=0.21\) using the four selected factors. This relatively low value, which indicates that there is a moderate correlation between the combination of the four factors and the transmissions, can actually be viewed as quite large given the fact that we only consider very basic factors to \textit{predict} the transmissions.
The proba are the mixture probabilities i.e. the similarity between targets (UE countries) and contributors (US and CA states) in terms of mortality dynamics during the first wave of the Covid-19. We consider 29 variables related to economy, demography, health, healthcare system and climate. More precisely, the objective is to identify factors negatively correlated with the response, i.e., the lower the distance between two geographic entities with respect to a given variable, the higher the mixture probability. Consequently, we use the univariate test for continuous factors, which are computed for each target-contributor pair by \(x_k^{(i,j)}= |x_k^i - x_k^j|\). The data set used for this study is available in https://doi.org/10.5281/zenodo.4769671.
data(diffFactors)
summary(diffFactors)
#> ID.recep ID.source proba pop_diff
#> Length:483 Length:483 Min. :0.000000 Min. : 1760
#> Class :character Class :character 1st Qu.:0.000000 1st Qu.: 2608022
#> Mode :character Mode :character Median :0.000000 Median : 6101159
#> Mean :0.006986 Mean :18246800
#> 3rd Qu.:0.000000 3rd Qu.:30958172
#> Max. :0.999995 Max. :78256584
#> density_diff medianage_diff urbanpop_diff hospibed_diff
#> Min. : 0.6199 Min. : 0.100 Min. : 0.00 Min. :0.000
#> 1st Qu.: 39.8375 1st Qu.: 2.600 1st Qu.: 4.75 1st Qu.:1.100
#> Median : 79.7939 Median : 4.400 Median :10.60 Median :2.400
#> Mean :121.0724 Mean : 4.616 Mean :12.58 Mean :2.683
#> 3rd Qu.:171.3937 3rd Qu.: 6.400 3rd Qu.:18.00 3rd Qu.:4.180
#> Max. :502.6700 Max. :12.600 Max. :41.00 Max. :6.600
#> smokers_diff lung_diff gdp2019_diff healthexp_diff
#> Min. : 0.000 Min. : 0.18 Min. : 4008 Min. : 27
#> 1st Qu.: 5.825 1st Qu.:12.70 1st Qu.: 168604 1st Qu.: 3019
#> Median : 9.600 Median :19.45 Median : 390576 Median : 4711
#> Mean :10.717 Mean :20.17 Mean : 817334 Mean : 8599
#> 3rd Qu.:14.575 3rd Qu.:26.50 3rd Qu.:1257968 3rd Qu.: 6388
#> Max. :31.350 Max. :47.52 Max. :3542655 Max. :70453
#> fertility_diff avghumidity_diff pop_tot_0_14_diff pop_male_0_14_diff
#> Min. :0.0000 Min. : 0.03333 Min. :0.05887 Min. :0.00142
#> 1st Qu.:0.1100 1st Qu.: 3.40000 1st Qu.:1.59546 1st Qu.:1.63400
#> Median :0.2200 Median : 6.11667 Median :3.02960 Median :3.02919
#> Mean :0.2181 Mean : 7.73571 Mean :3.05254 Mean :3.02813
#> 3rd Qu.:0.3200 3rd Qu.: 9.36667 3rd Qu.:4.30214 3rd Qu.:4.21944
#> Max. :0.6304 Max. :43.18333 Max. :8.49047 Max. :8.29937
#> pop_female_0_14_diff pop_tot_15_64_diff pop_male_15_64_diff
#> Min. :0.008173 Min. :0.0179 Min. :0.002948
#> 1st Qu.:1.623634 1st Qu.:0.8045 1st Qu.:0.796625
#> Median :3.035270 Median :1.5992 Median :1.647397
#> Mean :3.077687 Mean :1.8639 Mean :1.897724
#> 3rd Qu.:4.365154 3rd Qu.:2.7271 3rd Qu.:2.838292
#> Max. :8.790995 Max. :5.6677 Max. :6.671684
#> pop_female_15_64_diff pop_tot_65_up_diff pop_male_65_up_diff
#> Min. :0.00907 Min. : 0.0126 Min. :0.01229
#> 1st Qu.:0.99654 1st Qu.: 2.5069 1st Qu.:1.90787
#> Median :1.98855 Median : 4.0096 Median :3.46598
#> Mean :2.16220 Mean : 4.1311 Mean :3.53025
#> 3rd Qu.:3.11756 3rd Qu.: 5.6755 3rd Qu.:4.99559
#> Max. :6.42441 Max. :10.7818 Max. :9.56801
#> pop_female_65_up_diff pop_male_diff pop_female_diff
#> Min. : 0.004197 Min. :0.001009 Min. :0.001009
#> 1st Qu.: 3.130266 1st Qu.:0.291030 1st Qu.:0.291030
#> Median : 4.702279 Median :0.585758 Median :0.585758
#> Mean : 4.808678 Mean :0.720016 Mean :0.720016
#> 3rd Qu.: 6.564720 3rd Qu.:1.052228 3rd Qu.:1.052228
#> Max. :11.880752 Max. :2.987863 Max. :2.987863
#> life_expectancy_diff gdp_capita_diff physicians_per_1K_diff
#> Min. :0.004878 Min. : 92.98 Min. :0.0040
#> 1st Qu.:1.329268 1st Qu.:13382.38 1st Qu.:0.4695
#> Median :2.534146 Median :28834.82 Median :1.0730
#> Mean :2.802106 Mean :29834.47 Mean :1.1275
#> 3rd Qu.:4.113415 3rd Qu.:43230.40 3rd Qu.:1.6685
#> Max. :7.551220 Max. :81392.92 Max. :3.1520
#> nurses_per_1K_diff obesity_diff tmax_diff prec_diff
#> Min. : 0.010 Min. : 0.000 Min. : 0.003259 Min. : 0.1023
#> 1st Qu.: 1.385 1st Qu.: 2.700 1st Qu.: 2.301867 1st Qu.:11.3592
#> Median : 2.760 Median : 4.900 Median : 4.947688 Median :23.5209
#> Mean : 3.329 Mean : 5.265 Mean : 6.138662 Mean :25.5225
#> 3rd Qu.: 4.570 3rd Qu.: 8.000 3rd Qu.: 9.238850 3rd Qu.:37.0643
#> Max. :11.460 Max. :12.600 Max. :22.713560 Max. :79.0801
In this example, only \(B=1000\) permutations are used. It is recommended to study the stability of the permutations tests results to choose the best number of permutations.
res = permDT (
diffFactors,
ColNameFactor = colnames(diffFactors)[-c(1:3)],
B = 1000,
nclust = 1,
ColNameWeight = "proba",
ColNameRecep = "ID.recep",
ColNameSource = "ID.source",
multiple_test = TRUE
)
res$stat2 = res$stat
res$stat2[res$pv_neg>0.05] = NA
res$Factor = str_remove(rownames(res),"_diff")
res$Factor = factor(res$Factor, levels = res$Factor)
res$dec = rep(c(0.15,-0.15),50)[1:nrow(res)]
res$dec[res$pv_neg>0.05] = NA
p = ggplot(res,aes(Factor,pv_neg)) + geom_point(colour = "black") +
ylim(-0.1,1) +
xlab("") +
ylab("P-value") +
theme_classic() +
geom_rangeframe() +
geom_hline(yintercept = 0.05, color ="red") +
#geom_text_repel(data = res, aes(label = round(stat2,2)),
# point.padding = 1) +
geom_label_repel(data = res, aes(label = round(stat2,2)), fill = "white",
nudge_y = na.omit(res$dec) )+
ggtitle("Permutation test results")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
The figure shows the p-values obtained for each factor and the Spearman's correlation for significant factors. We identified eleven impacting factors: hospibed, smokers, lung, healthexp, gdp_capita, fertility, urbanpop, nurses_per_1K, gdp2019, pop_female_0_14, and pop_tot_0_14. The figure shows that Spearman's correlation is negative for significant factors. This result is consistent with our objective (to identify the significant factors negatively correlated to the response).
Only the factor that are significant is chosen to compute the performance indicator. The computation time it is quite long (around seconds).
ind_fact_select = which(colnames(diffFactors) %in% rownames(res[res$pv_neg<0.05,]) )
system.time({I_max = indicator_max(
diffFactors,
ColNameFactor = colnames(diffFactors)[c(ind_fact_select)],
ColNameWeight = "proba",
bounds = c(-10, 10),
max_generations = 200,
hard_limit = TRUE,
wait_generations = 50
)})
I_max$indicator
Then, we applied the multivariate analysis based on the eleven significant factors. The performance indicator is equal to \(I_{\hat{{\beta}}}(\mathbb{X},Z)=0.73\), which shows that a high monotonous dependency exists between the mixture probabilities and these factors.