Introduction

In this vignette, we illustrate our package with the two real data sets used in the article https://hal.archives-ouvertes.fr/hal-02936779v3.

Install and load package

Install packages

install.packages("ZIprop")
install.packages("knitr")
install.packages("kableExtra")
install.packages("ggplot2")
install.packages("ggrepel")
install.packages("ggthemes")
install.packages("stringr")

Note that knitr and kableExtra are used to visualize tables, ggplot2, ggrepel and ggthemes to plot graphics and stringr to manipulate char.

Load packages

suppressPackageStartupMessages(library(ZIprop))
library(knitr)
library(kableExtra)
library(ggplot2)
library(ggrepel)
library(ggthemes)
library(stringr)

Equine Influenza

We consider an Equine Influenza outbreak in 2003 in race horses from different training yards in Newmarket. In what follows, we use four discrete factors and one continuous factor computed from the observed variables age, sex and training yard:

SameYard: \(1\) if the target and contributor are trained in the same yard, \(0\) otherwise;
SameSex: \(1\) if the target and contributor have the same sex, \(0\) otherwise;
DiffAge: \(0\) if the target and contributor have the same age, \(1\) for a one-year difference and \(2\) for more than one year;
DistYard: geographic distance (in km) between the training yards of the target and the contributor;
TransSex: F \(\to\) F if a female infected another female, M \(\to\) F if a male infected a female, F \(\to\) M if a female infected a male and M \(\to\) M if a male infected another male.

The weight is the probability of transmission between horses.

Some of these factors are missing for some target-contributor pairs. Hence, the tests for assessing the effect of a given factor on the transmission probability are applied on the subset of complete data for this factor. The data set used for this study is available in the package or at this link https://doi.org/10.5281/zenodo.4837560.

Load the data and summarize them.

data(equineDiffFactors)
summary(equineDiffFactors)
#>   ID.source           ID.recep             weight            SameYard        
#>  Length:2256        Length:2256        Min.   :0.0000000   Length:2256       
#>  Class :character   Class :character   1st Qu.:0.0000000   Class :character  
#>  Mode  :character   Mode  :character   Median :0.0000000   Mode  :character  
#>                                        Mean   :0.0208333                     
#>                                        3rd Qu.:0.0004431                     
#>                                        Max.   :1.0000000                     
#>    SameSex            DiffAge             DistYard         TransSex        
#>  Length:2256        Length:2256        Min.   :  0.000   Length:2256       
#>  Class :character   Class :character   1st Qu.:  1.107   Class :character  
#>  Mode  :character   Mode  :character   Median :  2.860   Mode  :character  
#>                                        Mean   : 23.244                     
#>                                        3rd Qu.:  4.419                     
#>                                        Max.   :222.182

Run permutations test

In this example, only \(B=1000\) permutations are used. It is recommended to study the stability of the permutations tests results to choose the best number of permutations.

res_equine = permDT (
  equineDiffFactors,
  ColNameFactor = colnames(equineDiffFactors)[-c(1:3)],
  B = 1000,
  nclust = 1,
  ColNameWeight = "weight",
  ColNameRecep = "ID.recep",
  ColNameSource = "ID.source",
  num_class = "DistYard",
  other_class = colnames(equineDiffFactors)[-c(1:3,7)],
  multiple_test = TRUE
)

Note that all non-numeric factor are specified in the other_class option.

Permutations tests results

kbl(res_equine$dt[,c(1,2)], caption = "ALL") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

ALL
	stat	pv_diff
SameYard	0.0500361	0.000
SameSex	0.0066463	0.042
DiffAge	0.0009605	0.789
DistYard	-0.2230616	0.000
TransSex	0.0419205	0.000

Factors SameYard, SameSex, DistYard and TransSe are significantly correlated to the transmission probability whereas DiffAg is not.

kbl(res_equine$dt_multi$SameYard[,-c(3:4)], caption = "SameYard") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

SameYard
	stat	pv_diff
0 - 1	-444.0667	0

The test statistics of SameYard and DistYard being negative, horses trained in the same yard or in nearby yards have a higher chance to be linked by a transmission. This is a clearly intuitive result certainly due to higher contact rate in shared training areas.

kbl(res_equine$dt_multi$SameSex[,-c(3:4)], caption = "SameSex") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

SameSex
	stat	pv_diff
0 - 1	-24.7739	0.041

The statistics of post-hoc univariate tests for factor SameSex is also negative, which means that the virus better circulates between horses with the same sex.

kbl(res_equine$dt_multi$TransSex[,-c(3:4)], caption = "TransSex") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

TransSex
	stat	pv_diff
F->F - F->M	64.95997	0.002
F->F - M->F	111.12337	0.000
F->F - M->M	80.01511	0.000
F->M - M->F	46.16340	0.002
F->M - M->M	15.05515	0.319
M->F - M->M	-31.10825	0.054

Moreover, the post-hoc tests on the TransSex modalities show that only the difference between F \(\to\) M - M \(\to\) M (and M \(\to\) F - M \(\to\) M when one considers the corrected p-values) are not significant. The results on the p-values and the sign of the statistics show that transmissions between females are favored compared to all other possible combinations (F \(\to\) F transmissions have positive probabilities 1.8 times more than expected under complete randomness.

Performance indicator

First of all, only the factor that are significant is chosen to compute the performance indicator and all the NA has to be removed from the data set.

DT_omitna = na.omit(equineDiffFactors[,c(1:3,4,5,7,8),with=F])
summary(DT_omitna )
#>   ID.source           ID.recep             weight           SameYard        
#>  Length:650         Length:650         Min.   :0.000000   Length:650        
#>  Class :character   Class :character   1st Qu.:0.000000   Class :character  
#>  Mode  :character   Mode  :character   Median :0.000000   Mode  :character  
#>                                        Mean   :0.026929                     
#>                                        3rd Qu.:0.001019                     
#>                                        Max.   :0.982702                     
#>    SameSex             DistYard        TransSex        
#>  Length:650         Min.   :0.0000   Length:650        
#>  Class :character   1st Qu.:0.3463   Class :character  
#>  Mode  :character   Median :3.1471   Mode  :character  
#>                     Mean   :2.6424                     
#>                     3rd Qu.:4.6508                     
#>                     Max.   :7.3520

max_generations and wait_generations are set to low values because the number of factors is low. It is recommended to increase these values id there is no convergence due for example to a higher number of factor.

system.time({I_max = indicator_max(
  DT_omitna,
  ColNameFactor = colnames(DT_omitna)[-c(1:3)],
  ColNameWeight = "weight",
  bounds = c(-10, 10),
  max_generations = 50,
  hard_limit = TRUE,
  wait_generations = 10,
  other_class = colnames(DT_omitna)[-c(1:3,6)]
)})
#>    user  system elapsed 
#>   1.670   0.051   1.721

I_max$indicator
#>           [,1]
#> [1,] 0.2123575

The performance indicator takes the value \(I_{\hat{{\beta}}}(\mathbb{X},Z)=0.21\) using the four selected factors. This relatively low value, which indicates that there is a moderate correlation between the combination of the four factors and the transmissions, can actually be viewed as quite large given the fact that we only consider very basic factors to \textit{predict} the transmissions.

Covid-19

The proba are the mixture probabilities i.e. the similarity between targets (UE countries) and contributors (US and CA states) in terms of mortality dynamics during the first wave of the Covid-19. We consider 29 variables related to economy, demography, health, healthcare system and climate. More precisely, the objective is to identify factors negatively correlated with the response, i.e., the lower the distance between two geographic entities with respect to a given variable, the higher the mixture probability. Consequently, we use the univariate test for continuous factors, which are computed for each target-contributor pair by \(x_k^{(i,j)}= |x_k^i - x_k^j|\). The data set used for this study is available in https://doi.org/10.5281/zenodo.4769671.

Load the data and summarize them.

data(diffFactors)
summary(diffFactors)
#>    ID.recep          ID.source             proba             pop_diff       
#>  Length:483         Length:483         Min.   :0.000000   Min.   :    1760  
#>  Class :character   Class :character   1st Qu.:0.000000   1st Qu.: 2608022  
#>  Mode  :character   Mode  :character   Median :0.000000   Median : 6101159  
#>                                        Mean   :0.006986   Mean   :18246800  
#>                                        3rd Qu.:0.000000   3rd Qu.:30958172  
#>                                        Max.   :0.999995   Max.   :78256584  
#>   density_diff      medianage_diff   urbanpop_diff   hospibed_diff  
#>  Min.   :  0.6199   Min.   : 0.100   Min.   : 0.00   Min.   :0.000  
#>  1st Qu.: 39.8375   1st Qu.: 2.600   1st Qu.: 4.75   1st Qu.:1.100  
#>  Median : 79.7939   Median : 4.400   Median :10.60   Median :2.400  
#>  Mean   :121.0724   Mean   : 4.616   Mean   :12.58   Mean   :2.683  
#>  3rd Qu.:171.3937   3rd Qu.: 6.400   3rd Qu.:18.00   3rd Qu.:4.180  
#>  Max.   :502.6700   Max.   :12.600   Max.   :41.00   Max.   :6.600  
#>   smokers_diff      lung_diff      gdp2019_diff     healthexp_diff 
#>  Min.   : 0.000   Min.   : 0.18   Min.   :   4008   Min.   :   27  
#>  1st Qu.: 5.825   1st Qu.:12.70   1st Qu.: 168604   1st Qu.: 3019  
#>  Median : 9.600   Median :19.45   Median : 390576   Median : 4711  
#>  Mean   :10.717   Mean   :20.17   Mean   : 817334   Mean   : 8599  
#>  3rd Qu.:14.575   3rd Qu.:26.50   3rd Qu.:1257968   3rd Qu.: 6388  
#>  Max.   :31.350   Max.   :47.52   Max.   :3542655   Max.   :70453  
#>  fertility_diff   avghumidity_diff   pop_tot_0_14_diff pop_male_0_14_diff
#>  Min.   :0.0000   Min.   : 0.03333   Min.   :0.05887   Min.   :0.00142   
#>  1st Qu.:0.1100   1st Qu.: 3.40000   1st Qu.:1.59546   1st Qu.:1.63400   
#>  Median :0.2200   Median : 6.11667   Median :3.02960   Median :3.02919   
#>  Mean   :0.2181   Mean   : 7.73571   Mean   :3.05254   Mean   :3.02813   
#>  3rd Qu.:0.3200   3rd Qu.: 9.36667   3rd Qu.:4.30214   3rd Qu.:4.21944   
#>  Max.   :0.6304   Max.   :43.18333   Max.   :8.49047   Max.   :8.29937   
#>  pop_female_0_14_diff pop_tot_15_64_diff pop_male_15_64_diff
#>  Min.   :0.008173     Min.   :0.0179     Min.   :0.002948   
#>  1st Qu.:1.623634     1st Qu.:0.8045     1st Qu.:0.796625   
#>  Median :3.035270     Median :1.5992     Median :1.647397   
#>  Mean   :3.077687     Mean   :1.8639     Mean   :1.897724   
#>  3rd Qu.:4.365154     3rd Qu.:2.7271     3rd Qu.:2.838292   
#>  Max.   :8.790995     Max.   :5.6677     Max.   :6.671684   
#>  pop_female_15_64_diff pop_tot_65_up_diff pop_male_65_up_diff
#>  Min.   :0.00907       Min.   : 0.0126    Min.   :0.01229    
#>  1st Qu.:0.99654       1st Qu.: 2.5069    1st Qu.:1.90787    
#>  Median :1.98855       Median : 4.0096    Median :3.46598    
#>  Mean   :2.16220       Mean   : 4.1311    Mean   :3.53025    
#>  3rd Qu.:3.11756       3rd Qu.: 5.6755    3rd Qu.:4.99559    
#>  Max.   :6.42441       Max.   :10.7818    Max.   :9.56801    
#>  pop_female_65_up_diff pop_male_diff      pop_female_diff   
#>  Min.   : 0.004197     Min.   :0.001009   Min.   :0.001009  
#>  1st Qu.: 3.130266     1st Qu.:0.291030   1st Qu.:0.291030  
#>  Median : 4.702279     Median :0.585758   Median :0.585758  
#>  Mean   : 4.808678     Mean   :0.720016   Mean   :0.720016  
#>  3rd Qu.: 6.564720     3rd Qu.:1.052228   3rd Qu.:1.052228  
#>  Max.   :11.880752     Max.   :2.987863   Max.   :2.987863  
#>  life_expectancy_diff gdp_capita_diff    physicians_per_1K_diff
#>  Min.   :0.004878     Min.   :   92.98   Min.   :0.0040        
#>  1st Qu.:1.329268     1st Qu.:13382.38   1st Qu.:0.4695        
#>  Median :2.534146     Median :28834.82   Median :1.0730        
#>  Mean   :2.802106     Mean   :29834.47   Mean   :1.1275        
#>  3rd Qu.:4.113415     3rd Qu.:43230.40   3rd Qu.:1.6685        
#>  Max.   :7.551220     Max.   :81392.92   Max.   :3.1520        
#>  nurses_per_1K_diff  obesity_diff      tmax_diff           prec_diff      
#>  Min.   : 0.010     Min.   : 0.000   Min.   : 0.003259   Min.   : 0.1023  
#>  1st Qu.: 1.385     1st Qu.: 2.700   1st Qu.: 2.301867   1st Qu.:11.3592  
#>  Median : 2.760     Median : 4.900   Median : 4.947688   Median :23.5209  
#>  Mean   : 3.329     Mean   : 5.265   Mean   : 6.138662   Mean   :25.5225  
#>  3rd Qu.: 4.570     3rd Qu.: 8.000   3rd Qu.: 9.238850   3rd Qu.:37.0643  
#>  Max.   :11.460     Max.   :12.600   Max.   :22.713560   Max.   :79.0801

Run permutations test

In this example, only \(B=1000\) permutations are used. It is recommended to study the stability of the permutations tests results to choose the best number of permutations.

res = permDT (
  diffFactors,
  ColNameFactor = colnames(diffFactors)[-c(1:3)],
  B = 1000,
  nclust = 1,
  ColNameWeight = "proba",
  ColNameRecep = "ID.recep",
  ColNameSource = "ID.source",
  multiple_test = TRUE
)

Permutations tests results

res$stat2 = res$stat
res$stat2[res$pv_neg>0.05] = NA
res$Factor = str_remove(rownames(res),"_diff")
res$Factor = factor(res$Factor, levels = res$Factor)
res$dec = rep(c(0.15,-0.15),50)[1:nrow(res)]
res$dec[res$pv_neg>0.05] = NA

p = ggplot(res,aes(Factor,pv_neg)) + geom_point(colour = "black") + 
  ylim(-0.1,1) +
  xlab("") + 
  ylab("P-value") +
  theme_classic() +
  geom_rangeframe() + 
  geom_hline(yintercept = 0.05, color ="red") +
  #geom_text_repel(data = res, aes(label = round(stat2,2)),
  #                         point.padding = 1)  +
  geom_label_repel(data = res, aes(label = round(stat2,2)), fill = "white", 
                   nudge_y = na.omit(res$dec) )+
  ggtitle("Permutation test results")

p + theme(axis.text.x = element_text(angle = 45, hjust = 1),
          plot.title = element_text(hjust = 0.5))

plot of chunk unnamed-chunk-13

The figure shows the p-values obtained for each factor and the Spearman's correlation for significant factors. We identified eleven impacting factors: hospibed, smokers, lung, healthexp, gdp_capita, fertility, urbanpop, nurses_per_1K, gdp2019, pop_female_0_14, and pop_tot_0_14. The figure shows that Spearman's correlation is negative for significant factors. This result is consistent with our objective (to identify the significant factors negatively correlated to the response).

Performance indicator

Only the factor that are significant is chosen to compute the performance indicator. The computation time it is quite long (around seconds).

ind_fact_select = which(colnames(diffFactors) %in%  rownames(res[res$pv_neg<0.05,]) )

system.time({I_max = indicator_max(
  diffFactors,
  ColNameFactor = colnames(diffFactors)[c(ind_fact_select)],
  ColNameWeight = "proba",
  bounds = c(-10, 10),
  max_generations = 200,
  hard_limit = TRUE,
  wait_generations = 50
)})

I_max$indicator

Then, we applied the multivariate analysis based on the eleven significant factors. The performance indicator is equal to \(I_{\hat{{\beta}}}(\mathbb{X},Z)=0.73\), which shows that a high monotonous dependency exists between the mixture probabilities and these factors.