Abstract
Association measures can be local or global. Local association measures quantify the association between specific values of random variables (e.g. chi-squared residuals). Global association measures yield a single value used to summarize the association for all values taken by random variables (e.g. chi-squared). Classical data analysis has focused on global association and overlooked local association. Consequently, software presently available only allows computation of global association measures. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association.zebu
R package allows estimation of local association measures and implements local association subgroup analysis. It is of interest to a wide range of scientific disciplines such as health and computer sciences and can be used by anyone with a basic knowledge of the R language. It is available in the CRAN and its source code is available at https://github.com/oliviermfmartin/zebu.Association measures can be local or global (Van de Cruys 2011). Local association measures quantify the association between specific values of random variables. In the case of a contingency table, they yield one value for each cell. An example is chi-squared residuals that are computed when constructing a chi-squared test. On the other hand, global association measures yield a single value used to summarize the association for all values taken by random variables. An example is the chi-squared statistic, the sum of squared residuals (Sheskin 2007).
Most often, we are only concerned with the global association and overlook local association. For example, analysis of chi-squared residuals is uncommon practice when compared to the chi-square independence test. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association (Anselin 1995). Accordingly, analysis of association should not limit itself to the global perspective. Indeed, the association between two variables can depend on their values. For example, in threshold mechanisms, variables are only associated with each other when one takes values above a certain critical level. In this case, local association measures allow pinpointing values for which variables are associated. Moreover, the existence of an association between two variables may depend on the value of a third variable. For example, the effect of a drug will depend on the patient’s sensibility to the drug. The local association between drug intake and recovery will not be the same for patients that are sensitive than for those that are resistant to the drug.
The rest of the vignette is organized as follows. We first give the reader the necessary intuition and mathematical background about global and local associations. This leads to the description of chi-square residuals, Lewontin’s \(D\) (Lewontin 1964), Ducher’s \(Z\) (Ducher et al. 1994), and pointwise mutual information (Van de Cruys 2011). We also introduce a multivariate and normalized measure of local association. Subsequently, we illustrate the usage of local association measures using the zebu
R package. The vignette ends with a discussion about future development and research.
Throughout the vignette, we will suppose that all random variables are discrete and write them in capital letters, such as \(A\) and \(B\). Lower letters, such as \(a\) and \(b\), will denote possible values taken by these random variables (i.e. events).
One way to think about a statistical association is as events co-occurring. For example, if event \(a\) always occurs with event \(b\), then these events are said to be associated. An intuitive measure of association could be the joint probability: \(p(a, b)\), the long-term frequency of events showing up together. However, this measure fails if \(a\) or \(b\) is a rare event. Indeed, joint probabilities are always as small as its individual events are rare: \(p(a, b) \leq \min p(a), p(b)\). As a consequence, it is necessary to compare observed probabilities \(p(a, b)\) to expected probabilities in which the variables are considered independent. The expected probability, if events are independent, is the factor of marginalized probabilities of events: \(p(a) \, p(b)\). Independence is then defined by the following mathematical relation, \(p(a, b) = p(a) \, p(b)\), and local association measures are defined to be equal to zero.
Independence implies that knowing one or more variables does not give us any information about the others. This is what we are not interested in. It is, however, possible to define two cases where the former equality does not hold: co-occurrence and mutual exclusivity. Co-occurrence is defined as events showing up more often than expected: \(p(a, b) > p(a) \, p(b)\) and local association measures are positive. Mutual exclusivity is defined as events showing up less often than expected: \(p(a, b) < p(a) \, p(b)\) and local association measures are negative.
Statistical independence is, however, not the only manner to construct an association measure. Other possibilities are based on the proportion of explained variance such as Pearson’s r. These former measures are parametric and suppose linear or at least monotone relationships between variables. Although intuitive and convenient, this assumption is not always justified. Measures based on statistical independence provide a non-parametric alternative that can detect non-linear relationships.
For two random variables, \(A\) and \(B\), we can estimate the local association for each combination of events \(A = a\) and \(B = b\). This is accomplished by comparing the observed from the expected probability of events \(a\) and \(b\). If these probabilities are equal, then events \(a\) and \(b\) are independent. If not, these events are associated; the sign of the measure indicates the orientation of the relationship, and the absolute value indicates its strength.
There are different measures to compare observed and expected probabilities, for example, by using subtraction and division. Hereunder, we define the difference or Lewontin’s \(D\) (Lewontin 1964) and the pointwise mutual information \(pmi\) (Van de Cruys 2011). To simplify notation, and to show similarities between local association measures, we define \(h(a) = - \log p(a)\) as the self-information of \(a\).
\[ \begin{aligned} D(a, b) & = p(a, b) - p(a) \, p(b) \\ pmi(a, b) & = \log \frac{p(a, b)} {p(a) p(b)} = - (h(a, b) - h(a) - h(b)) \end{aligned} \]
The bounds of these two measures depend on the frequency of events, which makes it difficult to compare local association values for different combinations of events. For this reason, it is desirable to express association relative to the frequency of events. One common way to do this is using chi-squared residuals \(r_{\chi}\) as follows where \(N\) is the sample size.
\[ r_{\chi}(a,b) = \sqrt{N} \; \frac{p(a, b) - p(a) \, p(b)}{\sqrt{p(a) \, p(b)}} \]
We may wish to normalize local association so that values are between -1 and 1 included. This can be done by using dividing the non-normalized values by their minimal or maximal values. To identify the theoretical minimal and maximal values of \(D\), we will find the bounds of the observed bivariate probability \(p(a, b)\) as a function of the marginal probabilities \(p(a)\) and \(p(b)\).
Using the inclusion-exclusion principle, we know that:
\[ p(a, b) = p(a \cap b) = p(a) + p(b) - p(a \cup b) \]
The intersection probability \(p(a, b)\) will be maximized when the union probability \(p(a \cup b)\) is equal to zero and minimized when the union probability will be equal to one.
\[ p(a) + p(b) - 1 \le p(a, b) \le p(a) + p(b) \]
Given the intersection probability can not be smaller than zero and can not be larger than the smallest marginal probability, we have:
\[ \max[0, \, p(a) + p(b) - 1] \le p(a, b) \le min[p(a), \, p(b)] \]
Using this result, we can divide \(D\) by its theoretical minimal or maximal value which leads to Lewontin’s \(D'\) (Lewontin 1964). However, this removes the sign of the association. To preserve the sign, in the case where Lewontin’s \(D\) is negative, we divide it by the negative theoretical minimal value. We call this measure Ducher’s \(Z\). It should however be noted that the original definition of Ducher’s \(Z\) did not probably consider the lower bound of the intersection probability as we do here (Ducher et al. 1994).
\[ Z(a, b) = \begin{cases} \frac{ p(a, b) - p(a) \, p(b) }{ \min[p(a), \, p(b)] - p(a) \, p(b) } & D(a, b) > 0 \\ \\ \frac{ p(a, b) - p(a) \, p(b) }{p(a) p(b) - \max[0, \, p(a) + p(b) - 1]} & D(a, b) < 0 \\ \\ 0 & D(a, b) = 0 \end{cases} \]
Normalization of case of \(pmi\) is more subtle because \(pmi(a, b)\) tends to \(\infty\) when \(p(a, b)\) tends to 0. Nonetheless, dividing \(pmi(a, b)\) by \(- h(a, b)\) solves this problem by making \(npmi(a, b)\) tend to -1 when \(p(a, b)\) tends to 0 and equal to 1 when \(p(a, b) = \min[p(a), p(b)]\) (Bouma 2009).
\[ npmi(a, b) = \frac{ pmi(a, b) }{- h(a, b) } = \frac{ h(a, b) - h(a) - h(b) }{ h(a, b) } \]
The zebu
package includes a function called lassie
allowing estimation of Lewontin’s \(D\), Ducher’s \(Z\), \(pmi\), \(npmi\), and \(r_{\chi}\).
Global association measures yield a single value used to summarize the association for all values taken by the random variables. For example, mutual information is computed as the sum for all events of their observed probability times their pointwise mutual information. Most global association measures in zebu
are defined likewise.
\[ \begin{aligned} GD(A, B) &= \sum_{a, b} p(a, b) D(a, b) \\ GZ(A, B) &= \sum_{a, b} p(a, b) Z(a, b) \\ MI(A, B) &= \sum_{a, b} p(a, b) pmi(a, b) \\ NMI(A, B) &= \sum_{a, b} p(a, b) npmi(a, b) \\ \end{aligned} \]
The global association measure related to chi-squared residuals is the chi-squared \(\chi^2\). It is defined as the sum of its squared residuals.
\[ \chi^2 = \sum_{a, b} r_{\chi}(a,b)^2 \]
Distinguishing the strength of association from its statistical significance is important. Indeed, a strong association can be non-significant (e.g. some physical law with small sample size) and a weak association can be significant (e.g. epidemiological risk factor with big sample size). Significance can be accessed using p-values estimated using the theoretical null distribution or by resampling techniques (Sheskin 2007). Because the theoretical null distribution of local association measures is unknown, the zebu
package resorts to estimating p-values by a permutation test. This can be undertaken using the permtest
function of the package.
The null hypothesis \(H_0\) being tested is that the association measure \(L\) is equal to 0, that is, there is no association. The observed association is \(L_{obs}\) and the permuted associations are denoted by the set \(L_{perm}\). Moreover, we write \(\#(\ldots)\) as the number of times and \(|\ldots|\) as the absolute value. The two-sided p-value can then be estimated as follows.
\[ p = \frac{\#(|L_{obs}| < |L_{perm}|)}{\#(L_{perm})} \]
For local association measures, this results in conducting a series of statistical tests. It is thus advised to apply multiple testing corrections, such as the one advocated by Benjamini-Hochberg.
Multivariate association measures may help identify complex association relationships that cannot be detected only with bivariate association measures. For example, in the XOR gate, the output of the gate is not associated with any of the two inputs individually (Jakulin and Bratko 2003). The association is only revealed when the two inputs and the output are taken together.
To derive multivariate forms of these local association measures, we assume that events are mutually independent. This means that for \(M\) random variables \(X_1, \ldots, X_M\), independence is defined by: \(p(x_1, \ldots, x_M) = \prod_{i=1}^{M} p(x_i)\). We can thus define the following measures.
\[ \begin{aligned} D(x_1, \ldots, x_M) & = p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) \\ pmi(x_1, \ldots, x_M) & = - [h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) ] \end{aligned} \]
By dividing \(D(x_1, \ldots, x_M)\) by the expected probability we obtain multivariate chi-squared residuals
\[ r_{\chi}(x_1, \ldots, x_M) = \sqrt{N} \; \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \sqrt{\prod_{i=1}^{M} p(x_i)} } \]
To obtain a multivariate measure of Ducher’s \(Z\), we need to find the bounds of the observed probability \(p(x_1, \ldots, x_M)\). We know that the upper bound will be the minimal marginal probability. Additionally, we find a formula to express the lower bound (proof at the end of the section). This leads to the following bounds.
\[ \max[0, -M - 1 + \sum_{i=1}^M p(x_i)] \le p(x_1, \ldots, x_M) \le \min[x_1, \ldots, x_M]$. \]
We thus propose the following multivariate form of Ducher’s \(Z\).
\[ Z(x_1, \ldots, x_M) = \begin{cases} \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \min[p(x_1), \ldots, p(x_M)] - \prod_{i=1}^{M} p(x_i) } & D(x_1, \ldots, x_M) > 0 \\ \\ \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{\prod_{i=1}^{M} p(x_i)- \max[0, - M - 1 + \sum_{i=1}^M p(x_i)]} & D(x_1, \ldots, x_M) < 0 \\ \\ 0 & D(x_1, \ldots, x_M) = 0 \end{cases} \]
For pointwise mutual information, the normalization technique suggested by Bouma (2009) is not bounded by 1 for more than two variables. To solve this, we suggest the following normalization scheme which we call \(npmi_2\).
\[ npmi_2(x_1, \ldots, x_M) = \begin{cases} \frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{ \min[h(x_1), \ldots, h(x_M)] - \sum_{i=1}^{M} h(x_i) } & pmi(x_1, \ldots, x_M) > 0 \\ \\ \frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{h(x_1, \ldots, x_M)} & pmi(x_1, \ldots, x_M) < 0 \\ \\ 0 & pmi(x_1, \ldots, x_M) = 0 \end{cases} \]
Using induction and the inclusion-exclusion principle we give a formula for the lower bound of the observed intersection probability of \(M\) events.
\[ \min[ p(x_1, \ldots, x_M) ] = \max[0, -M - 1 + \sum_{i=1}^M p(x_i)] \]
We first show that this is true for the base case \(M=2\). In this case, \(p(x_1) + p(x_2) -1 \le p(x_1, x_2)\). We proved this using the inclusion-exclusion principle in the section where we derive a bivariate form of Ducher’s Z.
We now show that the induction step is true: let’s assume that this formula is true for \(M\) variables. For \(M+1\) variables, the inclusion-exclusion principle tells us that:
\[ \begin{align} p(x_1, \ldots, x_{M+1}) &= p(\{\cap_{i=1}^M x_i \} \cap x_{M+1}) \\ &= p(\cap_{i=1}^M x_i) + p(x_{M+1}) - p(\{\cap_{i=1}^M x_M \} \cup x_{M+1}) \\ \end{align} \]
We assumed that the lower bound \(p(\cap_{i=1}^M x_i)\) is \(-M - 1 + \sum_{i=1}^M p(x_i)\) and we know that the upper bound of \(p(\{\cap_{i=1}^M x_M \} \cup x_{M+1})\) is one. Replacing these values in the last line leads to
\[ \min[ p(x_1, \ldots, x_{M+1}) ] = -(M+1) - 1 + \sum_{i=1}^{M+1} p(x_i) \]
However, given that probabilities can not be lower than zero, we can write the following equation which completes the proof.
\[ \min[ p(x_1, \ldots, x_{M+1}) ] = \max[0, -(M+1) - 1 + \sum_{i=1}^{M+1} p(x_i)] \]
Once R is installed, the first step is to install the zebu
package. You can install the released version from CRAN
install.packages("zebu")
We can then load the zebu
R package.
library(zebu)
To illustrate local association measures and the zebu
package, we’ll simulate a small culinary dataset.
Each row corresponds to a client of a restaurant. We record the choices made by the client. There are three choices for each of the plates: starters, main dish, and dessert.
The clients take the starter with equal probability. The choice of the following dish depends only on the previous dish. The clients tend to avoid a dish with an ingredient that had in the previous dish. We define these probabilities hereunder.
starter_prob <- c("Tomato Mozzarella Salad" = 1/3, "Rice Tuna Salad" = 1/3, "Lentil Salad" = 1/3)
starter_prob
Tomato Mozzarella Salad Rice Tuna Salad Lentil Salad
0.3333333 0.3333333 0.3333333
main_given_starter_prob <- matrix(c(5/11, 1/11, 5/11,
5/11, 5/11, 1/10,
1/11, 5/11, 5/11),
3, 3, byrow = TRUE)
rownames(main_given_starter_prob) <- names(starter_prob)
colnames(main_given_starter_prob) <- c("Sausage and Lentil Stew", "Pizza Margherita", "Pilaf Rice")
main_given_starter_prob
Sausage and Lentil Stew Pizza Margherita Pilaf Rice
Tomato Mozzarella Salad 0.45454545 0.09090909 0.4545455
Rice Tuna Salad 0.45454545 0.45454545 0.1000000
Lentil Salad 0.09090909 0.45454545 0.4545455
dessert_given_main <- matrix(c(2/6, 2/6, 2/6,
7/12, 1/12, 2/6,
1/12, 7/12, 2/6),
3, 3, byrow = TRUE)
rownames(dessert_given_main) <- colnames(main_given_starter_prob)
colnames(dessert_given_main) <- c("Rice Pudding", "Apple Pie", "Fruit Salad")
dessert_given_main
Rice Pudding Apple Pie Fruit Salad
Sausage and Lentil Stew 0.33333333 0.33333333 0.3333333
Pizza Margherita 0.58333333 0.08333333 0.3333333
Pilaf Rice 0.08333333 0.58333333 0.3333333
We now simulate a dataset of 1000 clients given these probabilities.
set.seed(0)
sample_size <- 1000
df <- t(sapply(seq_len(sample_size), function(i) {
starter <- sample(names(starter_prob), size = 1, prob = starter_prob)
main <- sample(colnames(main_given_starter_prob), size = 1, prob = main_given_starter_prob[starter, ])
dessert <- sample(colnames(dessert_given_main), size = 1, prob = dessert_given_main[main, ])
c(Starter = starter, Main = main, Dessert = dessert)
}))
df <- as.data.frame(df)
head(df)
Starter Main Dessert
1 Tomato Mozzarella Salad Sausage and Lentil Stew Fruit Salad
2 Lentil Salad Pilaf Rice Apple Pie
3 Tomato Mozzarella Salad Pizza Margherita Fruit Salad
4 Lentil Salad Pizza Margherita Rice Pudding
5 Rice Tuna Salad Sausage and Lentil Stew Fruit Salad
6 Tomato Mozzarella Salad Pilaf Rice Fruit Salad
We look at the contingency table.
table(df)
, , Dessert = Apple Pie
Main
Starter Pilaf Rice Pizza Margherita Sausage and Lentil Stew
Lentil Salad 77 17 13
Rice Tuna Salad 14 14 44
Tomato Mozzarella Salad 81 2 47
, , Dessert = Fruit Salad
Main
Starter Pilaf Rice Pizza Margherita Sausage and Lentil Stew
Lentil Salad 54 48 17
Rice Tuna Salad 7 49 47
Tomato Mozzarella Salad 39 11 58
, , Dessert = Rice Pudding
Main
Starter Pilaf Rice Pizza Margherita Sausage and Lentil Stew
Lentil Salad 11 95 15
Rice Tuna Salad 3 101 52
Tomato Mozzarella Salad 13 12 59
The local (and global) association between these variables can be estimated using the lassie
function. This function takes at least one argument: a data.frame
. Columns are selected using the select
arguments (column names or numbers). Variables are assumed to be categorical; continuous variables have to be specified using the continuous
argument and the number of discretization bins with the breaks
argument (as in the cut
function). The local association measure that we use here is Ducher’s Z as specified by setting the measure
argument equal to "z"
. We’ll look at the relationship between the main dish and the dessert.
las <- lassie(df, select = c("Main", "Dessert"), measure = "z")
The permtest
function accesses the significance of local (and global) association using a permutation test. The number of iterations is specified by nb
and the adjustment method of p-values for multiple comparisons by p_adjust
(as in the p.adjust
function).
las <- permtest(las,
nb = 5000,
p_adjust = "BH")
The lassie
and permtest
functions return a lassie
S3 object, as well as permtest
for permtest
. lassie
objects can be visualized using the plot
and print
methods. The plot
function returns a heatmap with the local association and p-values displayed between parenthesis.
print(las)
Measure: Ducher's Z
Global: 0.0912667026105455 (p-value: <1/5000)
Dessert
Main Apple Pie Fruit Salad Rice Pudding
Pilaf Rice 0.38531235 0.006639046 -0.749858716
Pizza Margherita -0.69399394 -0.062255796 0.367744192
Sausage and Lentil Stew -0.04383642 0.027310138 -0.008436162
plot(las)
Results can be saved in CSV format using write.lassie
. To access the documentation of these functions, please type help("print.lassie")
, help("plot.lassie")
and help(write.lassie)
in the R console.
In this example, we can see that the global association between the two variables is statistically significant. However, we see that there is only a local association between these two variables only for certain combinations. More specifically, knowing that a client took the lentils and sausages does not convey information about what dessert s/he will take, and likewise for the fruit salad. A significant global association can hide a non-significant local association.
The number of variables that can be handled in the zebu
package is not limited. We can use the lassie
function to access the association between the three simulated variables.
In this case, we obtain a multidimensional local association array
. Because of this, results cannot be plotted as a tile plot; the plot
method is not available. The print
method allows visualizing results by melting the array
into a data.frame
, by default sorted by decreasing local association.
las2 <- lassie(df, measure = "z")
las2 <- permtest(las2, nb = 5000)
print(las2, what_sort = "local_p", decreasing = FALSE)
Measure: Ducher's Z
Global: -0.00796166078557106 (p-value: 1)
Starter Main Dessert local
4 Lentil Salad Pizza Margherita Apple Pie -0.54570753
5 Rice Tuna Salad Pizza Margherita Apple Pie -0.60779228
6 Tomato Mozzarella Salad Pizza Margherita Apple Pie -0.94240428
7 Lentil Salad Sausage and Lentil Stew Apple Pie -0.65556067
11 Rice Tuna Salad Pilaf Rice Fruit Salad -0.78566904
15 Tomato Mozzarella Salad Pizza Margherita Fruit Salad -0.70338204
16 Lentil Salad Sausage and Lentil Stew Fruit Salad -0.57824247
19 Lentil Salad Pilaf Rice Rice Pudding -0.70631293
20 Rice Tuna Salad Pilaf Rice Rice Pudding -0.91603179
21 Tomato Mozzarella Salad Pilaf Rice Rice Pudding -0.62596772
24 Tomato Mozzarella Salad Pizza Margherita Rice Pudding -0.70420369
25 Lentil Salad Sausage and Lentil Stew Rice Pudding -0.65981754
2 Rice Tuna Salad Pilaf Rice Apple Pie -0.54220571
23 Rice Tuna Salad Pizza Margherita Rice Pudding 0.20497105
22 Lentil Salad Pizza Margherita Rice Pudding 0.16908965
1 Lentil Salad Pilaf Rice Apple Pie 0.16835345
3 Tomato Mozzarella Salad Pilaf Rice Apple Pie 0.19034384
10 Lentil Salad Pilaf Rice Fruit Salad 0.07463891
18 Tomato Mozzarella Salad Sausage and Lentil Stew Fruit Salad 0.07237082
27 Tomato Mozzarella Salad Sausage and Lentil Stew Rice Pudding 0.06433270
8 Rice Tuna Salad Sausage and Lentil Stew Apple Pie 0.02929618
9 Tomato Mozzarella Salad Sausage and Lentil Stew Apple Pie 0.04371432
12 Tomato Mozzarella Salad Pilaf Rice Fruit Salad 0.02704901
13 Lentil Salad Pizza Margherita Fruit Salad 0.02770694
14 Rice Tuna Salad Pizza Margherita Fruit Salad 0.03727140
17 Rice Tuna Salad Sausage and Lentil Stew Fruit Salad 0.02932948
26 Rice Tuna Salad Sausage and Lentil Stew Rice Pudding 0.03439883
obs exp local_p
4 0.017 0.03742083 <1/5000
5 0.014 0.03569537 <1/5000
6 0.002 0.03472480 <1/5000
7 0.013 0.03774250 <1/5000
11 0.007 0.03265977 <1/5000
15 0.011 0.03708474 <1/5000
16 0.017 0.04030752 <1/5000
19 0.011 0.03745483 <1/5000
20 0.003 0.03572781 <1/5000
21 0.013 0.03475636 <1/5000
24 0.012 0.04056846 <1/5000
25 0.015 0.04409398 <1/5000
2 0.014 0.03058142 0.000415384615384615
23 0.101 0.04170236 0.123042857142857
22 0.095 0.04371818 0.1854
1 0.077 0.03205968 0.210282352941176
3 0.081 0.02974990 0.210282352941176
10 0.054 0.03423849 0.436547368421053
18 0.058 0.03740352 0.436547368421053
27 0.059 0.04091718 0.45009
8 0.044 0.03600221 0.4788
9 0.047 0.03502330 0.4788
12 0.039 0.03177174 0.4788
13 0.048 0.03996399 0.4788
14 0.049 0.03812127 0.4788
17 0.047 0.03844896 0.4788
26 0.052 0.04206083 0.4788
In this case, we see that there is no significant global association. However, we see that for certain combinations of variables, there is a significant local association. A non-significant global association can hide a significant local association.
Local association measures are issued from empirical research. Although these have proven their interest in diverse applications, theoretical studies of their mathematical properties are sparse. A more theoretical approach to these measures could be of interest. For example, by determining the theoretical null distribution of these measures. Also, we have assumed mutual exclusivity of events for the multivariate association measures. This assumption may be too stringent for certain variables and usage of other independence models such as conditional independence may prove to be worthwhile.
In zebu
, discretization is a necessary step for studying continuous variables. We have restrained ourselves to simple discretization methods: equal-width and user-defined. Other discretization algorithms exist (Dash, Paramguru, and Dash 2011) and may be more adapted for the computation of association measures. Moreover, kernel methods could also be used to handle continuous variables better. Secondly, estimation of probabilities is done from the frequentist maximum-likelihood procedure which requires sufficiently large datasets. Unfortunately, in fields such as health sciences, datasets are sparse. Bayesian estimation methods are more robust to small sample sizes by not relying on asymptomatic assumptions and by allowing integration of prior knowledge (Wilkinson 2007). Such an implementation may also prove to be of interest.