zebu: Local Association Measures

Abstract

Association measures can be local or global. Local association measures quantify the association between specific values of random variables (e.g. chi-square residuals). Global association measures yield a single value used to summarize the association for all values taken by random variables (e.g. chi-square). Classical data analysis has focused on global association and overlooked local association. Consequently, software presently available only allows computation of global association measures. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association. Moreover, local association measures can be used as a criterion for subgroup analysis. This analysis allows testing if the strength of association between variables depends on the value taken by another variable.
The zebu R package allows estimation of local association measures and implements local association subgroup analysis, and thus, fills an unmet need. It is of interest to a wide range of scientific disciplines such as health and computer sciences and can be used by anyone with a basic knowledge of the R language. It is available on the CRAB and its source code is available at http://github.com/oliviermfmartin/zebu.
Keywords: measure of association, statistical independence, local association, local association subgroup analysis, pointwise mutual information, Ducher’s Z

Summary

Introduction
Background on Association and Independence
Local Association Measures
Local Association Subgroup Analysis
- Introductory Example
- A More Formal Definition
User’s Guide - An Example with Simulated Data: Drug Resistance
Future Research and Development
Competing Interests
Authors’ Contributions
Acknowledgements
References

Introduction

Science is concerned with the explanation of phenomena. This involves establishing causal relationships between variables. However, the underlying causes are not directly observable. To reveal them, one must conduct carefully planned experiments to observe the consequences of causal processes: complicated patterns of independence and association between variables. Although association (i.e. correlation) is not a sufficient condition to establish causation, it can be used as a guide for further investigation. Indeed, causation always implies some pattern of association (Shipley 2000). For this reason, the nature and strength of associations have to be thoroughly described and measured. Accordingly, numerous measures of association, such as Pearson’s r and the chi-square, have been described in the literature (Pecina 2008).

Association measures can be local or global (Van de Cruys 2011). Local association measures quantify the association for specific values of random variables. In the case of a contingency table, they yield one value for each cell. An example is chi-square residuals that are computed when constructing a chi-square test. On the other hand, global association measures yield a single value used to summarize the association for all values taken by random variables. An example is the chi-square statistic, the sum of squared chi-square residuals (Sheskin 2007).

Most often, we are only concerned with the global association and overlook local association. For example, analysis of chi-square residuals is uncommon practice when compared to the chi-square independence test. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association (Anselin 1995). Accordingly, analysis of the association should not limit itself with the global perspective. Indeed, the association between two variables can depend on their values. For example, in threshold mechanisms, variables are only associated with each other when one takes values above a certain critical level. In this case, local association measures allow pinpointing values for which variables are associated. Moreover, the existence of an association between two variables may depend on the value of a third variable. For example, the effect of a drug will depend on the patient’s sensibility to the drug. The local association between drug intake and recovery will not be the same for patients that are sensitive then to those that are resistant to the drug. They form two different local association subgroups. Comparison of these subgroups in terms of other variables may help explain their differences. We will refer to this procedure as local association subgroup analysis.

Local association measures are uncommonly used in the scientific literature, in terms of the number of articles. Nonetheless, the diversity of fields interested in these measures demonstrates their significance. Indeed, applications are found in computational linguistics (Van de Cruys 2011), image processing (Isola et al. 2014) and health sciences such as cardiology (Michel Ducher et al. 2003) (Sapoznikov, Dranitzki Elhalel, and Rubinger 2013) and toxicology (Ducher et al. 2001). Notwithstanding, software presently available only allows computation of global association measures. This is why we have developed the zebu R package described in this paper. It is available on the CRAN and its source code is available at http://github.com/oliviermfmartin/zebu.

The rest of the paper is organized as following. We first give the reader the necessary intuition and mathematical background about global and local association. This leads to the description of Ducher’s Z (Ducher et al. 1994) and pointwise mutual information (Van de Cruys 2011). We introduce multivariate forms of these measures and suggest a normalization scheme for pointwise mutual information. We then present local association subgroup analysis. Subsequently, we illustrate the usage of local association measures and local subgroup analysis using the zebu R package. This will be undertaken using an example with simulated data about drug resistance. The paper ends with a discussion about future development and research.

Background on Association and Independence

Throughout the paper, we will suppose that all random variables are discrete and write them in capital letters such as $A$ and $B$. Possible values of these random variables (i.e. events) will be written in lower letters such as $a$ and $b$.

One way to think about statistical association is as events co-occurring. For example, if event $a$ always occurs with event $b$, then these events are said to be associated. An intuitive measure of association could be the joint probability: $p(a, b)$, the long-term frequency of events showing up together. However, this measure fails if $a$ or $b$ is a rare event. Indeed, joint probabilities are always as small as its individual events are rare: $p(a, b) \leq \min p(a), p(b)$. As a consequence, it is necessary to compare observed probabilities $p(a, b)$ to expected probabilities in which the variables are considered independent. The expected probability, if events are independent, is the factor of marginalized probabilities of events: $p(a) p(b)$. Independence is then defined by the following mathematical relation, $p(a, b) = p(a) p(b)$ and local association measures are defined to be equal to zero.

Independence implies that knowing one or more variables does not give us any information about the others. This is exactly what we are not interested in. It is, however, possible define two interesting cases where the former equality does not hold: co-occurrence and mutual exclusivity. Co-occurrence, or positive association, is defined as events showing up more often than expected: $p(a, b) > p(a) p(b)$ and local association measures are positive. Mutual exclusivity, or negative association, is defined as events showing up less often than expected: $p(a, b) < p(a) p(b)$ and local association measures are negative.

It should, however, be noted that statistical independence is not the only manner to construct an association measure. Other possibilities are based on the proportion of explained variance such as Pearson’s r. These former measures are parametric and suppose linear or at least monotone relationships between variables. Although intuitive and convenient, this assumption is not always justified. Measures based on statistical independence provide a non-parametric alternative that can detect non-linear relationships.

Local Association Measures

Derivation of Bivariate Forms

For each combination of events $a$ and $b$, their local association can be estimated. This is accomplished by comparing the observed from the expected probability of $a$ and $b$. If these probabilities are equal, then events $a$ and $b$ are independent. If not, these events are associated; the sign of the measure indicates the orientation of the relationship, and the absolute value indicates its strength. There are different measures to compare observed and expected probabilities, for example, by using subtraction and division. Here under, we define the difference noted $dif$ and the pointwise mutual information noted $pmi$ (Van de Cruys 2011). To simplify notation, and to show similarities between local association measures, we define $h(a) = - \log p(a)$ as the self-information of $a$.

\[ \begin{aligned} dif(a, b) & = p(a, b) - p(a) p(b) \\ pmi(a, b) & = \log \frac{p(a, b)} {p(a) p(b)} = - (h(a, b) - h(a) - h(b)) \end{aligned} \]

The bounds of these measures are dependent on the marginal probabilities: $p(a)$ and $p(b)$. In particular, they are dependent with the minimal marginal probability $\min p(a), p(b)$ because $p(a, b) \leq \min p(a), p(b)$. This makes it difficult to compare values for different combinations of events. In that respect, it is desirable to normalize these measures so that they only take values between -1 and 1 included. This can be solved by using dividing the non-normalized values by their minimal or maximal values. Let us first identify the minimal and maximal values of $dif$ and $pmi$.

The bounds of the observed probability $p(a, b)$ are $[0, \min p(a), p(b)]$. This means that $dif$ and $pmi$ are minimized when $p(a, b) = 0$.

\[ \begin{aligned} \min dif(a, b) & = - p(a) p(b) \\ \min pmi(a, b) & = \lim_{p(a, b) \to 0} pmi(a, b) = -\infty \end{aligned} \]

Similarly, $dif$ and $pmi$ are maximized when $p(a, b) = \min p(a), p(b)$.

\[ \begin{aligned} \max dif(a, b) & = \min(p(a), p(b)) - p(a) p(b) \\ \max pmi(a, b) & = \log \frac{\min p(a), p(b)}{p(a) p(b)} = - (\min(h(a), h(b)) - h(a) - h(b)) \end{aligned} \]

By dividing by maximal and minimal values, we can normalize $dif$. We will refer to the normalized $dif$ by the capital $Z$ because it corresponds to Ducher’s $Z$ (Ducher et al. 1994).

\[ Z(a, b) = \begin{cases} \frac{ dif(a, b) }{ \max z(a, b) } = \frac{ p(a, b) - p(a) p(b) }{ \min(p(a), p(b)) - p(a) p(b) } & dif(a, b) > 0 \\ \\ \frac{ dif(a, b) }{ - \min dif(a, b) } = \frac{ p(a, b) - p(a) p(b) }{ p(a) p(b) } & dif(a, b) < 0 \\ \\ 0 & dif(a, b) = 0 \end{cases} \]

A normalization scheme for $pmi$ has already been suggested by Bouma (2009). Nonetheless, it is easy to show that this scheme does not hold for more than two variables. Accordingly, we suggest using the normalization scheme used for Ducher’s Z so that it holds in the multivariate case. Normalization of the negative case of $pmi$ is more subtle because $pmi(a, b)$ tends to $\infty$ when $pi(a, b)$ tends to 0. Nonetheless, dividing $pmi(a, b)$ by $- h(a, b)$ solves this problem by making $npmi(a, b)$ tend to -1 when $p(a, b)$ tends to 0.

\[ npmi(a, b) = \begin{cases} \frac{pmi(a, b)}{\max pmi(a, b)} = \frac{ h(a, b) - h(a) - h(b) }{ \min(h(a), h(b)) - h(a) - h(b) } & pmi(a, b) > 0 \\ \\ \frac{ pmi(a, b) }{- h(a, b) } = \frac{ h(a, b) - h(a) - h(b) }{ h(a, b) } & pmi(a, b) < 0 \\ \\ 0 & pmi(a, b) = 0 \end{cases} \]

In the zebu package, it is possible to estimate Ducher’s $Z$, $pmi$ and $npmi$ using the lassie function.

Global Association

Global association measures yield a single value used to summarize the association for all values taken by the random variables. For example, mutual information is computed as the sum for all events of their observed probability times their pointwise mutual information. All global association measures in zebu are defined likewise.

\[ \begin{aligned} gZ(A, B) &= \sum_{a, b} p(a, b) z(a, b) \\ MI(A, B) &= \sum_{a, b} p(a, b) pmi(a, b) \\ NMI(A, B) &= \sum_{a, b} p(a, b) npmi(a, b) \\ \end{aligned} \]

Permutation Test

It is important to distinguish the strength of association from its statistical significance. Indeed, a strong association can be non-significant (e.g. some physical law with small sample size) and a weak association can be significant (e.g. epidemiological risk factor with big sample size). Significance can be accessed using p-values estimated using the theoretical null distribution or by resampling techniques (Sheskin 2007). Because the theoretical null distributions of local association measures is unknown, the zebu package resorts to estimating p-values by a permutation test. This can be accessed using the permtest function.

Derivation of Multivariate Forms

To derive multivariate forms of these local association measures we assume that events are mutually independent. This means that for $n$ random variables $X_1, \ldots, X_n$, independence is defined by: $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i)$. The same reasoning is applied and the following formulas are identified.

\[ Z(x_1, \ldots, x_n) = \begin{cases} \frac{ p(x_1, \ldots, x_n) - \prod_{i=1}^{n} p(x_i) }{ \min(p(x_1), \ldots, p(x_n)) - \prod_{i=1}^{n} p(x_i) } & dif(x_1, \ldots, x_n) > 0 \\ \\ \frac{ p(x_1, \ldots, x_n) - \prod_{i=1}^{n} p(x_i) }{ \prod_{i=1}^{n} p(x_i) } & dif(x_1, \ldots, x_n) < 0 \\ \\ 0 & dif(x_1, \ldots, x_n) = 0 \end{cases} \]

\[ npmi(x_1, \ldots, x_n) = \begin{cases} \frac{ h(x_1, \ldots, x_n) - \sum_{i=1}^{n} h(x_i) }{ \min(h(x_1), \ldots, h(x_n)) - \sum_{i=1}^{n} h(x_i) } & pmi(x_1, \ldots, x_n) > 0 \\ \\ \frac{ h(x_1, \ldots, x_n) - \sum_{i=1}^{n} h(x_i) }{ h(x_1, \ldots, x_n) } & pmi(x_1, \ldots, x_n) < 0 \\ \\ 0 & pmi(x_1, \ldots, x_n) = 0 \end{cases} \]

These multivariate association measures may help identify complex association relationships that cannot be identified only with bivariate association measures. For example, in the XOR gate, the output of the gate is not associated with any of the two inputs individually (Jakulin and Bratko 2003). The association is only revealed when the two inputs and the output are taken together.

Local Association Subgroup Analysis

Introductory Example

In order to describe this methodology, an illustrative example concerning salt consumption and blood pressure will be discussed. This is widely inspired from M Ducher et al. (2003). Blood pressure is thought to be linearly related to salt consumption. However, evidence supporting this association of variables is widely contradictory (Freedman and Petitti 2001). This suggests that a global relationship may not be applicable to all individuals, but rather only to a subgroup of salt-sensitive individuals. These are to be opposed to salt-resistant individuals for whom no relationship can be established (Kaplan 2010). Global association measures may not be sensitive enough because salt-resistant individuals “dilute” the association that exists for salt-sensitive individuals.

Local association measures allow quantifying association for specific values of salt consumption and blood pressure. Accordingly, individuals can be classified into three corresponding subgroups: independent, positive and negative local association. The positive subgroup corresponds to the subset of values that are well explained by the global association of variables (e.g. low blood pressure and low salt consumption, or high blood pressure and high salt consumption). The corresponding subgroup will thus be composed individuals statistically sensitive to salt. The negative subgroup corresponds to the subset of values badly explained by the global relationship (e.g. low blood pressure and high salt consumption). The corresponding subgroup will thus be composed of individuals statistically resistant to salt. Finally, the independent subgroup corresponds to values for which variables are independent. Once these local subgroups are formed, the global and local association between these subgroups and values of other variables can then be used to determine what distinguishes salt-sensitive from salt-resistant individuals. For example, one may find that young individuals are more resistant to salt (i.e. negative or independent subgroup associated to young age) than older individuals (i.e. positive subgroup associated to old age) (Weinberger 1996).

A More Formal Defintion

The goal of local association subgroup analysis is to identify values $c$ of a random variable $C$ for which the association between random variables $A$ and $B$ depends on. For this, we compute the local association $L$ for all values of variables $A$ and $B$ using lassie. It is then possible to define three subgroups in function of the value taken by $L(a, b)$. The definition of these subgroups can also take into account p-values (as estimated by permtest) to distinguish significantly associated values from independent values. In other words, this corresponds to merging variables $A$ and $B$ into a new variable $S$ as following:

\[ \begin{aligned} Positive&: \{(a, b) \; |\ L(a, b) > 0 \} \\ Independant&: \{(a, b) \; |\ L(a, b) = 0 \} \\ Negative&: \{(a, b) \; |\ L(a, b) < 0 \} \\ \end{aligned} \]

The local association between subgroups $S$ and another variable $C$ can then be estimated. This allows us to identify values $c$ of $C$ that determine the association between $A$ and $B$. In the zebu package, this procedure can be undertaken using the subgroups function. Accordingly, the significance of association can be accessed using the permtest function.

User’s Guide - An Example with Simulated Data: Drug Resistance

Data Simulation

The relevance of local association measures and the usage of the zebu package will be illustrated with a simulated dataset of a clinical trial in which patient recovery is dependent on drug intake and their resistance to the drug. Please keep in mind that the goal of this example is not to be realistic, but rather pedagogic.

Briefly, the dataset is composed of 100 sick patients that are randomly allocated to the placebo or the drug group (50-50). These patients are characterized by a resistance to the drug as modeled by a binary variable; only 20 percent of the patients are sensitive. The health status of patients is monitored through a biomarker that takes continuous values between 0 and 1. Patients with levels above 0.7 are considered as having recovered. Pretreatment levels are modeled by a normal distribution centered around 0.2. The drug has a mean positive effect of 0.5 on biomarker levels for drug-sensitive patients and no effect on resistant patients. The placebo has a positive mean effect of 0.3. The example is constructed so that only drug-sensitive drug-treated patients recover. For more details about the data simulation, see the next section and the make_trial_dataset function.

Global and Local Association Between Taking the Drug and Recovery

The first step is to load the zebu R package and the trial dataset as following.

set.seed(63) # Set seed for reproducibility

library(zebu) # Load zebu
data(trial) # Load trial dataset
head(trial) # Show head of trial dataset

     drug resistance   prebiom  postbiom
1 placebo  resistant 0.4273682 0.6497984
2    drug  resistant 0.2395317 0.4099096
3    drug  resistant 0.2551785 0.4439521
4    drug  resistant 0.3165800 0.5934810
5 placebo  resistant 0.2989971 0.4741008
6 placebo  resistant 0.3563302 0.5332050

Before we continue, we may wish to plot a histogram of biomarker values of before and after treatment for different groups of patients to explore the data. Pretreatment biomarker levels are the same for every group. Simulated posttreatment levels confirm that only the drug-sensitive drug-treated group had values above 0.7 and were considered as having recovered.

ggplot(trial, aes(prebiom, fill = interaction(drug, resistance))) + 
  geom_histogram(alpha=0.5, position="identity", bins = 10) +
  xlab("Biomarker levels before treatment") +
  ylab("Number of Patients") +
  xlim(c(0, 1)) +
  scale_fill_discrete(name = "Patient", 
                      labels = c("Resistant and Drug", 
                                 "Resistant and Placebo",
                                 "Sensitive and Drug", 
                                 "Sensitive and Placebo"))

ggplot(trial, aes(postbiom, fill = interaction(drug, resistance))) + 
  geom_histogram(alpha=0.5, position="identity", bins = 10) +
  xlab("Biomarker levels after treatment") +
  ylab("Number of Patients") +
  xlim(c(0, 1)) +
  scale_fill_discrete(name = "Patient", 
                      labels = c("Resistant and Drug", 
                                 "Resistant and Placebo",
                                 "Sensitive and Drug", 
                                 "Sensitive and Placebo")) +
  geom_vline(xintercept = 0.7)

The local (and global) association between drug intake and patient recovery can be estimated using the lassie function. This function takes at least one argument: a data.frame, here the trial dataset.

Columns are selected using the select arguments (column names or numbers). Variables are assumed to be categorical; continuous variables have to be specified using the continuous argument and the number of discretization bins with the breaks argument (as in the cut function). The local association measure that we use here is Ducher’s Z as specified by setting the measure argument equal to "z".

las <- lassie(trial, 
              select = c("drug", "postbiom"), 
              continuous = "postbiom", 
              breaks = c(0, 0.7, 1), 
              measure = "z")

The permtest function accesses the significance of local (and global) association using a permutation test. The number of iterations is specified by nb and the adjustment method of p-values for multiple comparisons by p_adjust (as in the p.adjust function). A progress bar is also available to make computations seem shorter than they actually are.

las <- permtest(las, 
                nb = 1000, 
                p_adjust = "BH", 
                progress_bar = FALSE)

The lassie and permtest functions returns a lassie S3 object, as well as permtest for permtest. lassie objects can be visualized using the plot and print methods. Moreover, results can be saved in CSV format using write.lassie.

print(las)

Measure: Ducher's Z
Global: 0.575539971949509 (p-value: <1/1000)
         postbiom
drug          [0,0.7] (0.7,1]
  drug    -0.08835905       1
  placebo  1.00000000      -1

plot(las)

The plot function returns a heatmap with local association and p-values displayed between parenthesis. In this example, we can see that the global association between drug intake and patient recovery is strong and statistically significant ($gZ = 0.576, \, p < \frac{1}{1000}$). This would normally be interpreted as a positive effect of the drug on patient recovery. However, our simulation supposes that only 20% of patients are sensitive to the drug. The former conclusion would thus be wrong in 80% of cases. Inspection of local association is of help here.

There is no local association between the events ‘drug’ and ‘not recovering’ ($Z = -0.088, , p = 0.295 $). In plain English, this means that certain patients are insensitive (resistant) to the drug. Comparison of these patients with patients that exhibit positive (or negative) association may help identify differences between these two subgroups and explain why they are resistant to the drug. This can be done using local association subgroup analysis. Finally, note here that a significant global association can hide a non-significant local association.

Local Association Subgroup Analysis on Drug Resistance

Local association subgroup analysis can be called using the subgroups function. Here we wish to compare the local association between drug intake and patient recovery according to the values of a third variable, patient drug resistance. subgroups takes at least two arguments: a lassie object las (association between drug intake and patient recovery) and a data.frame x.

The same optional arguments as in the lassie function, select, continuous and breaks, can be specified. These refer to the x dataset. Here, we only select the variable named resistance. This could, for example, refer to the gene of the drug target or of some drug efflux protein.

The optional arguments thresholds, significance and alpha specify how local association groups should be constructed. thresholds specifies local association value thresholds for subgroups. significance specifies if p-values should be taken into account and alpha the corresponding p-value threshold (alpha error).

sub <- subgroups(las = las, 
                 x = trial, 
                 select = "resistance", 
                 thresholds = c(-0.01, 0.01),
                 significance = TRUE,
                 alpha = 0.01)

Significance of local (and global) association between subgroups and drug resistance can be accessed using permtest

sub <- permtest(sub, nb = 1000)

The subgroups function also returns a lassie S3 object with the same methods of interest: print, plot and write.lassie

print(sub)

Measure: Ducher's Z
Global: 0.509652139144342 (p-value: <1/1000)
           drug_postbiom
resistance  Independent   Positive
  resistant           1 -0.1403439
  sensitive          -1  1.0000000

plot(sub)

The global association between local association subgroups and drug resistance is strong and statistically significant ($gZ = 0.51, \, p < \frac{1}{1000})$). This indicates that the resistance variable as an influence on the association between drug intake and patient recovery. The local association indicates that drug-sensitive patients are over-represented in the positive local association subgroup. This shows that these patients exhibit a positive correlation between drug intake and recovery. Moreover, drug-resistant patients are over-represented in the independent local association subgroup. This shows that there is no correlation between drug intake and recovery for these patients. To state this in a trivial manner, only drug-sensitive patients are sensitive to the drug.

Multivariate Association Between Drug Intake, Recovery and Resistance

The number of variables that can be handled in the zebu package is not limited. Hereunder, for illustration, we estimate the trivariate association between drug intake, recovery, and resistance. The permtest function gives control on how to permute the dataset through the group argument. group is a list of characters corresponding to colnames. Permutations are performed per group meaning that the association structure is not broken within groups but only between them. In our case, we are studying the relation between postbiom and resistance with drug and only want to break the association structure with the drug response, but not between postbiom and resistance.

In this case, we obtain a multidimensional local association array. Because of this, results cannot be plotted as a tile plot; the plot method is not available. The print method allows visualizing results by melting the array into a data.frame sorted by decreasing local association.

las2 <- lassie(trial, 
               select = c("drug", "postbiom", "resistance"), 
               continuous = "postbiom", 
               breaks = c(0, 0.7, 1))
las2 <- permtest(las2, 
                 group = list("drug", c("postbiom", "resistance")), progress_bar = FALSE)
print(las2)

Measure: Ducher's Z
Global: 0.295109223099549 (p-value: 0.016)
     drug postbiom resistance      local  obs      exp           local_p
7    drug  (0.7,1]  sensitive  1.0000000 0.07 0.005796           <1/1000
1    drug  [0,0.7]  resistant  0.3589978 0.39 0.350796 0.242666666666667
6 placebo  [0,0.7]  sensitive  0.2187849 0.11 0.090396             0.818
2 placebo  [0,0.7]  resistant  0.1419389 0.43 0.411804 0.748571428571429
8 placebo  (0.7,1]  sensitive -1.0000000 0.00 0.006804           <1/1000
5    drug  [0,0.7]  sensitive -1.0000000 0.00 0.077004           <1/1000
4 placebo  (0.7,1]  resistant -1.0000000 0.00 0.030996           <1/1000
3    drug  (0.7,1]  resistant -1.0000000 0.00 0.026404           <1/1000

The global trivariate association is weak and its associated p-value not particuliary significant ($Z = 0.295, \, p = 0.016$). This is probably because of the absence of a relationship between resistance and the other variables. Nonetheless, certain events are locally associated. For example, being in the drug group, having recovered and being sensitive to the drug are positively associated events ($Z = 1, \, p < \frac{1}{1000}$). This corresponds to the patients that have reacted to the drug. Note here that a non-significant global association can hide a significant local association.

Future Research and Development

Local association measures are issued from empirical research. Although these have proven their interest in diverse applications, theoretical studies of their mathematical properties are sparse. For example, only Monte Carlo simulations of Ducher’s Z behavior are available (Ducher et al. 1994). A more theoretical approach to these measures could be of interest. For example, by determining the theoretical null distribution of these measures. In addition, we have assumed mutually exclusivity of events for the multivariate association measures. This assumption may be too stringent for certain variables and usage of other independence models such as conditional independence may prove to be worthwhile.

Improvements to the zebu R package are also possible. For example, in zebu, discretization is a necessary step for studying continuous variables. We have restrained ourselves to very simple discretization methods: equal-width and user-defined. Other discretization algorithms exist (Dash, Paramguru, and Dash 2011) and may be more adapted for the computation of association measures. Moreover, kernel methods could also be used to better handle continuous variables. Secondly, estimation of probabilities is done from the frequentist maximum-likelihood procedure which requires sufficiently large datasets. Unfortunately, in certain fields such as health sciences, datasets are sparse. Bayesian estimation methods have been shown to be more robust to small sample sizes by not relying on asymptomatic assumptions and by allowing integration of prior knowledge (Wilkinson 2007). Such an implementation may also prove to be of interest. Finally, the permtest function in Zebu is based on an iterative procedure that is slow in R. To speed this up, writing the function in C and calling it from R could be a reliable solution.

Competing Interests

The authors declare that they have no competing interests.

Authors’ Contributions

MD conceived this project. OM wrote the software code. MD contributed to software development by testing and providing constructive critical comments. OM wrote the manuscript. MD had the primary responsibility for the final content. All authors read and approved the final manuscript.

Acknowledgements

The authors are grateful to Pascal Maire for making this project possible and to Cécile Serre and Laurent Bourguignon for precious advice.

References

Anselin, Luc. 1995. “Local Indicators of Spatial Association—LISA.” Geographical Analysis 27 (2). Blackwell Publishing Ltd: 93–115.

Bouma, Gerlof. 2009. “Normalized (Pointwise) Mutual Information in Collocation Extraction.” Proceedings of the Biennial GSCL Conference, 31–40.

Dash, Rajashree, Rajib Lochan Paramguru, and Rasmita Dash. 2011. “Comparative Analysis of Supervised and Unsupervised Discretization Techniques.” International Journal of Advances in Science and Technology 2 (3).

Ducher, M, JP Fauvel, M Maurin, M Laville, P Maire, CZ Paultre, and C Cerutti. 2003. “Sodium Intake and Blood Pressure in Healthy Individuals.” Journal of Hypertension 21 (2): 289–94.

Ducher, M., C. Cerutti, M.P. Gustin, and C.Z. Paultre. 1994. “Statistical Relationships Between Systolic Blood Pressure and Heart Rate and Their Functional Significance in Conscious Rats.” Medical and Biological Engineering and Computing 32 (6): 649–55.

Ducher, Michel, Jean-Pierre Fauvel, Michel Maurin, Maurice Laville, Pascal Maire, Christian Z Paultre, and Catherine Cerutti. 2003. “Sodium Intake and Blood Pressure in Healthy Individuals.” Journal of Hypertension 21 (2). LWW: 289–94.

Ducher, Michel, Pascal Maire, Catherine Cerutti, Yann Bourhis, Fréderic Foltz, Pernille Sorensen, Roger Jelliffe, and Jean-Pierre Fauvel. 2001. “Renal Elimination of Amikacin and the Aging Process.” Clinical Pharmacokinetics 40 (12). Springer: 947–53.

Freedman, David A, and Diana B Petitti. 2001. “Salt and Blood Pressure Conventional Wisdom Reconsidered.” Evaluation Review 25 (3). Sage Publications: 267–87.

Isola, Phillip, Daniel Zoran, Dilip Krishnan, and Edward H. Adelson. 2014. “Crisp Boundary Detection Using Pointwise Mutual Information.” In Computer Vision ECCV 2014, edited by David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, 799–814. Lecture Notes in Computer Science 8691. Springer International Publishing.

Jakulin, Aleks, and Ivan Bratko. 2003. “Analyzing Attribute Dependencies.” In Knowledge Discovery in Databases: PKDD 2003, edited by Nada Lavrac, Dragan Gamberger, Ljupco Todorovski, and Hendrik Blockeel, 229–40. Lecture Notes in Computer Science 2838. Springer Berlin Heidelberg.

Kaplan, Norman M. 2010. Kaplan’s Clinical Hypertension. Lll. Lippincott Williams & Wilkins.

Pecina, Pavel. 2008. “Machine Learning Approach to Multiword Expression Extraction.”

Sapoznikov, Dan, Michal Dranitzki Elhalel, and Dvora Rubinger. 2013. “Heart Rate Response to Blood Pressure Variations: Sympathetic Activation Versus Baroreflex Response in Patients with End-Stage Renal Disease.” PLoS ONE 8 (10): e78338.

Sheskin, David J. 2007. Handbook of Parametric and Nonparametric Statistical Procedures. 4th ed. Chapman & Hall/CRC.

Shipley, Bill. 2000. Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference. Cambridge University Press.

Van de Cruys, Tim. 2011. “Two Multivariate Generalizations of Pointwise Mutual Information.” In Proceedings of the Workshop on Distributional Semantics and Compositionality, 16–20. DiSCo ’11. Association for Computational Linguistics.

Weinberger, Myron H. 1996. “Salt Sensitivity of Blood Pressure in Humans.” Hypertension 27 (3). Am Heart Assoc: 481–90.

Wilkinson, Darren J. 2007. “Bayesian Methods in Bioinformatics and Computational Systems Biology.” Briefings in Bioinformatics 8 (2): 109–16.