MultiFIT: Multiscale Fisher’s Independence Test for Multivariate Dependence

Examples

First Example:

Generate Data:

set.seed(1)
# Generate data for two random vectors, each of dimension 2, 300 observations:
n=300
x = matrix(0, ncol=2, nrow=n)
y = matrix(0, ncol=2, nrow=n)

# x1 and y1 are i.i.d Normal(0,1):
x[,1]=rnorm(n)
y[,1]=rnorm(n)
    
# x2 is a Uniform(0,1):  
x[,2]=runif(n)

# and y2 is depends on x2 as a noisy sine function:
y[,2]=sin(5*pi*x[,2]) + 0.6*rnorm(n)

plot(x[,1],y[,1], col="grey", pch="x", xlab="x1", ylab="y1")
plot(x[,1],y[,2], col="grey", pch="x", xlab="x1", ylab="y2")
plot(x[,2],y[,1], col="grey", pch="x", xlab="x2", ylab="y1")
plot(x[,2],y[,2], col="grey", pch="x", xlab="x2", ylab="y2")

Run the Test:

library(MultiFit)
fit = multiFit(x=x, y=y)
fit$p.values

##            H   Hcorrected           MH 
## 2.800233e-05 1.937348e-05 1.267042e-05

In order to get a better sense of the workings of the function, choose verbose=TRUE:

# Data may also be transferred to the function as a single list:
xy = list(x=x,y=y)
fit = multiFit(xy, verbose=TRUE)

## Applying rank transformation
## Testing and computing mid-p corrected p-values:
## Resolution 0/4: performing 4 tests
## Resolution 1/4: scanning 16 cuboids, performing 32 tests
## Resolution 2/4: scanning 4 cuboids, performing 16 tests
## Resolution 3/4: scanning 16 cuboids, performing 56 tests
## Resolution 4/4: scanning 12 cuboids, performing 48 tests
## No potential parents in resolution 4 have p-values below threshold.
## Time difference of 0.02850723 secs
## Individual tests completed, post-processing...
## H: Computing all adjusted p-values...
## Hcorrected: Computing all adjusted p-values...
## MH: Computing CDF...
## MH: Computing all adjusted p-values...
## Time difference of 0.0009000301 secs
## 
## Fisher's Exact Tests:
## Mean of -log(p-values): 1.62688
## Mean of top 4 -log(p-values): 9.83994
## Mean of -log(p-values with mid-p correction): 1.82527
## Mean of top 4 -log(p-values with mid-p correction): 10.2949
## 
## Global p-value, Holm on p-values: 2.80023e-05
## Global p-value, Holm on p-values with mid-p correction: 1.93735e-05
## Global p-value, Modified Holm step-down: 1.26704e-05

The output details the number of tests performed at each resolution. The default testing method for the marginal \(2\times2\) contingency tables is Fisher’s exact test. Several global test statistics are reported:

Mean of \(-log(\text{p-values})\)
Mean of top 4 \(-log(\text{p-values})\)
Mean of \(-log(\text{p-values with mid-p correction})\)
Mean of top 4 \(-log(\text{p-values with mid-p correction})\)).

These are not associated with p-values until we genrate a permutation null distribution for them using permNullTest. The default multiple testing adjustments methods we use are Holm’s method on the original p-values (H), Holm’s method on the mid-p corrected p-values (Hcorrected) and a Modified Holm (MH)¹. The p-value for the global null hypothesis that \(\mathbf{x}\) is independent of \(\mathbf{y}\) is reported for each adjustment method.

Summarize Results (1):

In order to get a sense of the specific marginal tests that are significant at the alpha=0.005 level, we may use the function multiSummary:

multiSummary(xy=xy, fit=fit, alpha=0.05)

## 
## The following tests had a p-value of less than 0.05:
## Ranked #1, Test 32: x2 and y2 | 0<=x2<0.46 (p-value=1.267042e-05)

## Ranked #2, Test 44: x2 and y2 | 0.22<=x2<0.46 (p-value=0.005794476)

## Ranked #3, Test 48: x2 and y2 | 0<=x2<0.46, -2.39<=y2<0.22 (p-value=0.01074775)

In grey and orange are all data points outside the cuboid we are testing. In orange are the points that were in the cuboid if we were not to condition on the margins that are visible in the plot. In red are the points that are inside the cuboid after we condition on all the margins, including those that are visible in a plot. The blue lines delineate the quadrants along which the discretization was performed: we count the number of red points in each quadrant, treat these four numbers as a \(2\times2\) contingency table and perform a 1-degree of freedom test of independence on it (default test: Fisher’s exact test).

Summarize Results (2):

We may also draw a directed acyclic graph where nodes represent tests as demonstrated above in the multiSummary output. An edge from one test to another indicates that the latter test is performed on half the portion of the sample space on which the former was performed. Larger nodes correspond to more extreme p-values for the test depicted in it (storing the output as a pdf file):

# And plot a DAG representation of the ranked tests:
library(png)
library(qgraph)

## Registered S3 methods overwritten by 'huge':
##   method    from   
##   plot.sim  BDgraph
##   print.sim BDgraph

multiTree(xy=xy, fit=fit, filename="first_example")

## Output stored in /tmp/RtmpqAoTDq/Rbuild24f06b3e9379/MultiFit/vignettes/first_example.pdf

We see that, in agreement with the output of the multiSummary function, nodes 32, 44 and 48 (which correspond to tests 32, 44 and 48) are the largest compared to the other nodes.

Test More Cuboids:

In the default setting, p_star, the fixed threshold for \(p\)-values of tests that will be further explored in higher resolutions, is set to \((D_x\cdot D_y\cdot \log_2(n))^{-1}\). We may choose, e.g., p_star=0.1, which takes longer. In this case the MultiFit identifies more tables with adjusted p-values that are below alpha=0.005. However, the global adjusted p-values are less extreme than when performing the MultiFit with fewer tests:

fit1 = multiFit(xy, p_star = 0.1, verbose=TRUE)

## Applying rank transformation
## Testing and computing mid-p corrected p-values:
## Resolution 0/4: performing 4 tests
## Resolution 1/4: scanning 16 cuboids, performing 32 tests
## Resolution 2/4: scanning 20 cuboids, performing 76 tests
## Resolution 3/4: scanning 68 cuboids, performing 196 tests
## Resolution 4/4: scanning 88 cuboids, performing 284 tests
## No potential parents in resolution 4 have p-values below threshold.
## Time difference of 0.04033327 secs
## Individual tests completed, post-processing...
## H: Computing all adjusted p-values...
## Hcorrected: Computing all adjusted p-values...
## MH: Computing CDF...
## MH: Computing all adjusted p-values...
## Time difference of 0.001901627 secs
## 
## Fisher's Exact Tests:
## Mean of -log(p-values): 1.31104
## Mean of top 4 -log(p-values): 16.4356
## Mean of -log(p-values with mid-p correction): 1.52395
## Mean of top 4 -log(p-values with mid-p correction): 16.9313
## 
## Global p-value, Holm on p-values: 9.60421e-13
## Global p-value, Holm on p-values with mid-p correction: 4.81733e-13
## Global p-value, Modified Holm step-down: 1.3204e-13

multiSummary(xy=xy, fit=fit1, alpha=0.005, plot.tests=FALSE)

## 
## The following tests had a p-value of less than 0.005:
## Ranked #1, Test 100: x2 and y2 | 0.46<=x2<0.74 (p-value=1.320403e-13)
## Ranked #2, Test 32: x2 and y2 | 0<=x2<0.46 (p-value=2.925387e-05)

In order to perform the test even more exhaustively, one may:

# 1. set p_star=Inf, running through all tables up to the maximal resolution
# which by default is set to log2(n/100):
ex1 = multiFit(xy, p_star = 1)

# 2. set both p_star=1 and the maximal resolution R_max=Inf.
# In this case, the algorithm will scan through higher and higher resolutions,
# until there are no more tables that satisfy the minimum requirements for 
# marginal totals: min.tbl.tot, min.row.tot and min.col.tot (whose default values 
# are presented below):
ex2 = multiFit(xy, p_star = 1, R_max=Inf,
               min.tbl.tot = 25L, min.row.tot = 10L, min.col.tot = 10L)

# 3. set smaller minimal marginal totals, that will result in testing 
# even more tables in higher resolutions:
ex3 = multiFit(xy, p_star = 1, R_max=Inf,
               min.tbl.tot = 10L, min.row.tot = 4L, min.col.tot = 4L)

A Local Signal:

MultiFit excels in locating very localized signals.

Generate a Local Signal:

# Generate data for two random vectors, each of dimension 2, 800 observations:
n=800
x = matrix(0, ncol=2, nrow=n)
y = matrix(0, ncol=2, nrow=n)

# x1, x2 and y1 are i.i.d Normal(0,1):
x[,1]=rnorm(n)
x[,2]=rnorm(n)
y[,1]=rnorm(n)

# y2 is i.i.d Normal(0,1) on most of the space:
y[,2]=rnorm(n)
# But is linearly dependent on x2 in a small portion of the space:
w=rnorm(n)
portion.of.space = x[,2]>0 & x[,2]<0.7 & y[,2]>0 & y[,2]<0.7
y[portion.of.space,2] = x[portion.of.space,2]+(1/12)*w[portion.of.space]
xy.local = list(x=x, y=y)

Search for It and Summarize the Results:

Truly local signals may not be visible to our algorithm in resolutions that are lower than the one that the signal is embedded in. In order to cover all possible tests up to a given resolution (here: resolution 4), we use the parameter R_star=4 (from resolution 5 onwards, only tables with \(p\)_values below p_star will be further tested):

fit.local = multiFit(xy=xy.local, R_star=4, verbose=TRUE)

## Applying rank transformation
## Testing and computing mid-p corrected p-values:
## Resolution 0/6: performing 4 tests
## Resolution 1/6: scanning 16 cuboids, performing 32 tests
## Resolution 2/6: scanning 128 cuboids, performing 160 tests
## Resolution 3/6: scanning 640 cuboids, performing 640 tests
## Resolution 4/6: scanning 2560 cuboids, performing 2240 tests
## Resolution 5/6: scanning 140 cuboids, performing 524 tests
## Resolution 6/6: scanning 16 cuboids, performing 64 tests
## No pairs of margins in resolution 6 had enough observations to be tested.
## Time difference of 0.1804612 secs
## Individual tests completed, post-processing...
## H: Computing all adjusted p-values...
## Hcorrected: Computing all adjusted p-values...
## MH: Computing CDF...
## MH: Computing all adjusted p-values...
## Time difference of 0.02168202 secs
## 
## Fisher's Exact Tests:
## Mean of -log(p-values): 0.815441
## Mean of top 4 -log(p-values): 9.83624
## Mean of -log(p-values with mid-p correction): 0.999486
## Mean of top 4 -log(p-values with mid-p correction): 10.2239
## 
## Global p-value, Holm on p-values: 1.61645e-05
## Global p-value, Holm on p-values with mid-p correction: 8.11708e-06
## Global p-value, Modified Holm step-down: 5.61068e-06

multiSummary(xy=xy.local, fit=fit.local, plot.margin=TRUE, pch="`")

## 
## The following tests had a p-value of less than 0.05:
## Ranked #1, Test 2928: x2 and y2 | -0.06<=x2<0.7, -0.06<=y2<0.62 (p-value=5.610678e-06)

A Signal that is Spread Between More than 2 Margins:

MultiFit also has the potential to identify complex conditional dependencies in multivariate signals.

Generate Data and Examine Margins:

Take \(\mathbf{x}\) and \(\mathbf{y}\) to be each of three dimensions, with 700 data points. We first generate a marginal circle dependency: \(x_1\), \(y_1\), \(x_2\), and \(y_2\) are all i.i.d standard normals. Take \(x_3=\cos(\theta)+\epsilon\), \(y_3=\sin(\theta)+\epsilon'\) where \(\epsilon\) and \(\epsilon'\) are i.i.d \(\mathrm{N}(0,(1/10)^2)\) and \(\theta\sim \mathrm{Uniform}(-\pi,\pi)\). I.e., the original dependency is between \(x_3\) and \(y_3\).

# Marginal signal:

# Generate data for two random vectors, each of dimension 3, 800 observations:
n=800
x = matrix(0, ncol=3, nrow=n)
y = matrix(0, ncol=3, nrow=n)

# x1, x2, y1 and y2 are all i.i.d Normal(0,1)
x[,1]=rnorm(n)
x[,2]=rnorm(n)
y[,1]=rnorm(n)
y[,2]=rnorm(n)
    
# x3 and y3 form a noisy circle:
theta = runif(n,-pi,pi)
x[,3] = cos(theta) + 0.1*rnorm(n)
y[,3] = sin(theta) + 0.1*rnorm(n)

par(mfrow=c(3,3))
par(mgp=c(0,0,0))
par(mar=c(1.5,1.5,0,0))
for (i in 1:3) {
  for (j in 1:3) {
    plot(x[,i],y[,j], col="black", pch=20, xlab=paste0("x",i), ylab=paste0("y",j),
         xaxt="n", yaxt="n")
  }
}

Next, rotate the circle in \(\pi/4\) degrees in the \(x_2\)-\(x_3\)-\(y_3\) space by applying:

\(\left[\begin{matrix}\cos(\pi/4) & -sin(\pi/4) & 0\\\sin(\pi/4) & cos(\pi/4) & 0\\ 0 & 0 & 1\end{matrix}\right]\left[\begin{matrix}| & | & |\\X_2 & X_3 & Y_3\\| & | & |\end{matrix}\right]\)

I.e., once rotated the signal is ‘spread’ between \(x_2\), \(x_3\) and \(y_3\), and harder to see through the marginal plots.

# And now rotate the circle:
phi = pi/4
rot.mat = matrix(c(cos(phi), -sin(phi),  0,
                   sin(phi),  cos(phi),  0,
                   0,         0,         1), nrow=3, ncol=3)
xxy = t(rot.mat%*%t(cbind(x[,2],x[,3],y[,3])))

x.rtt = matrix(0, ncol=3, nrow=n)
y.rtt = matrix(0, ncol=3, nrow=n)

x.rtt[,1] = x[,1]
x.rtt[,2] = xxy[,1]
x.rtt[,3] = xxy[,2]
y.rtt[,1] = y[,1]
y.rtt[,2] = y[,2]
y.rtt[,3] = xxy[,3]

par(mfrow = c(3,3))
par(mgp = c(0,0,0))
par(mar = c(1.5,1.5,0,0))
for (i in 1:3) {
  for (j in 1:3) {
    plot(x.rtt[,i],y.rtt[,j], col = "black", pch = 20, xlab = paste0("x", i),
         ylab = paste0("y", j), xaxt = "n", yaxt = "n")
  }
}

xy.rtt.circ = list(x = x.rtt, y = y.rtt)

Run the Test and Summarize the Data:

Choose R_star to be 2 to cover exhaustively all resolutions up to 2, and rnd=FALSE to consider all tests whose \(p\)-value is below p_star.

fit.rtt.circ = multiFit(xy = xy.rtt.circ, R_star = 2, verbose = TRUE)

## Applying rank transformation
## Testing and computing mid-p corrected p-values:
## Resolution 0/6: performing 9 tests
## Resolution 1/6: scanning 36 cuboids, performing 108 tests
## Resolution 2/6: scanning 432 cuboids, performing 756 tests
## Resolution 3/6: scanning 76 cuboids, performing 549 tests
## Resolution 4/6: scanning 56 cuboids, performing 450 tests
## Resolution 5/6: scanning 48 cuboids, performing 387 tests
## Resolution 6/6: scanning 8 cuboids, performing 72 tests
## No pairs of margins in resolution 6 had enough observations to be tested.
## Time difference of 0.1101279 secs
## Individual tests completed, post-processing...
## H: Computing all adjusted p-values...
## Hcorrected: Computing all adjusted p-values...
## MH: Computing CDF...
## MH: Computing all adjusted p-values...
## Time difference of 0.02711082 secs
## 
## Fisher's Exact Tests:
## Mean of -log(p-values): 1.00888
## Mean of top 4 -log(p-values): 19.4372
## Mean of -log(p-values with mid-p correction): 1.1468
## Mean of top 4 -log(p-values with mid-p correction): 19.9051
## 
## Global p-value, Holm on p-values: 2.76779e-07
## Global p-value, Holm on p-values with mid-p correction: 1.98468e-07
## Global p-value, Modified Holm step-down: 1.28572e-07

multiSummary(xy = xy.rtt.circ, fit = fit.rtt.circ, alpha = 0.001)

## 
## The following tests had a p-value of less than 0.001:
## Ranked #1, Test 684: x3 and y3 | -2.7<=x2<-0.07, -1.23<=y3<-0.11 (p-value=1.285722e-07)

## Ranked #2, Test 1746: x3 and y3 | -0.65<=x2<-0.07, -2.91<=x3<0.01, -0.11<=y3<1.23 (p-value=8.617062e-07)

## Ranked #3, Test 1206: x3 and y3 | 0.6<=x2<2.76, -1.23<=y3<-0.11 (p-value=3.014417e-05)

## Ranked #4, Test 708: x2 and y3 | 0.01<=x3<3.01, -1.23<=y3<-0.11 (p-value=4.012281e-05)

## Ranked #5, Test 738: x3 and y3 | -2.7<=x2<-0.07, -0.11<=y3<1.23 (p-value=6.200025e-05)

## Ranked #6, Test 1779: x2 and y3 | -2.7<=x2<-0.07, -0.59<=x3<0.01, -0.11<=y3<1.23 (p-value=6.512338e-05)

## Ranked #7, Test 1401: x2 and y3 | 0.63<=x3<3.01, -0.11<=y3<1.23 (p-value=0.0001410845)

## Ranked #8, Test 1008: x3 and y3 | -2.7<=x2<-0.07, -3<=y1<0.04, -1.23<=y3<-0.11 (p-value=0.000246823)

## Ranked #9, Test 1239: x2 and y3 | -0.07<=x2<2.76, 0.01<=x3<3.01, -1.23<=y3<-0.11 (p-value=0.0004040133)

Notice how the signal is detected both in the \(x_3\)-\(y_3\) plane and the \(x_2\)-\(y_3\) plane.

A Superimposed Signal:

Here we examine MultiFit’s ability to

detect a signal that is comprised of two sine waves in different frequencies
given a third dimension that determines which data points belong to which wave, to see if our algorithm successfully identifies this relation.

We take \(\mathbf{x}\) to be a two dimensional random variable and \(\mathbf{y}\) to be one dimensional, all with 550 data points. Define \(x_1\sim U(0,1)\), \(x_2\sim Beta(0.3,0.3)\) independent of \(x_1\), and define:

\(Y = \begin{cases} \sin(10\cdot x_1) + \epsilon, & \text{if }x_2 > 0.75\\ \sin(40\cdot x_1) + \epsilon, & \text{if }x_2\leq0.75 \end{cases}\)

Generate the Signal:

n=600
x=matrix(0,nrow=n,ncol=2)
x[,1]=runif(n)
x[,2]=rbeta(n,.3,.3)

epsilon=rnorm(n,0,0.3)

y=matrix(0,nrow=n,ncol=1)
y[,1]=sin(10*x[,1])*(x[,2]>0.5)+sin(40*x[,1])*(x[,2]<=0.5)+epsilon

par(mfrow=c(1,2))
par(mgp=c(0,0,0))
par(mar=c(1.5,1.5,0,0))
plot(x[,1],y[,1], col="black", pch=20, xlab=paste0("x1"), ylab=paste0("y1"),
         xaxt="n", yaxt="n")
plot(x[,2],y[,1], col="black", pch=20, xlab=paste0("x2"), ylab=paste0("y1"),
         xaxt="n", yaxt="n")

Test and Summarize:

fit.superimpose=multiFit(x=x, y=y)

## Warning in data.table::setDT(W): Some columns are a multi-column type (such
## as a matrix column): [4, 7, 16]. setDT will retain these columns as-is but
## subsequent operations like grouping and joining may fail. Please consider
## as.data.table() instead which will create a new column for each embedded
## column.

## Warning in data.table::setDT(W): Some columns are a multi-column type (such
## as a matrix column): [4, 7, 13, 16]. setDT will retain these columns as-
## is but subsequent operations like grouping and joining may fail. Please
## consider as.data.table() instead which will create a new column for each
## embedded column.

## Warning in data.table::setDT(W): Some columns are a multi-column type (such
## as a matrix column): [4, 7, 13, 16]. setDT will retain these columns as-
## is but subsequent operations like grouping and joining may fail. Please
## consider as.data.table() instead which will create a new column for each
## embedded column.

## Warning in data.table::setDT(W): Some columns are a multi-column type (such
## as a matrix column): [4, 7, 13, 16]. setDT will retain these columns as-
## is but subsequent operations like grouping and joining may fail. Please
## consider as.data.table() instead which will create a new column for each
## embedded column.

## Warning in data.table::setDT(W): Some columns are a multi-column type (such
## as a matrix column): [4, 7, 13, 16]. setDT will retain these columns as-
## is but subsequent operations like grouping and joining may fail. Please
## consider as.data.table() instead which will create a new column for each
## embedded column.

multiSummary(x=x, y=y, fit=fit.superimpose, alpha=0.0001)

## 
## The following tests had a p-value of less than 1e-04:
## Ranked #1, Test 3: x1 and y1 | 0<=x1<0.48 (p-value=5.400447e-14)

## Ranked #2, Test 95: x1 and y1 | 0.48<=x1<0.76, 0.51<=x2<1 (p-value=9.52744e-13)

## Ranked #3, Test 23: x1 and y1 | 0.48<=x1<0.76 (p-value=4.474281e-07)

## Ranked #4, Test 85: x1 and y1 | 0.76<=x1<1, 0.51<=x2<1 (p-value=5.850135e-07)

## Ranked #5, Test 25: x1 and y1 | 0.76<=x1<1 (p-value=1.070613e-06)

## Ranked #6, Test 127: x1 and y1 | 0.24<=x1<0.48, 0.51<=x2<1, -1.64<=y1<0.18 (p-value=3.418641e-06)

## Ranked #7, Test 195: x1 and y1 | 0.88<=x1<1, 0<=x2<0.51 (p-value=7.041063e-06)

## Ranked #8, Test 16: x2 and y1 | 0<=x1<0.24 (p-value=8.934898e-06)

Notice how the separate signals are identified in the 2\(^{nd}\), 4\(^{th}\), 6\(^{th}\) and 7\(^{th}\) ranking tests.

A Univariate Example:

In the univariate case, Ma and Mao (2017) show that the p-values for Fisher’s exact test are mutually independent under the null hypothesis of independence. Thus, we may also generate approximate and theoretical null distributions for the global test statistics that are much faster to compute, compared to the permutation null distribution.

Generate Data:

n=300
# y is a noisy quadratic function of x:
x.uv = runif(n)
y.uv = (x.uv-0.5)^2 + 0.2*rnorm(n)

plot(x.uv,y.uv, col="grey", pch="x", xlab="x", ylab="y")

xy.uv = list(x=x.uv, y=y.uv)

Test:

# Apply the test and in addition compute approximate and theoretical null distributions for the global test statistics:
fit.uv = multiFit(xy=xy.uv, uv.approx.null = TRUE, uv.exact.null = TRUE,
                  uv.null.sim = 10000L, verbose=TRUE)

## Applying rank transformation
## Testing and computing mid-p corrected p-values:
## Resolution 0/4: performing 1 tests
## Resolution 1/4: scanning 4 cuboids

## Warning in data.table::setDT(W): Some columns are a multi-column type (such
## as a matrix column): [2, 3, 4, 5, 11, 12]. setDT will retain these columns
## as-is but subsequent operations like grouping and joining may fail. Please
## consider as.data.table() instead which will create a new column for each
## embedded column.

## , performing 4 tests
## Resolution 2/4: scanning 4 cuboids

## Warning in data.table::setDT(W): Some columns are a multi-column type (such
## as a matrix column): [2, 3, 4, 5, 9, 10, 11, 12]. setDT will retain these
## columns as-is but subsequent operations like grouping and joining may fail.
## Please consider as.data.table() instead which will create a new column for
## each embedded column.

## , performing 4 tests
## Resolution 3/4: scanning 4 cuboids

## Warning in data.table::setDT(W): Some columns are a multi-column type (such
## as a matrix column): [2, 3, 4, 5, 9, 10, 11, 12]. setDT will retain these
## columns as-is but subsequent operations like grouping and joining may fail.
## Please consider as.data.table() instead which will create a new column for
## each embedded column.

## , performing 4 tests
## No potential parents in resolution 3 have p-values below threshold.
## Time difference of 0.008395672 secs
## Individual tests completed, post-processing...
## H: Computing all adjusted p-values...
## Hcorrected: Computing all adjusted p-values...
## MH: Computing CDF...
## MH: Computing all adjusted p-values...
## Time difference of 0.0004005432 secs
## Simulating an approximate null distribution...
## Time difference of 0.2118592 secs
## Simulating from the theoretical null distribution...
## Time difference of 0.7392795 secs
## 
## Fisher's Exact Tests:
## Mean of -log(p-values): 1.31783
## Mean of top 4 -log(p-values): 3.04394
## Mean of -log(p-values with mid-p correction): 1.48881
## Mean of top 4 -log(p-values with mid-p correction): 3.26622
## 
## Global p-value, theoretical null, mean -log(p-values with mid-p correction): 0.0224
## Global p-value, theoretical null, mean top 4 -log(p-values with mid-p correction): 0.0139
## 
## Global p-value, approximate null, mean -log(p-values with mid-p correction): 0.0647
## Global p-value, approximate null, mean top 4 -log(p-values with mid-p correction): 0.0286
## 
## Global p-value, Holm on p-values: 0.00659055
## Global p-value, Holm on p-values with mid-p correction: 0.00507617
## Global p-value, Modified Holm step-down: 0.00480701

Important Parameters for `MultiFit`:

p_star: A real number between 0 and 1, a fixed threshold. Only tests with p-value below p_star will be further halved and their descendants tested.
R_max: A positive integer (or Inf), the maximal number of resolutions to scan (algorithm will stop at a lower resolution if all tables in it do not meet the criteria specified at min.tbl.tot, min.row.tot and min.col.tot. The default value for the R_max is \(\lfloor\log_2(n/10)\rfloor\) where \(n\) is the number of observations in the vectors.
R_star: A positive integer, if set to an integer between 0 and R_max, all tests up to and including resolution R_star will be performed (algorithm will stop at a lower resolution than requested if all tables in it do not meet the criteria specified at min.tbl.tot, min.row.tot and min.col.tot). For higher resolutions the children of tables with \(p\)-values more extreme than p_star will be selected.
rank.transform: Logical, if TRUE, marginal rank transform is performed on all margins of \(\mathbf{x}\) and . If FALSE, all margins are scaled to 0-1 scale. When FALSE, the average and top statistics of the negative logarithm of the p-values are only computed for the univariate case.
test.method: String, choose “Fisher” for Fisher’s exact test (slowest), “chi.sq” for \(\chi^2\) test, “LR” for likelihood-ratio test and “norm.approx” for approximating the hypergeometric distribution with a normal distribution (fastest).
correct: Logical, if TRUE compute mid-p corrected p-values for Fisher’s exact test, or Yates corrected p-values for the \(\chi^2\) test, or Williams corrected p-values for the likelihood-ratio test.
min.tbl.tot: Non-negative integer, the minimal number of observations per table below which a p-value for a given table will not be computed.
min.row.tot: Non-negative integer, the minimal number of observations for row totals in the \(2\times2\) contingency tables below which a contingency table will not be tested.
min.col.tot: Non-negative integer, the minimal number of observations for column totals in the \(2\times2\) contingency tables below which a contingency table will not be tested.
p.adjust.methods: String, choose between "H" for Holm, "Hcorrected" for Holm with the correction as specified in correct, or "MH" for Modified-Holm (for Fisher’s exact test only).
compute.all.holm: Logical, if FALSE, only global p-value is computed (may be faster, especially when Modified-Holm correction is used). If TRUE adjusted p-values are computed for all tests.
cutoff: Numerical between 0 and 1, an upper limit for the p-values that are to be adjusted (the lower the cutoff - the fewer computations are required for the Modified Holm method).
top.max.ps: Positive integer, report the mean of the top top.max.ps order statistics of the negative logarithm of all p-values.
uv.approx.null: Logical, in a univariate case, if TRUE and the testing method is either Fisher’s exact test or the normal approximation of the hypergeometric distribution, an approximate null distribution for the global test statistics is simulated.
uv.exact.null: Logical, in a univariate case, if TRUE and the testing method is either Fisher’s exact test or the normal approximation of the hypergeometric distribution, a theoretical null distribution for the global test statistics is simulated.
uv.null.sim: Positive integer, the number of simulated values to be computed in a univariate case when a theoretical or approximate null distribution is simulated.

The latter is the most powerful and computationaly intense of the three↩

MultiFIT: Multiscale Fisher’s Independence Test for Multivariate Dependence

S. Gorsky and L. Ma

Examples

First Example:

Generate Data:

Run the Test:

Summarize Results (1):

Summarize Results (2):

Test More Cuboids:

A Local Signal:

Generate a Local Signal:

Search for It and Summarize the Results:

A Signal that is Spread Between More than 2 Margins:

Generate Data and Examine Margins:

Run the Test and Summarize the Data:

A Superimposed Signal:

Generate the Signal:

Test and Summarize:

A Univariate Example:

Generate Data:

Test:

Important Parameters for MultiFit:

Important Parameters for `MultiFit`: