Accounting for correlations among measurements

Matthew Stephens

2020-06-17

Introduction

In some settings measurements and tests in different conditions may be correlated with one another. For example, in eQTL applications this can occur due to sample overlap among the different conditions.

Failure to deal with such correlations can cause false positives in a mashr analysis.

To deal with these correlations mashr allows the user to specify a correlation matrix \(V\) when setting up the data in mash_set_data. The easiest way to specify this correlation matrix is to estimate it using estimate_null_correlation_simple, which, as its name suggests, uses the null tests (specifically, tests without a strong \(z\) score) to estimate the correlations.

Illustration

Here we simulate data for illustration. The data does not have any correlation, but we will analyze it as if we did not know that.

library(ashr)
library(mashr)
set.seed(1)
simdata = simple_sims(500,5,1)

Read in the data, and estimate correlations:

data   = mash_set_data(simdata$Bhat, simdata$Shat)
V = estimate_null_correlation_simple(data)
data.V = mash_update_data(data, V=V)

Now we have two mash data objects, one (data.V) with correlations specified, and one without (data). So analyses using data.V will allow for correlations, whereas analyses using data will assume measurements are independent.

Here, for illustration purposes, we proceed to analyze the data with correlations, using just the simple canonical covariances as in the initial introductory vignette.

U.c = cov_canonical(data.V) 
m.c = mash(data.V, U.c) # fits with correlations because data.V includes correlation information 
#  - Computing 2000 x 151 likelihood matrix.
#  - Likelihood calculations took 0.04 seconds.
#  - Fitting model with 151 mixture components.
#  - Model fitting took 0.59 seconds.
#  - Computing posterior matrices.
#  - Computation allocated took 0.01 seconds.
print(get_loglik(m.c),digits=10) # log-likelihood of the fit with correlations set to V
# [1] -16121.11177

We can also compare with the original analysis. (Note that the canonical covariances do not depend on the correlations, so we can use the same U.c here for both analyses. If we used data-driven covariances we might prefer to estimate these separately for each analysis as the correlations would affect them.)

m.c.orig = mash(data, U.c) # fits without correlations because data object was set up without correlations
#  - Computing 2000 x 151 likelihood matrix.
#  - Likelihood calculations took 0.04 seconds.
#  - Fitting model with 151 mixture components.
#  - Model fitting took 0.63 seconds.
#  - Computing posterior matrices.
#  - Computation allocated took 0.02 seconds.
print(get_loglik(m.c.orig),digits=10)
# [1] -16120.32142

The log-likelihoods with and without correlations are similar here, which is expected since there are no actual correlations in the data.