Cell-type-specific resolution epigenetics using Tensor Composition Analysis in R

Elior Rahmani

2019-11-16

In the context of DNA methylation data, Tensor Composition Analysis (TCA) can extract cell-type-specific resolution (individuals by methylation sites by cell types) from tissue-level bulk methylation data (individuals by methylation sites)1. Particularly, TCA can be used for conducting powerful association studies with cell-type-specific resolution and it allows to estimate the cell-type-specific methylation levels of each individual (therefore, it conceptually emulates the scenario in which each individual in the bulk data has been profiled with a single-cell resolution and then signals were aggregated in each cell population of the individual separately).

In this tutorial, we will analyze whole-blood methylation data from the Liu et al. study with rheumatoid arthritis (GEO accession ID GSE42861)2. We will apply TCA for detecting a known cell-type-specific association of RA with methylation in CD4 T-cells, and then we will estimate and inspect the CD4-specific methylation level of each individual in the associated site. Note that this tutorial does not include a formal definition and description of the TCA model; those can be found in the manual of the R package and in the TCA paper1.

Data and setup

Our analysis is going to include the following five steps:

  1. Estimating the cell-type proportions of the individuals in the data (will be required by TCA as an input)
  2. Extracting a subset of the methylation sites from the data (for speeding up the runtime of this tutorial)
  3. Fitting the TCA model
  4. Testing rheumatoid arthritis (RA) status for cell-type-specific associations with methylation
  5. Estimating cell-type-specific methylation levels

You will first need to download the following data files:

In the first two steps of the analysis, we will be using glint, a user-friendly command-line tool for fast analysis of genome-wide DNA methylation data3. The latest release of glint can be found here. glint will allow us to estimate cell-type proportions, which we will then provide to the R code as an input. In order to speed up the total runtime of this tutorial, we will also use glint for extracting only a small subset of the methylation sites from the full data matrix.

If you wish, you may skip the first two steps of the analysis, in which case you may ignore the full data file for now (‘GSE42861_processed.txt’) and download the results of the first two steps (from here under the directory ‘glint_output’):

Finally, before starting the analysis, make sure to install the TCA R package:

install.packages("TCA")

Step 1: Estimating cell-type proportions

TCA requires measurements or estimates of the cell-type proportions of the individuals in the data. As we later note, TCA allows to re-estimate the cell-type proportions; this can be particularly useful in cases where we can obtain only very noisy estimates of the cell-type proportions.

Here, we use glint for estimating leukocyte proportions for each of the individuals in the whole-blood data following the Houseman et al. reference-based algorithm4. We run the following in the terminal / command line:

python glint.py --datafile GSE42861_processed.txt --houseman --out GSE42861_processed

This results in a file named ‘GSE42861_processed.houseman_estimates.txt’, which includes the cell-type proportion estimates.

Notes

The Houseman et al. algorithm requires methylation reference for the studied tissue (i.e. methylation collected from sorted cells of the cell types composing the same tissue type under study). While glint allows to specify methylation reference for tissues other than blood (see glint’s documentation), such reference will not always be available. In the absence of reference data, a semi-supervised approach for estimating cell-type proportions can be taken; note that other reference-free methods currently do not provide cell-type proportion estimates but rather only linear combinations of the cell-type proportions5.

Step 2: Subsetting the data

In order to speed up the runtime of the following steps in this tutorial, we will extract and analyze only a small subset of the methylation sites in the data. Specifically, we will consider only the 500 sites listed in the file named ‘GSE42861_processed.subset.cpg_identifiers.txt’ - 499 of these sites were arbitrarily selected, and one more site was included by design (cg07839457). The latter was verified in the TCA paper to demonstrate a CD4-specific association with RA using sorted CD4 cells and will be used for validating our analysis.

In the terminal / command line run glint for subsetting the full data as follows:

python glint.py --datafile GSE42861_processed.txt --include GSE42861_processed.subset.cpg_identifiers.txt --out GSE42861_processed.subset --txtsave

This results in a file named ‘GSE42861_processed.subset.txt’, which includes the 500 pre-selected sites for the 658 individuals in the study. Note that glint will also output two additional files (‘GSE42861_processed.subset.samples.txt’ and ‘GSE42861_processed.subset.sites.txt’) - these can be ignored in our analysis.

Step 3: Fitting the TCA model

At this point we have obtained estimates of the cell-type proportions (‘GSE42861_processed.houseman_estimates.txt’) and a subset of the original data file (‘GSE42861_processed.subset.txt’). We next fit the TCA model to the subset of the data. Fitting the TCA model essentially means learning the parameters of the TCA model, which will then allow us to both perform association testing (using the function ‘tcareg’) and estimate cell-type-specific methylation levels (using the function ‘tensor’).

We start by loading into R the TCA package, the data matrix (X), and the cell-type proportion estimates (W):

library(TCA)
library("data.table")
X <- data.frame(fread(file = "GSE42861_processed.subset.txt"), row.names=1)
W <- data.frame(fread(file = "GSE42861_processed.houseman_estimates.txt"), row.names=1)

The analysis of DNA methylation and the interpretation of results may be misleading in the presence of factors that are correlated with changes in methylation. For that reason, TCA allows to specify design variables of the experiment (covariates) that will be accounted for.

TCA allows to account for covariates that may affect methylation biologically at a cell-type-specific level (such as age6) using the argument C1. In addition, TCA allows to consider covariates that may affect the observed mixtures rather than the cell-type-specific methylation levels (such as batch information) using the argument C2.

In our case, we will use age, sex, and smoking status as the C1 covariates, and batch information and 10 components that capture technical variation in the data as the C2 covariates:

covars <- data.frame(
  fread(file = "GSE42861_processed.covariates.txt"), row.names=1)
tech_var <- data.frame(
  fread(file = "GSE42861_processed.control_probes_pcs.txt"),
  row.names=1 ) # principal components calculated from control probes
C1 <- covars[,2:4]  # sex, age, and smoking status
C2 <- cbind(covars[,1],tech_var[,1:10]) # batch information and technical variation

Now, we can finally fit the TCA model by running the ‘tca’ function. We will do so by indicating that we wish ‘tca’ to use multiple cores (by setting parallel = TRUE; effective only when using a multi-core machine):

tca.mdl <- tca(X = X, W = W, C1 = C1, C2 = C2,
               parallel = TRUE, log_file = "GSE42861.tca.log")

After several minutes, ‘tca’ will save the estimated parameters of the TCA model into the list ‘tca.mdl’. In addition, the logs printed to the console will be saved into a file named ‘GSE42861.tca.log’. For a full description of the arguments and the returned values of the ‘tca’ function, including additional arguments and options not described above, please refer to the package manual.

Notes

Step 4: Looking for cell-type-specific associations with RA

TCA allows us to model a phenotype of interest as statistically related to a given methylation site through cell-type-specific effects. Under this framework, as described below, we can statistically test different hypotheses about the cell-type-specific effects. We refer to this framework as TCA regression (the ‘tcareg’ function).

We start by loading the rheumatoid arthritis (RA) phenotype.

y <- as.matrix(covars[,5])

Much like in a standard regression analysis, TCA regression allows us to account for covariates (the argument C3 in the ‘tcareg’ function). Notably, we have already taken into account covariates when fitting the TCA model (C1 and C2). However, these were used for modeling the variation in methylation levels. The covariates in C3, on the other hand, can be used to account for factors that may potentially be correlated with the variation of the phenotype.

In our case, we include in C3 the variables age, sex, and smoking, as those are known to be correlated with RA status:

C3 <- covars[,2:4]  # age, sex, and smoking status

In addition, since RA is known to be correlated with leukocyte composition in blood, we should consider the cell-type composition in C3. In principle, we can simply do this by including in C3 the cell-type proportions (W). However, here we will use the ReFACTor reference-free algorithm7 instead, as it was recently shown that in the context of methylation association studies it provides better correction for cell-type composition (i.e. better control for false positives) compared with reference-based estimates8.

ReFACTor provides a set of principal components that we can include in C3. Importantly, these components are expected to capture the variation in cell-type composition through linear combinations of the cell-type proportions. As a result, while these components can be used for correction (which is invariant to linear combinations), they cannot be used as absolute estimates of the cell-type proportions and therefore should not be given as an input to the function ‘tca’ (i.e. as the argument W). A more in-depth evaluation and discussion about the output of reference-free and reference-based algorithms is available elsewhere5.

We will use the ‘refactor’ function for calculating the ReFACTor components. This will require us to use the full data matrix (219,096 methylation sites by 568 individuals). If you wish to skip this step for convenience, you can simply download the results of ‘refactor’, which can be found in the file named ‘GSE42861_processed.refactor_components.txt’ under the directory ‘refactor_output’ together with the rest of the data files.

# if you wish to skip this step, simply load the pre-computed ReFACTor components
ref.scores <- data.frame(
  fread(file = "GSE42861_processed.refactor_components.txt"), row.names=1)

# otherwise, run the followings

# load the full data matrix and run refactor
X.full <- data.frame(
  fread(file = "GSE42861_processed.txt"), row.names=1)
ref <- refactor(X.full, k = 6, sparsity = 500,
                C = cbind(C1,C2), C.remove = TRUE,
                rand_svd = TRUE, log_file = "GSE42861.refactor.log")
ref.scores <- ref$scores

For a full description of the arguments and the returned values of the ‘refactor’ function, including additional arguments and options, please refer to the package manual.

Given the ReFACTor components, we can now construct the final C3, in which we also include the first 6 ReFACTor components8:

C3 <- cbind(C3,ref.scores)

Now that we have the covariates we wish to consider in the TCA regression, we should chose a test to perform. The function ‘tcareg’ allows us to test several different types of hypotheses (tuned by the argument ‘test’):

In our case, we wish to test the methylation sites in the data for associations with CD4 cells. In other words, we wish to perform marginal tests for associations with CD4. Since setting “test = ‘marginal’” will perform a marginal association test for each of the cell types, we can instead set “test = ‘custom’” and define a marginal association test for CD4 by setting ‘alternative_model’ to be CD4 cells and setting ‘null_model’ to be ‘NULL’. This formulation requires from ‘tcareg’ that for each methylation site only a single effect for CD4 will be estimated and tested, while assuming no effects in the rest of the cell types - this is exactly a marginal test.

We can now finally test for CD4-specific associations. In order to have a more informative view of the results, we indicate using the optional argument ‘features_metadata’ that ‘tcareg’ should include metadata about the methylation sites. Specifically, we will use the metadata file named ‘HumanMethylationSites.txt’, which is based on the Illumina manifest files and can be found here, and includes information such as chromosome numbers and positions for all the methylation sites in the Illumina human methylation platforms (27K, 450K and 850K/EPIC arrays).

cd4 <- colnames(W)[1] # the first column in W corresponds to CD4
res <- tcareg(X = X, tca.mdl = tca.mdl, y = y, C3 = C3,
              test = "custom", null_model = NULL, alternative_model = c(cd4),
              save_results = TRUE, output = "GSE42861.tcareg.CD4",
              log_file = "GSE42861.tcareg.log",
              features_metadata = "HumanMethylationSites.txt")

The ‘tcareg’ results will be saved into the list ‘res’. In addition, since we indicated ‘save_results = TRUE’, the results will be saved into a file named “GSE42861.tcareg.CD4.custom.txt” (note that the file name is indicated by the argument ‘output’). For a full list of the values returned by ‘tcareg’ and for additional arguments please refer to the package manual.

We can now look at the association results - we start by plotting a quantile-quantile (qq) plot of the p-values:

plot(-log10(seq(from = 1/500, to = 1, by = 1/500)), -log10(sort(res$pvals)),
     xlab = "Expected -log(p-val)", ylab = "Observed -log(p-val)")
abline(a=0, b=1); abline(h=-log10(0.05/500), col="red")

This will output the following plot, which shows a single significant hit:

Results of the CD4 marginal test presented by a qq-plot of the -log10(pvalues).

Results of the CD4 marginal test presented by a qq-plot of the -log10(pvalues).

Ranking the p-values shows that the association found is with the methylation site cg07839457.

hit.position <- order(res$pvals)[1] # 353
hit <- rownames(X)[hit.position] # cg07839457
hit.pval <- res$pvals[hit.position]

Notes

Step 5: Estimating cell-type-specific methylation

Finally, we will apply the function ‘tensor’ for estimating the cell-type-specific methylation levels of the associated site we found in the previous step. Note that this step is not necessary given that we already detected the association, however, we will use it for visualization purposes. ‘tensor’ requires as an input the parameters of the fitted TCA model for the sites for which we wish to get cell-type-specific estimates. Since we are interested in estimates for just one particular site, we will first use the function ‘tcasub’ for subsetting the estimated model parameters to include only information about the associated site.

# subset the estimates of the model parameters
mdl.tca.sub <- tcasub(tca.mdl, features = c(hit), log_file = "GSE42861.tcasub.log")

# estimate cell-type-specific methylation for the associated site
Z_hat <- tensor(X = X[hit,], tca.mdl.sub, log_file = "GSE42861.tensor.log")

This results in a list ‘Z_hat’ in which each element includes the estimated cell-type-specific methylation levels for one specific cell type (where the order of cell types is the same as the order of the cell types in W). This allows us, for example, to compare between RA cases and controls based on their estimated CD4-specific methylation levels in cg07839457:

cd4 <- Z_hat[[1]] # the first column in W is CD4, therefore take the first element in Z_hat
hit.cd4 <- cd4[1,]
d <- data.frame(hit.cd4, y)
boxplot(hit.cd4~y, data = d, main="CD4 methylation in cg07839457",
        names=c("controls","cases"), ylab="Estimated methylation")

This results in the following boxplots:

The distribution of the estimated CD4-specific cg07839457 methylation levels in RA cases and controls presented by boxplots.

The distribution of the estimated CD4-specific cg07839457 methylation levels in RA cases and controls presented by boxplots.

Notes


1: Rahmani et al. “Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology.” Nature Communications 2019.

2: Liu et al. “Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis.” Nature biotechnology. 2013 Feb;31(2):142.

3: Rahmani et al. “GLINT: a user-friendly toolset for the analysis of high-throughput DNA-methylation array data.” Bioinformatics. 2017 Feb 8;33(12):1870-2.

4: Houseman et al. “DNA methylation arrays as surrogate measures of cell mixture distribution.” BMC bioinformatics. 2012 Dec;13(1):86.

5: Rahmani et al. “BayesCCE: a Bayesian framework for estimating cell-type composition from DNA methylation without the need for methylation reference.” Genome biology. 2018 Dec;19(1):141.

6: Horvath “DNA methylation age of human tissues and cell types. Genome biology.” 2013 Oct;14(10):3156.

7: Rahmani et al. “Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies.” Nature methods. 2016 May;13(5):443.

8: Rahmani et al. “Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation.” Nature methods. 2017 Mar;14(3):218.