Predicting Functional Content

Stephen Woloszynek

2020-04-30

Currently, we provide two approaches for predicting functional content from 16S rRNA amplicon abundance data: PICRUSt, which utilizes GreenGenes 13.5 assigned OTUs, and Tax4Fun, which can handle Silva assignments.

PICRUSt

We’ll start by making functional predictions using PICRUSt on the Gevers et al. inflammatory bowel disease dataset. The dataset can be accessed by simply typing GEVERS, which is a list that contains an OTU table, a dataframe of metadata, and a taxonomy table. First, note that the OTU table has row names and column names that correspond to the metadata and taxonomy table, respectively. The column names matter most for functional prediction, since the algorithm looks for them when mapping the OTU table to functional annotations. For PICRUSt, these names have to be GreenGenes OTU ids, so they’ll be long integer codes.

library(themetagenomics)
#> Loading required package: Rcpp

GEVERS$OTU[1:5,1:5]
#>            4463892 4433823 288362 4381553 345362
#> SRR1567209     344       8      0       2     95
#> SRR1636130     448       1    517      43      4
#> SRR1212542     434    1239    185       3    332
#> SRR1566548       2      72    662    2263    713
#> SRR1212635     723    2479     33       3    199

These column names correspond with the row names in the taxonomy table:

GEVERS$TAX[1:5,1:3]
#>         Kingdom       Phylum              Class                   
#> 4463892 "k__Bacteria" "p__Bacteroidetes"  "c__Bacteroidia"        
#> 4433823 "k__Bacteria" "p__Bacteroidetes"  "c__Bacteroidia"        
#> 288362  "k__Bacteria" "p__Firmicutes"     "c__Clostridia"         
#> 4381553 "k__Bacteria" "p__Bacteroidetes"  "c__Bacteroidia"        
#> 345362  "k__Bacteria" "p__Proteobacteria" "c__Gammaproteobacteria"
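Because the mapping relies on these names, it can be worth confirming the correspondence before predicting. Below is a minimal sketch in base R that uses toy stand-ins built from the excerpts above. The object names otu, tax, and missing_ids are illustrative and not part of the package; with the real data, the same check would presumably be applied to GEVERS$OTU and GEVERS$TAX.

```r
# Toy stand-ins for GEVERS$OTU and GEVERS$TAX, copied from the excerpts above
otu <- matrix(c(344, 448, 8, 1, 0, 517), nrow=2,
              dimnames=list(c('SRR1567209','SRR1636130'),
                            c('4463892','4433823','288362')))
tax <- matrix(c('k__Bacteria','k__Bacteria','k__Bacteria',
                'p__Bacteroidetes','p__Bacteroidetes','p__Firmicutes'),
              nrow=3,
              dimnames=list(c('4463892','4433823','288362'),
                            c('Kingdom','Phylum')))

# OTU ids that appear in the abundance table but not in the taxonomy table
missing_ids <- setdiff(colnames(otu), rownames(tax))
length(missing_ids) == 0
#> [1] TRUE
```

An empty missing_ids means every OTU column has a matching taxonomy row.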

To run PICRUSt, we first have to download the reference files. We can choose between KO terms or COG terms, or we can simply download the entire set of files, which is what the download command defaults to. We’ll stick with KO terms for now, and we’ll download them to a temporary directory. It’s probably best for you to download them to a permanent location. If you would like to download these files manually, the repo can be found here: https://gitlab.com/sw1/themetagenomics_data/.

tmp <- tempdir()
download_ref(tmp,reference='gg_ko',overwrite=FALSE)

We now have our GreenGenes KO terms reference file in a temporary directory for easy access. Before we perform the actual prediction, we could manually normalize for OTU copy number via the cnn command, but the picrust function provides an argument for this, which makes our approach a little more streamlined. Now, we’ll run PICRUSt. Our implementation uses Rcpp, so it’s fast, but it behaves analogously to the Python scripts you may be familiar with.

system.time(FUNCTIONS <- picrust(GEVERS$OTU,rows_are_taxa=FALSE,
                                 reference='gg_ko',reference_path=tmp,
                                 cn_normalize=TRUE,sample_normalize=FALSE,
                                 drop=TRUE))
#>    user  system elapsed 
#>  33.912   0.392  34.326
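To make the two steps concrete, here is a minimal base R sketch of the general idea behind copy number normalization followed by prediction: each OTU count is divided by that OTU's 16S copy number, and the normalized abundances are then multiplied by a table of KO counts per OTU genome. All values and names below (otu, copy_number, ko_content, KO_1, KO_2) are made up for illustration. This is not the package's Rcpp implementation, only a sketch of the arithmetic under those assumptions.

```r
# Toy OTU table (samples x OTUs); counts are made up for illustration
otu <- matrix(c(10, 0, 4, 8), nrow=2,
              dimnames=list(c('s1','s2'), c('otuA','otuB')))

# Hypothetical 16S copy number for each OTU
copy_number <- c(otuA=2, otuB=4)

# Hypothetical KO content per OTU genome (OTUs x KO terms)
ko_content <- matrix(c(1, 0, 3, 2), nrow=2,
                     dimnames=list(c('otuA','otuB'), c('KO_1','KO_2')))

# Step 1: divide each OTU column by its 16S copy number
otu_cn <- sweep(otu, 2, copy_number[colnames(otu)], '/')

# Step 2: weight each OTU's KO content by its normalized abundance
fxn <- otu_cn %*% ko_content[colnames(otu), ]
fxn
```

Indexing copy_number and ko_content by colnames(otu) keeps the OTU order aligned across the three objects, which is the same reason the column names of the OTU table matter in the real function.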

The sample_normalize flag simply controls whether you want raw counts or relative abundances as your output. The output is another list with three elements: the function table that contains the KO term counts across samples, the KEGG metadata that describes the KO terms, and PICRUSt-specific metadata that has the NSTI quality control score for each OTU.

FUNCTIONS$fxn_table[1:5,1:5]
#>            K01361 K01362 K02249 K05844 K05845
#> SRR1567209      5   1498      0     31     55
#> SRR1636130      0    828      0      5     46
#> SRR1212542      0    815      0     66     65
#> SRR1566548      0   3791      0    169    158
#> SRR1212635     13   1037      0    760    187
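If you keep raw counts and later want relative abundances, the conversion can be done by hand. The sketch below uses the first two rows of the excerpt above and base R's prop.table. Note that the proportions here are computed over the five columns shown, not the full table, and that we are assuming a simple division by sample totals, which may differ in detail from what sample_normalize=TRUE does internally. The names fxn_counts and fxn_rel are illustrative.

```r
# Counts copied from the first two rows of the excerpt above
fxn_counts <- matrix(c(5, 0, 1498, 828, 0, 0, 31, 5, 55, 46), nrow=2,
                     dimnames=list(c('SRR1567209','SRR1636130'),
                                   c('K01361','K01362','K02249','K05844','K05845')))

# Divide each row (sample) by its total so that every row sums to 1
fxn_rel <- prop.table(fxn_counts, margin=1)
rowSums(fxn_rel)
```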

The function metadata is a list of lists:

names(FUNCTIONS$fxn_meta)
#> [1] "KEGG_Description" "KEGG_Pathways"

It contains the description for each KO term:

head(FUNCTIONS$fxn_meta$KEGG_Description)
#> $K01361
#> $K01361[[1]]
#> [1] "lactocepin [ec:3.4.21.96]"
#> 
#> 
#> $K01362
#> $K01362[[1]]
#> [1] "none"
#> 
#> 
#> $K02249
#> $K02249[[1]]
#> [1] "competence protein comgg"
#> 
#> 
#> $K05844
#> $K05844[[1]]
#> [1] "ribosomal protein s6 modification protein"
#> 
#> 
#> $K05845
#> $K05845[[1]]
#> [1] "osmoprotectant transport system substrate-binding protein"
#> 
#> 
#> $K05846
#> $K05846[[1]]
#> [1] "osmoprotectant transport system permease protein"

It also contains the hierarchy information for each KO term:

head(FUNCTIONS$fxn_meta$KEGG_Pathways)
#> $K01361
#> $K01361[[1]]
#> [1] "genetic information processing"   "folding, sorting and degradation"
#> [3] "chaperones and folding catalysts"
#> 
#> $K01361[[2]]
#> [1] "metabolism"      "enzyme families" "peptidases"     
#> 
#> 
#> $K01362
#> $K01362[[1]]
#> [1] "unclassified"          "metabolism"            "amino acid metabolism"
#> 
#> 
#> $K02249
#> $K02249[[1]]
#> [1] "environmental information processing"
#> [2] "membrane transport"                  
#> [3] "secretion system"                    
#> 
#> 
#> $K05844
#> $K05844[[1]]
#> [1] "genetic information processing" "translation"                   
#> [3] "ribosome biogenesis"           
#> 
#> 
#> $K05845
#> $K05845[[1]]
#> [1] "environmental information processing"
#> [2] "membrane transport"                  
#> [3] "abc transporters"                    
#> 
#> $K05845[[2]]
#> [1] "environmental information processing"
#> [2] "membrane transport"                  
#> [3] "transporters"                        
#> 
#> 
#> $K05846
#> $K05846[[1]]
#> [1] "environmental information processing"
#> [2] "membrane transport"                  
#> [3] "abc transporters"                    
#> 
#> $K05846[[2]]
#> [1] "environmental information processing"
#> [2] "membrane transport"                  
#> [3] "transporters"

Each element in these lists is named, so they’re easily accessible. For example, say we wanted information on K05846:

FUNCTIONS$fxn_meta$KEGG_Description['K05846']
#> $K05846
#> $K05846[[1]]
#> [1] "osmoprotectant transport system permease protein"
FUNCTIONS$fxn_meta$KEGG_Pathways['K05846']
#> $K05846
#> $K05846[[1]]
#> [1] "environmental information processing"
#> [2] "membrane transport"                  
#> [3] "abc transporters"                    
#> 
#> $K05846[[2]]
#> [1] "environmental information processing"
#> [2] "membrane transport"                  
#> [3] "transporters"

The hierarchy information is ordered based on its depth, so environmental information processing is the most general level, whereas abc transporters is the most specific.
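Since each KO term can belong to more than one pathway, it is often handy to pull out a single level across all of its pathways. Here is a small base R sketch that uses a toy copy of the K05846 entry printed above. The names kegg_pathways, top_level, and deepest are illustrative; with real output, the list would be FUNCTIONS$fxn_meta$KEGG_Pathways.

```r
# Toy copy of the K05846 entry printed above
kegg_pathways <- list(
  K05846=list(c('environmental information processing',
                'membrane transport','abc transporters'),
              c('environmental information processing',
                'membrane transport','transporters')))

# Single brackets keep the outer list; double brackets return the entry itself
length(kegg_pathways['K05846'])    # 1: a list holding the K05846 entry
length(kegg_pathways[['K05846']])  # 2: the two pathways for K05846

# Most general (first) and most specific (last) label of each pathway
top_level <- unique(sapply(kegg_pathways[['K05846']], function(p) p[1]))
deepest <- sapply(kegg_pathways[['K05846']], function(p) p[length(p)])
deepest
#> [1] "abc transporters" "transporters"
```

Using p[length(p)] rather than a fixed index is a small precaution in case some pathways have a different depth.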

Tax4Fun

If we used Silva assignments instead of GreenGenes, we can no longer use PICRUSt. Instead, we can take advantage of Tax4Fun. For this, we’ll use the David et al. time series dataset, which is accessible via the DAVID command. Like GEVERS, DAVID is a list containing an abundance table, a metadata dataframe, and a taxonomy table. Note, however, that we are no longer working with OTU ids, since this table was created via the DADA2 pipeline.

DAVID$ABUND[1:5,1:5]
#>           00001 00002 00003 00004 00005
#> ERR531441  7232 44702     0  4562 17174
#> ERR531442  1807     0   756  5386  3899
#> ERR531443 20396 19382     0  9155 10203
#> ERR531444 14325   686     0   394 11702
#> ERR531445  3534  6751 13093  4323  3533

The column names are arbitrary codes that represent the final sequences from the DADA2 error model. Of more use to us is the difference in the taxonomy names:

DAVID$TAX[1:5,1:3]
#>       Kingdom    Phylum          Class        
#> 00001 "Bacteria" "Bacteroidetes" "Bacteroidia"
#> 00002 "Bacteria" "Firmicutes"    "Clostridia" 
#> 00003 "Bacteria" "Bacteroidetes" "Bacteroidia"
#> 00004 "Bacteria" "Firmicutes"    "Clostridia" 
#> 00005 "Bacteria" "Bacteroidetes" "Bacteroidia"

Note that they no longer contain the taxonomy prefixes we saw in the GreenGenes assignments. To generate predictions, we again need to download reference files. We’ll use the same command, but change the reference argument to 'silva_ko'.

tmp <- tempdir()
download_ref(tmp,reference='silva_ko',overwrite=FALSE)
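The prefix difference noted above gives a quick way to tell which kind of taxonomy table you are holding before choosing a method. The following is a heuristic sketch in base R, not a function from the package: has_gg_prefix is a hypothetical helper, and the two vectors are copied from the Phylum columns shown earlier.

```r
# Phylum labels copied from the GEVERS$TAX and DAVID$TAX excerpts
gg_phylum <- c('p__Bacteroidetes','p__Firmicutes','p__Proteobacteria')
silva_phylum <- c('Bacteroidetes','Firmicutes')

# Hypothetical helper: TRUE for labels that start with a GreenGenes-style
# rank prefix such as k__, p__, or c__
has_gg_prefix <- function(x) grepl('^[kpcofgs]__', x)

all(has_gg_prefix(gg_phylum))
#> [1] TRUE
any(has_gg_prefix(silva_phylum))
#> [1] FALSE
```

Keep in mind that this only inspects the label format; it does not verify that the assignments actually came from a particular database.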

Now, we’ll use the t4f command, which takes quite a few Tax4Fun-specific arguments. We can choose the protein domain classification method that was used when generating the references (UPROC or PAUDA) and whether we’d prefer to use short or long read references. Unlike PICRUSt, copy number normalization occurs with respect to the KO terms and not the OTUs, but the argument is the same as in the picrust command. Sample normalization also occurs during the mapping step and not after predictions are made, so keep this in mind when deciding whether to sample normalize.

system.time(FUNCTIONS <- t4f(DAVID$ABUND,rows_are_taxa=FALSE,tax_table=DAVID$TAX,
                             reference_path=tmp,type='uproc',short=TRUE,
                             cn_normalize=TRUE,sample_normalize=TRUE,drop=TRUE))
#>    user  system elapsed 
#>  10.892   0.824  11.728

The output is analogous to what we saw when using PICRUSt except for the method metadata. For PICRUSt, this contained the NSTI quality score; for Tax4Fun, on the other hand, this contains the FTU scores, the fraction of OTUs that weren’t mapped to KO terms.
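To give a feel for what such a score measures, here is a rough base R sketch of one plausible reading: for each sample, the share of reads that belong to taxa without a reference match. The mapped flags are entirely hypothetical, the counts come from the first two rows and three columns of the DAVID$ABUND excerpt, and the package may compute its FTU values differently.

```r
# Counts copied from the DAVID$ABUND excerpt (first two rows, three columns)
abund <- matrix(c(7232, 1807, 44702, 0, 0, 756), nrow=2,
                dimnames=list(c('ERR531441','ERR531442'),
                              c('00001','00002','00003')))

# Hypothetical flags: TRUE if the taxon matched a reference profile
mapped <- c('00001'=TRUE, '00002'=TRUE, '00003'=FALSE)

# Share of each sample's reads that come from unmapped taxa
unexplained <- rowSums(abund[, !mapped, drop=FALSE]) / rowSums(abund)
unexplained
```

A value near 0 would suggest that most of a sample's reads contributed to the prediction, whereas a value near 1 would suggest the prediction rests on only a small part of the community.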