Getting started with wrProteo

Wolfgang Raffelsberger

2020-07-17

Introduction

This package contains a collection of various tools for Proteomics used at the proteomics platform of the IGBMC. To get started, we need to load the packages “wrMisc” and this package (wrProteo), both are available from CRAN. The packages wrGraph and RColorBrewer get used internally by some of the functions from this package for (optional/improved) figures. Furthermore, the Bioconductor package limma will be used internally for statistical testing

library("wrProteo")
# This is wrProteo version no :
packageVersion("wrProteo")
#> [1] '1.1.3'

Calculating Molecular Masses From Composition Formulas

Please note that molecular masses may be given in two flavours : Monoisotopic mass and average mass. Please refer eg to Wikipedia: monoisotopic mass for details.

Molecular masses based on (summed) chemical formulas

At this level (summed) atomic compositions are evaluated. Here, the number of atoms has to be written before the atom. Thus, ‘2C’ means two atoms carbon. Empty or invalid entries will be by default returned as mass=0.0, a message will comment such issues.

The mass of an electron can be assigned using ‘e’.

massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN", " ", "e"))
#>  -> massDeFormula :  can't find ' ' .. setting to 0 mass
#>       12H12O         1H1O  2H1Se1e6C2N  1H1Se1e1C1N                        1e 
#> 2.040329e+02 1.700274e+01 1.819389e+02 1.069280e+02 0.000000e+00 5.485799e-04
# Note, that empty/invalid entries will be returned as a mass of 0.0 .

# Ignore empty/invalid entries
massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN"), rmEmpty=TRUE)
#>      12H12O        1H1O 2H1Se1e6C2N 1H1Se1e1C1N 
#>   204.03288    17.00274   181.93887   106.92797

Using the argument massTy one can switch this and similar functions from this package from default monoisotopic mass to average mass mode :

massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN"), massTy="aver")
#>      12H12O        1H1O 2H1Se1e6C2N 1H1Se1e1C1N 
#>   204.08808    17.00734   181.05403   105.98589

Molecular masses based on amino-acid sequence

The mass of these amino-acids can be used:

AAmass()
#>         A         R         N         D         C         E         Q         G 
#>  71.03711 156.10111 114.04293 115.02694 103.00918 129.04259 128.05858  57.02146 
#>         H         I         L         K         M         F         P         S 
#> 137.05891 113.08406 113.08406 128.09496 131.04048 147.06841  97.05276  87.03203 
#>         T         W         Y         V 
#> 101.04768 186.07931 163.06333  99.06841

Here the one-letter amino-acid code is used to descibre peptides or proteins.

## mass of peptide (or protein)
pep1 <- c(aa="AAAA",de="DEFDEF")
convAASeq2mass(pep1, seqN=FALSE)
#>       aa       de 
#> 302.1590 800.2865

Reading Fasta Files (from Uniprot)

This package contains a parser for Fasta-files allowing to separate different fields of meta-data like IDs, Name and Species. Here we will read a tiny example fasta-file obtained from a collection of typical contaminants in proteomics.

path1 <- system.file('extdata', package='wrProteo')
fiNa <-  "conta1.fasta"
## basic reading of Fasta
fasta1 <- readFasta2(file.path(path1,fiNa))
str(fasta1)
#>  Named chr [1:35] "FPTDDDDKIVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTLDNDIMLIKLSSPATLNSRVATV"| __truncated__ ...
#>  - attr(*, "names")= chr [1:35] "TRYP_PIG TRYPSIN PRECURSOR" "TRY1_BOVIN Cationic trypsin (Fragment) - Bos taurus (Bovine)" "CTRA_BOVIN Chymotrypsinogen A - Bos taurus (Bovine)" "CTRB_BOVIN Chymotrypsinogen B - Bos taurus (Bovine)" ...

## now let's read and further separate annotation-fields
fasta2 <- readFasta2(file.path(path1,fiNa), tableOut=TRUE)
str(fasta2)
#>  chr [1:35, 1:9] "generic" "generic" "generic" "generic" "generic" ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:9] "database" "uniqueIdentifier" "entryName" "proteinName" ...

Annotation

In most ‘Omics’ acitivities getting additional annotation may get a bit tricky. In Proteomics most mass-spectrometry software will use the informaton provided in the Fasta-file as annotation (as provided from UniProt). But this lacks for example chromosomal location information. There are are many repositories with genome-, gene- and protein-annotation and most of them are linked, but sometimes the links get broken when data-base updates are not done everywhere or are not followed by new re-matching. The Fasta-files used initially for mass-spectrometry peak-identification may be (slightly) not up to date (sometimes gene-or protein IDs do change or may even disappear) and thus will contribute to a certain percentage of entries hard to link.

In the context of adding chromosomal annotation to a list of proteins here the following concept is developed : Annotation-tables from UCSC are available for a good number of species and can be downloaded for conventient off-line search. Howwever, in the context of less common species we realized that the UniProt tables from UCSC had many times low yield in final matching. For this reason we propose the slightly more complicated route that provided finally a much higher success-rate to find chromosomal locations for a list of UniProt IDs. First one needs to download from UCSC the table corresponding to the species of question fields ‘clade’,‘genome’,‘assembly’). For ‘group’ choose ‘Genes and Gene Predictions’ and for ‘track’ choose ‘Ensembl Genes’, as table choose ‘ensGene’. In addition, it is possible to select either the entire genome-annotation or user-specified regions. In terms of ‘output-format’ one may choose ‘GTF’ (slightly more condensed, no headers) or ‘all filds from selected table’.

The strategy presented here :

Locate & download organism annotation from UCSC, read into R (readUCSCtable()) -> from R export (non-redundant) ‘enst’-IDs (still readUCSCtable() ), get corresponding UniProt-IDs at UniProt site, save as file and import result into R (readUniProtExport() ) -> (in R) combine with initial UCSC table (readUniProtExport() )

The function readUCSCtable() is able to read such files downloaded from UCSC, compressed .gz files can be read, too. In the example below we’ll just look at chromosome 11 of the human genome - to keep files small.

path1 <- system.file("extdata", package="wrProteo")
gtfFi <- file.path(path1, "UCSC_hg38_chr11extr.gtf")
UcscAnnot1 <- readUCSCtable(gtfFi)
#>  -> readUCSCtable :  File 'UCSC_hg38_chr11extr.gtf' is gtf : TRUE
#>  -> readUCSCtable :  simplifed from 2000 to 223 non-redundant gene_id

# The Ensemble transcript identifyers and their chromosomal locations :
head(UcscAnnot1)
#>      gene_id              chr     start    end      strand frame
#> [1,] "ENST00000270115.8"  "chr11" "537527" "554912" "+"    "."  
#> [2,] "ENST00000308020.6"  "chr11" "450279" "491399" "+"    "."  
#> [3,] "ENST00000311189.8"  "chr11" "532242" "535576" "-"    "."  
#> [4,] "ENST00000312165.5"  "chr11" "278570" "285304" "+"    "."  
#> [5,] "ENST00000325113.8"  "chr11" "196738" "200258" "+"    "."  
#> [6,] "ENST00000325147.13" "chr11" "202924" "207382" "-"    "."

However, this annotation does not provide protein IDs. In order to obtain the corresponding protein IDs an additional step is required : Here we will use the batch seach/conversion tool from UniProt. In order to do so, we can export directly from readUCSCtable() a small text-file which can the be fed into the UniProt batch-search tool.

# Here we'll read the UCSC table and immediatley write the file for UniProt conversion 
#  (here to tempdir() to keep things tidy)
expFi <- file.path(tempdir(),"deUcscForUniProt2.txt")
UcscAnnot1 <- readUCSCtable(gtfFi, exportFileNa=expFi)
#>  -> readUCSCtable :  File 'UCSC_hg38_chr11extr.gtf' is gtf : TRUE
#>  -> readUCSCtable :  simplifed from 2000 to 223 non-redundant gene_id
#>  -> readUCSCtable :  Write to file : ENST00000270115, ENST00000308020, ENST00000311189, ENST00000312165 ...
#>  -> readUCSCtable :  Exporting file  'C:/Users/wraff/AppData/Local/Temp/RtmpKk1Gi1/deUcscForUniProt2.txt'  for conversion on https://www.uniprot.org/uploadlists

Now everything is ready to go to UniProt for retreiving the corresponding UniProt-IDs. Since we exported Ensemble transcript IDs (ENSTxxx), select converting from ‘Ensembl Transcript’ to ‘UniProtKB’. Then, when downloading the conversion results, choose tab-separated file format (compression is recommended), this may take several seconds (depening on the size).

It is suggested to rename the downloaded file so one can easily understand it’s content. Note, that the function readUniProtExport() can also read .gz compressed files. To continue this vignette we’ll use a result which has been downloaded from UniProt and renamed to ‘deUniProt_hg38chr11extr.tab’. One may also optionally define a specific genomic region of interest using the argument ‘targRegion’, here the entire chromosome 11 was chosen.

deUniProtFi <- file.path(path1, "deUniProt_hg38chr11extr.tab")
deUniPr1 <- readUniProtExport(deUniProtFi, deUcsc=UcscAnnot1, targRegion="chr11:1-135,086,622")
#>  -> readUniProtExport :  low yield matching 'enst00000410108', 'enst00000342878' and 'enst00000325113,enst00000525282' and 'ENST00000270115', 'ENST00000308020' and 'ENST00000311189' convert all to lower case and remove version numbers ('xxx.2') for better matching
#>  -> readUniProtExport :  intergrating genomic information for 88 entries (14 not found)
#>  -> readUniProtExport :  88 IDs in output
str(deUniPr1)
#> 'data.frame':    88 obs. of  14 variables:
#>  $ EnsTraID     : chr  "enst00000410108" "enst00000342878" "enst00000325113,enst00000525282" "enst00000342593" ...
#>  $ UniProtID    : chr  "B8ZZS0" "Q8TD33" "Q96PU9" "F8W6Z3" ...
#>  $ Entry.name   : chr  "B8ZZS0_HUMAN" "SG1C1_HUMAN" "ODF3A_HUMAN" "F8W6Z3_HUMAN" ...
#>  $ Status       : chr  "unreviewed" "reviewed" "reviewed" "unreviewed" ...
#>  $ Protein.names: chr  "BET1-like protein" "Secretoglobin family 1C member 1 (Secretoglobin RYD5)" "Outer dense fiber protein 3 (Outer dense fiber of sperm tails protein 3) (Sperm tail protein SHIPPO 1) (Transcr"| __truncated__ "Outer dense fiber protein 3" ...
#>  $ Gene.names   : chr  "BET1L" "SCGB1C1" "ODF3 SHIPPO1 TISP50" "ODF3" ...
#>  $ Length       : int  152 95 254 127 102 111 92 58 88 148 ...
#>  $ chr          : chr  "chr11" "chr11" NA "chr11" ...
#>  $ start        : num  167784 193078 NA 197303 202924 ...
#>  $ end          : num  207383 194575 NA 200033 207382 ...
#>  $ strand       : chr  "-" "+" NA "+" ...
#>  $ frame        : chr  "." "." NA "." ...
#>  $ avPos        : num  187584 193826 NA 198668 205153 ...
#>  $ inTarg       : logi  FALSE FALSE NA FALSE FALSE FALSE ...

The resulting data.frame (ie the column ‘UniProtID’) may be used to complement protein annotation after importing mass-spectrometry peak- and protein-identification results. Obviously, using recent Fasta-files from UniProt for protein-identification will typically give better matching at the end.

You may note that sometimes Ensemble transcript IDs are written as ‘enst00000410108’ wheras at other places it may be written as ‘ENST00000410108.5’. The function readUniProtExport() switches to a more flexible search mode stripping of version-numbers and reading all as lower-caps, if initial direct matching reveils less than 4 hits.

Finally, it should be added, that of course other ways of retreiving annotation exist, in particular using the annotation-packages provided by Bioconductor.

Importing and Treating Data From Quantitative Proteomics

This package provides support for importing quantitation results from Proteome Discoverer, MaxQuant and Proline. All quantitation import functions offer special features for separating species for optional different treatment. Normalization aims to eliminate/reduce variablity introduced to the data not linked to the original biological question. Of course, a well designed experiment with sufficent replicates is crucial for performing efficient normalization.
Technical replicates are very frequently found in proteomics, they allow to asses the influence of repeated injection of the same material. Biological replicates allow interpreting experiments in a more general way. Multiple options for normalization are available in the package wrMisc and can be called when importing data. Global median or global mean normalization have proved to provide robust results in numerous settings. However, they depend on the hypothesis that there are no global changes. This implies that the number of proteins changing abundance in a strong way should be small.

Typical contaminants in proteomics like keratins are very abudant and thus may influence global mean normalization. The import functions from this package let’s the user to choose if all proteins or a subset should be considered when determining normalization factors. For example, you may want to exclude contaminants from normalization since their dyamics may reflect other cofactors than the biological question. In the case of benchmark-tests it is very common to different species, like the human spike-in set UPS-1.
In such cases you may orient normalization on a primary species to avoid having high concentrations of spike-in standards influencing the normalization.

All import functions generate lists separating (selected) annotation data (annot) from normalized log2-quantitation data (quant) and initial quantitation (raw).

Read data from Proteome Discoverer

Proteome Discoverer is commercial software from ThermoFisher. In Proteome Discoverer quantitation data on level of consensus-proteins should be exported to tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.

path1 <- system.file("extdata", package="wrProteo")
fiNaPd <- "exampleProtDiscov1.txt"
dataPD <- readPDExport(file=fiNaPd, path=path1)

## a summary of the quantitation data
summary(dataPD$quant)
#>      S1rep1          S1rep2          S1rep3          S2rep1     
#>  Min.   :17.63   Min.   :17.49   Min.   :18.37   Min.   :17.15  
#>  1st Qu.:20.18   1st Qu.:20.23   1st Qu.:20.03   1st Qu.:20.27  
#>  Median :21.72   Median :21.72   Median :21.72   Median :21.72  
#>  Mean   :21.75   Mean   :21.83   Mean   :21.71   Mean   :21.90  
#>  3rd Qu.:23.06   3rd Qu.:23.19   3rd Qu.:23.02   3rd Qu.:23.29  
#>  Max.   :28.16   Max.   :28.29   Max.   :28.16   Max.   :28.70  
#>  NA's   :1       NA's   :3       NA's   :1       NA's   :2      
#>      S2rep2          S2rep3     
#>  Min.   :17.58   Min.   :17.14  
#>  1st Qu.:20.23   1st Qu.:20.09  
#>  Median :21.72   Median :21.72  
#>  Mean   :21.86   Mean   :21.78  
#>  3rd Qu.:23.26   3rd Qu.:23.13  
#>  Max.   :28.65   Max.   :28.58  
#>                  NA's   :2

Read data from MaxQuant

MaxQuant is free software provided by the Max-Planck-Insutute, see Tyanova et al 2016 (doi: 10.1093/bioinformatics/btaa118). Typically MaxQuant exports by default quantitation data on level of consensus-proteins as a folder called txt with a file called proteinGroups.txt . So in a standard case one needs only to provide the path to this file. However, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate, and thus it has gotten a different name.

path1 <- system.file("extdata", package="wrProteo")
fiNaMa <- "proteinGroupsMaxQuantUps1.txt"
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST", spike="HUMAN_UPS")
dataMQ <- readMaxQuantFile(path1, file=fiNaMa, specPref=specPref1)
#>    by species : conta: 9  mainSpe: 245  spike: 48

## a summary of the quantitation data
summary(dataMQ$quant)
#>    25000am.1       25000am.2       25000am.3        2500am.1    
#>  Min.   :20.95   Min.   :21.22   Min.   :20.96   Min.   :19.52  
#>  1st Qu.:27.01   1st Qu.:27.01   1st Qu.:26.97   1st Qu.:26.96  
#>  Median :27.66   Median :27.66   Median :27.66   Median :27.66  
#>  Mean   :27.74   Mean   :27.74   Mean   :27.74   Mean   :27.33  
#>  3rd Qu.:28.44   3rd Qu.:28.44   3rd Qu.:28.48   3rd Qu.:28.48  
#>  Max.   :33.20   Max.   :33.19   Max.   :33.11   Max.   :33.10  
#>  NA's   :2       NA's   :2       NA's   :3       NA's   :10     
#>     2500am.2        2500am.3        250am.1         250am.2     
#>  Min.   :18.69   Min.   :19.53   Min.   :18.64   Min.   :17.98  
#>  1st Qu.:26.99   1st Qu.:26.91   1st Qu.:27.08   1st Qu.:27.08  
#>  Median :27.66   Median :27.66   Median :27.66   Median :27.66  
#>  Mean   :27.32   Mean   :27.29   Mean   :27.70   Mean   :27.66  
#>  3rd Qu.:28.49   3rd Qu.:28.48   3rd Qu.:28.53   3rd Qu.:28.54  
#>  Max.   :33.22   Max.   :33.16   Max.   :33.11   Max.   :33.13  
#>  NA's   :9       NA's   :10      NA's   :39      NA's   :38     
#>     250am.3     
#>  Min.   :18.75  
#>  1st Qu.:27.06  
#>  Median :27.66  
#>  Mean   :27.71  
#>  3rd Qu.:28.47  
#>  Max.   :33.08  
#>  NA's   :41

Read data from Proline

Proline is free software provided by the Profi-consortium, see Bouyssié et al 2020 (doi: 10.1038/nprot.2016.136). In Proline quantitation data on level of consensus-proteins can be exported to csv or tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.

path1 <- system.file("extdata", package="wrProteo")
fiNaPl <- "exampleProlineABC.csv"
dataPL <- readProlineFile(file.path(path1,fiNaPl))
#> -> readProlineFile -> extrColsDeX :  Can't find column(s) 'protein_set.score' in annotCol

## a summary of the quantitation data
summary(dataPL$quant)
#>       A_01            A_02            A_03            A_04      
#>  Min.   :14.26   Min.   :14.09   Min.   :14.10   Min.   :14.35  
#>  1st Qu.:19.26   1st Qu.:19.26   1st Qu.:19.22   1st Qu.:19.37  
#>  Median :20.80   Median :20.67   Median :20.72   Median :20.65  
#>  Mean   :20.98   Mean   :20.95   Mean   :20.93   Mean   :21.00  
#>  3rd Qu.:22.50   3rd Qu.:22.46   3rd Qu.:22.54   3rd Qu.:22.48  
#>  Max.   :28.66   Max.   :28.62   Max.   :28.61   Max.   :28.73  
#>  NA's   :29      NA's   :27      NA's   :21      NA's   :27     
#>       B_01            B_02            B_03            B_04      
#>  Min.   :15.78   Min.   :12.15   Min.   :15.32   Min.   :13.37  
#>  1st Qu.:18.83   1st Qu.:18.90   1st Qu.:18.78   1st Qu.:18.61  
#>  Median :20.30   Median :20.22   Median :20.26   Median :20.17  
#>  Mean   :20.43   Mean   :20.42   Mean   :20.42   Mean   :20.33  
#>  3rd Qu.:22.03   3rd Qu.:22.08   3rd Qu.:22.00   3rd Qu.:21.94  
#>  Max.   :27.61   Max.   :27.74   Max.   :27.75   Max.   :27.75  
#>  NA's   :77      NA's   :77      NA's   :76      NA's   :73     
#>       C_01            C_02            C_03            C_04      
#>  Min.   :14.44   Min.   :14.35   Min.   :14.69   Min.   :13.84  
#>  1st Qu.:19.31   1st Qu.:19.32   1st Qu.:19.18   1st Qu.:19.41  
#>  Median :20.92   Median :20.85   Median :20.77   Median :20.80  
#>  Mean   :21.08   Mean   :21.06   Mean   :21.03   Mean   :21.10  
#>  3rd Qu.:22.65   3rd Qu.:22.63   3rd Qu.:22.66   3rd Qu.:22.65  
#>  Max.   :28.85   Max.   :28.84   Max.   :28.83   Max.   :28.83  
#>  NA's   :22      NA's   :18      NA's   :20      NA's   :21

Normalization

As mentioned, the aim of normalization is to remove bias in data not linked to the original (biological) question. The import functions presented above do already by default global median normalization. In overal it is important to inspect results from normalization, graphical display of histograms, boxplots or violin-plots usually work well to compare distributions. Multiple options exist for normalizing data, for example the function normalizeThis from the package wrMisc can be used to run additional normalization.

Different normalization procedures intervene with different ‘aggressiveness’, ie capacity of deformining the initial data. In general we suggest to start normalizing using ‘milder’ procedures, like global median and switch to more intervening methods if initial results seem not satisfactory. Beware, heavy normalization procedures may also alter the main information you want to analyze. Ie, some biologically true positive changes may start to fade or dissapear when inappropriate normalization gets performed. Please note, that normalization should be performed before NA-inputation to avoid introducing new bias in the group of imputed values.

Imputation of NA-values

In Proteomics the quantitation of very low abundances is very challenging and typically absent or very low abundances appear in the results as 0 or NA. Typically this may be liked to the fact that no peak is detected in a MS-1 chromatogram (for a given LC elution-time) where other samples had a strong peak with succesful MS-2 identification. Data teated with MaxQuant have typically a higher degree of NA values. To simplify the treatment all 0 values are transformed to NA, anyway they would not allow log2 transformation either.

Before replacing NA-values it is important to verify that such values may be associated to absent or very low abundances. To do so, we suggest to inspect groups of replicate-measurments. In particular with multiple technical replicates of the same sample it is supposed that any variability observed is not linked to the sample itself. So for each NA that occurs in the data we suggest to look what was reported for the same protein with the other (technical) replicates. This brings us to the term of ‘NA-neighbours’ (quantifications for the same protein in replicates). When drawing histograms of NA-neighbours one can visually inspect and verify that NA-neighbours are typically low abundance values, however, but not necessarily the lowest values observed in the entire data-set.

This package proposes two related strategies for NA-imputation. First, the classical imputation of NA-values using Normal distributed random data is presented. Then, a more elaborate version based on repeated implementations to obtain more robust results is presented.

## Let's inspect NA values as graphic
matrixNAinspect(dataPD$quant, gr=gl(2,3), tit="Tiny example data from ProteomeDiscoverer") 

## Let's inspect NA values as graphic
matrixNAinspect(dataMQ$quant, gr=gl(3,3), tit="Tiny example data from MaxQuant") 

## Let's inspect NA values as graphic
matrixNAinspect(dataPL$quant, gr=as.factor(substr(colnames(dataPL$quant),1,1)),
  tit="Tiny example data from Proline") 

So only if the hypothesis of NA-neighbours as typically low abundance values gets confirmed by visual inspection of the histograms, one may proceed to replacing them by low random values. If one uses a unique (very) low value for NA-replacements, this will easily pose a problem when t-tests should be employed to look for proteins changing abundance between two or more groups of samples. Therefore it is common practice to draw random values from a Normal distribution representing this lower end of abundance values. Nevertheless, the choice of the parameters of this Normal distribution is very delicate.

The functions proposed here offer automatic selection of these parameters, which have been tested in a number of different projects. However, this choice should be checked by critically inpecting the histograms of ‘NA-neighbours’ (ie successful quantitations in other replicate samples of the same protein) and the final resulting distribution. Initially all NA-neighbours are extracted. It is also worth mentioning that in the majority of data-sets we encountered, such NA-neighbours form skewed distributions. Occasionally one may also find instances where 2 out of 3 (or more replicates) are NA. The successful quantitations of such instances with 2 NA-values per group are considered even more representative, but of course much less observed values remain. Thus a primary choice is made: If the selection of (min) 2 NA-values per group has more than 300 values, this distribution will be used as base to model the distribution for drawing random values. In this case, by default the 0.18 quantile of the 2 NA-neighbour distribution will be used as mean for the new Normal distribution used for NA-replacements. If the number of 2 NA-neighbours is >= 300, (by default) the 0.1 quantile all NA-neighbour values will used. Of course, the user has also the possibility to use custom choices for these parameters.

The final replacement is done on all NA values. This also includes proteins with are all NA in a given condition as well a instances of mixed successful qunatitation and NA values.

## ProteomeDiscoverer
dataPD$NArepl <- matrixNAneighbourImpute(dataPD$quant,gr=gl(2,3),
  tit="Tiny example from ProteomeDiscoverer") 
#>  -> matrixNAneighbourImpute :  n.woNA= 1785  n.NA = 9
#>     model 10 %-tile of (min 1 NA/grp) 10 NA-neighbour values
#>     imputation: mean= 18.1   sd= 0.56

## MaxQuant
dataMQ$NArepl <- matrixNAneighbourImpute(dataMQ$quant,gr=gl(3,3),tit="Tiny example from MaxQuant") 
#>  -> matrixNAneighbourImpute :  n.woNA= 2564  n.NA = 154
#>     model 10 %-tile of (min 1 NA/grp) 36 NA-neighbour values
#>     imputation: mean= 19   sd= 0.74

## Proline
dataPL$NArepl <- matrixNAneighbourImpute(dataPL$quant,gr=gl(3,4),tit="Tiny example from Proline") 
#>  -> matrixNAneighbourImpute :  n.woNA= 4012  n.NA = 488
#>     model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values
#>     imputation: mean= 16.4   sd= 0.62

However, imputing using normal distributed random data also brings the risk of occasional extreme values. In the extreme case it may happen that a given protein is all NA in one group, and by chance the random values turn out be rather high. Then the final group mean of imputed values may be even higher than the mean of another group with successfull quantitations. Of course in this case it would be a bad interpretation to consider the protein in question upregulated in the sample where initially all values were NA. To circumvent this problem there are 2 options : Firstly, one may use special filtering schemes to exclude such constellations from final results or secondly, one could repeat replacement of NA-values numerous times.

This type of filtering can be performed using presenceFilt (package wrMisc).

The function testRobustToNAimputation() allows also such repeated replacement of NA-values. See also the following section.

Statistical Testing

Statistical test ing in the context of proteomics data poses challanges similar to transcriptomics : Many times the number of replicate-samples is fairly low and the inter-measurement variability quite high. In some unfortunate cases proteins with rather constant quantities may appear as false positives when searching for proteins who’s abundance changes between two groups of samples : If the apparent variability is by chance too low, the respective standard-deviations will be low and a plain t-test may give very enthousiastic p-values.
Next to stringent filtering (previous section of this vignette) the shrinkage when estimating the intra-group/replicate variance from the Bioconductor package limma turns out very helpful, see Ritchie et al 2015 (doi: 10.1093/nar/gkv007). In this package the function eBayes() has been used an it’s use adopted to proteomics.

The function testRobustToNAimputation() performs NA-imputation and statistical testing (after repeated imputation) between all groups of samples the same time (as it would be inefficient to separate these two tasks). The tests underneith apply shrinkage from the empirical Bayes procedure from the bioconductor package limma. In addition, various formats of multiple test correction can be directly added to the results : Benjamini-Hochberg FDR, local false discovery rate (lfdr, using the package fdrtool, see Strimmer 2008 doi: 10.1093/bioinformatics/btn209), or modified testing by ROTS, etc …

The fact that a single round of NA-imputation may provoque false positives as well as false negatives, made it necessary to combine this (iterative) process of NA-imputation and subsequent testing in one single function.

## Impute NA-values repeatedly and run statistical testing after each round of imputations
testPL <- testRobustToNAimputation(dataPL$quant, gr=gl(3,4), lfdrInclude=FALSE) 
#>  -> matrixNAneighbourImpute :  n.woNA= 4012  n.NA = 488
#>     model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values
#>     imputation: mean= 16.4   sd= 0.62
#>  -> combineMultFilterNAimput :     at presenceFilt:   352 360 357   out of  375 
#>  -> combineMultFilterNAimput :     at abundanceFilt:  346 355 353
#>  -> combineMultFilterNAimput :    at NA> mean:   284, 338 and 288
## Note, this example is too small for reliable lfdr estimation (would give warning)

## test results: classical BH FDR
head(testPL$BH)
#>               1-2       1-3          2-3
#> [1,] 1.277109e-01 0.1607831 2.065986e-01
#> [2,]           NA 0.8787038           NA
#> [3,] 1.358713e-04 0.9537559 1.132843e-04
#> [4,] 2.051206e-05 0.6742809 7.391326e-06
#> [5,] 3.871227e-06 0.1925214 4.751582e-07
#> [6,] 3.886196e-07 0.3789325 4.504443e-07
sum(testPL$BH[,1] < 0.05, na.rm=TRUE)
#> [1] 242

## the data after repeated NA-imputation
head(testPL$datImp)
#>          A_01     A_02     A_03     A_04     B_01     B_02     B_03     B_04
#> [1,] 22.56945 22.19699 22.39452 22.42658 22.48411 22.66792 22.41511 22.90290
#> [2,] 17.61153 18.00776 16.35186 17.15051 16.29294 16.39084 16.34952 16.30732
#> [3,] 25.40560 25.28552 25.29061 25.33628 24.08922 24.01434 23.94017 24.86961
#> [4,] 19.35777 19.39723 19.73350 19.06219 17.77645 17.89538 18.11154 17.76764
#> [5,] 23.43934 23.55885 23.47963 23.28007 22.72747 22.61432 22.56039 22.50835
#> [6,] 25.36381 25.32356 25.17127 25.20114 24.21076 24.23250 24.10540 24.12083
#>          C_01     C_02     C_03     C_04
#> [1,] 22.97558 22.85138 22.70813 22.64852
#> [2,] 17.26139 17.44857 17.38586 17.40649
#> [3,] 25.32381 25.35219 25.33581 25.38618
#> [4,] 19.64330 19.08661 19.58276 19.89625
#> [5,] 23.64638 23.70737 23.60144 23.63291
#> [6,] 25.22358 25.07744 25.14947 25.09119

Acknowledgements

The author wants to acknowledge the support by the IGBMC (CNRS UMR 7104, Inserm U 1258), CNRS, IGBMC, Université de Strasbourg and Inserm. My collegues from the proteomics platform at the IGBMC work very commited to provide high quality mass-spectrometry data (including some of those used here). Furthermore, many very fruitful discussions with colleages on national and international level have helped to improve and disseminate the tools presented here.

Session-Info

#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                   LC_CTYPE=French_France.1252   
#> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=French_France.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] wrProteo_1.1.3 wrMisc_1.2.4  
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.25      R.methodsS3_1.8.0  magrittr_1.5       evaluate_0.14     
#>  [5] sm_2.2-5.6         rlang_0.4.6        stringi_1.4.6      wrGraph_1.0.2     
#>  [9] limma_3.44.3       R.oo_1.23.0        R.utils_2.9.2      rmarkdown_2.3     
#> [13] RColorBrewer_1.1-2 tools_4.0.2        stringr_1.4.0      xfun_0.15         
#> [17] yaml_2.2.1         compiler_4.0.2     tcltk_4.0.2        htmltools_0.5.0   
#> [21] knitr_1.29