Introduction to ParseMSF

Loading peptide information from a ThermoFisher MSF

The ParseMSF package provides several functions for inspecting ThermoFisher MSF files. The most useful of these functions is make_area_table, which constructs a data frame containing all peptides and their corresponding peak areas. This data frame also includes protein information (protein_desc) for each peptide.

NOTE: Only ThermoFisher MSF files generated by Proteome Discoverer 1.4.x are supported. Using ParseMSF functions with a file produced by any other version of Proteome Discoverer may produce unexpected results.

library(parsemsf)

# Replace `parsemsf_example("test_db.msf")` with the path to a ThermoFisher MSF file
area_table <- make_area_table(parsemsf_example("test_db.msf"))
knitr::kable(head(area_table))

peptide_id	spectrum_id	protein_desc	sequence	area	mass	m_z	charge	intensity	first_scan
27146	15646	NP_041997.1	AALTDQVALGK	55120084	1086.616	544.0622	2	634147.8	17577
27177	15663	NP_041997.1	AALTDQVALGK	55120084	1086.615	544.0622	2	721063.0	17595
35484	20122	NP_041997.1	ANFQADQIIAK	37046635	1218.648	610.5803	2	152654.2	22420
35511	20136	NP_041997.1	ANFQADQIIAK	37046635	1218.648	610.5803	2	169355.0	22436
37869	21360	NP_041997.1	TQAAYLAPGENLDDK	NA	1605.775	NA	2	382864.4	23744
37913	21384	NP_041997.1	TQAAYLAPGENLDDK	NA	1605.775	NA	2	282891.8	23769

See the documentation for make_area_table for a description of each column.
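
In the example output above, each peptide sequence appears more than once and some rows have no peak area (NA). As a minimal sketch of one way to tidy such a table before further analysis, the snippet below retypes a subset of the columns from the rows shown above by hand, so that it runs without an MSF file. It then drops rows without an area and collapses the rest to one row per sequence. It assumes, as in this example, that rows sharing a sequence also share an area.

```r
# Rows retyped from the example output above (subset of columns only)
area_table <- data.frame(
  peptide_id = c(27146, 27177, 35484, 35511, 37869, 37913),
  sequence = c("AALTDQVALGK", "AALTDQVALGK", "ANFQADQIIAK",
               "ANFQADQIIAK", "TQAAYLAPGENLDDK", "TQAAYLAPGENLDDK"),
  area = c(55120084, 55120084, 37046635, 37046635, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only peptides with a measured peak area
quantified <- area_table[!is.na(area_table$area), ]

# Collapse to one row per peptide sequence and area
unique_areas <- unique(quantified[c("sequence", "area")])
print(unique_areas)
```

With a real file, the same two steps could be applied directly to the data frame returned by make_area_table.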

Estimating protein abundances

The peak area information stored in one or more ThermoFisher MSF files can be used to estimate protein abundances. The quantitate function estimates these abundances across one or more technical replicates, which are typically separate mass spectrometry injections of the same biological sample. Providing multiple technical replicates yields more accurate protein abundance estimates.

# Replace each `parsemsf_example(...)` call with the path to a ThermoFisher MSF file
abundances <- quantitate(c(parsemsf_example("test_db.msf"), 
                           parsemsf_example("test_db2.msf")))

## Now processing:  /private/var/folders/vb/tc7jl5s13nl5x1znxrszfg2c0000gn/T/RtmpTJHmJu/Rinst1336c1644b688/parsemsf/extdata/test_db.msf

## Now processing:  /private/var/folders/vb/tc7jl5s13nl5x1znxrszfg2c0000gn/T/RtmpTJHmJu/Rinst1336c1644b688/parsemsf/extdata/test_db2.msf

## Quantitating...

knitr::kable(head(abundances))

protein_desc	area_mean	area_sd	peps_per_rep
NP_041997.1	0.0917469	0.0207773	3

Abundances are estimated by averaging the areas of the three most abundant peptides (area_mean) (Silva et al. 2006). When given multiple technical replicates, quantitate will, by default, match peptides across replicates. That is, it averages areas only from peptides that are present in both technical replicates. The number of unique peptides used to estimate each protein abundance is given by peps_per_rep.
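
The top-three approach with peptide matching can be sketched in base R as follows. This is one plausible reading of the description above, not the implementation used by quantitate: the peptide names and areas are invented, the ranking by mean area across replicates is an assumption, and no normalization is attempted.

```r
# Hypothetical peptide areas for one protein in two technical replicates.
# All names and values are invented for illustration.
rep1 <- data.frame(sequence = c("PEPA", "PEPB", "PEPC", "PEPD", "PEPE"),
                   area = c(50, 40, 30, 20, 10))
rep2 <- data.frame(sequence = c("PEPA", "PEPB", "PEPC", "PEPD", "PEPF"),
                   area = c(56, 38, 32, 24, 90))

# Keep only peptides observed in both replicates (inner join on sequence)
matched <- merge(rep1, rep2, by = "sequence", suffixes = c("_rep1", "_rep2"))

# Assumption: rank matched peptides by mean area across replicates
matched$mean_area <- rowMeans(matched[c("area_rep1", "area_rep2")])
top3 <- head(matched[order(matched$mean_area, decreasing = TRUE), ], 3)

# Per-replicate estimate: mean area of the three selected peptides
rep_estimates <- colMeans(top3[c("area_rep1", "area_rep2")])

# Summarize across replicates
area_mean <- mean(rep_estimates)
area_sd <- sd(rep_estimates)
```

Note that PEPF has the largest area in the second replicate but is excluded, because it was not observed in the first.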

Protein abundances can also be estimated from a single ThermoFisher MSF file.

# Replace `parsemsf_example("test_db.msf")` with the path to a ThermoFisher MSF file
abundances <- quantitate(parsemsf_example("test_db.msf"))

## Now processing:  /private/var/folders/vb/tc7jl5s13nl5x1znxrszfg2c0000gn/T/RtmpTJHmJu/Rinst1336c1644b688/parsemsf/extdata/test_db.msf

## Quantitating...

knitr::kable(head(abundances))

protein_desc	area_mean	area_sd	peps_per_rep
NP_041997.1	0.0963672	0.0250473	3

Inspecting distribution of peptides within a protein

The ParseMSF package includes a function for inspecting the distribution of peptides within a single protein. The map_peptides function produces a data frame of peptides with their respective locations within the protein sequence.

peptide_locs <- map_peptides(parsemsf_example("test_db.msf"))

# Select columns with start and end locations
peptide_locs <- peptide_locs[c("peptide_id", "protein_desc", 
                               "peptide_sequence", "start", "end")]

knitr::kable(head(peptide_locs))

peptide_id	protein_desc	peptide_sequence	start	end
27146	NP_041997.1	AALTDQVALGK	172	182
27177	NP_041997.1	AALTDQVALGK	172	182
35484	NP_041997.1	ANFQADQIIAK	314	324
35511	NP_041997.1	ANFQADQIIAK	314	324
37869	NP_041997.1	TQAAYLAPGENLDDK	69	83
37913	NP_041997.1	TQAAYLAPGENLDDK	69	83
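
To illustrate what the start and end columns represent, the sketch below locates two peptides in a protein sequence by exact string matching. How map_peptides derives these positions internally is not described here, so this is only an assumed approach. The protein sequence is invented for illustration, and find_start is a hypothetical helper.

```r
# Hypothetical protein sequence, invented for illustration
protein <- "MKTAALTDQVALGKRSANFQADQIIAKLL"
peptides <- c("AALTDQVALGK", "ANFQADQIIAK")

# Hypothetical helper: 1-based position of the first exact match,
# or -1 if the peptide does not occur in the protein
find_start <- function(pep) as.integer(regexpr(pep, protein, fixed = TRUE))

start <- vapply(peptides, find_start, integer(1))

# The end position is inclusive, so subtract one after adding the length
end <- start + nchar(peptides) - 1L

data.frame(peptide_sequence = peptides, start = start, end = end,
           row.names = NULL)
```

Under this convention, an 11-residue peptide starting at position 172 ends at position 182, which is consistent with the rows shown above.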

We can plot these peptide locations with the ggplot2 and dplyr packages.

library(ggplot2)
library(dplyr)

peptide_summary <- peptide_locs %>% 
  group_by(start, end) %>%
  summarize(spectral_count = n()) # Count peptides

pep_plot <- ggplot(peptide_summary,
       aes(x = start, xend = end, y = spectral_count, yend = spectral_count)) +
  geom_segment(size = 1) +
  ylim(0, 5) + 
  xlab("peptide position within protein") +
  ylab("peptide count")

pep_plot
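
The same start and end positions can also be summarized numerically. As a minimal sketch, the snippet below counts how many distinct residues of the protein are covered by at least one peptide, using the six rows shown earlier retyped by hand. Expressing this as a fraction of the protein would require the protein length, which is not part of the output shown here.

```r
# Start and end positions retyped from the example output above
peptide_locs <- data.frame(
  start = c(172, 172, 314, 314, 69, 69),
  end   = c(182, 182, 324, 324, 83, 83)
)

# Expand every peptide into the residue positions it spans, then
# deduplicate so that overlapping or repeated peptides count once
covered <- unique(unlist(Map(seq, peptide_locs$start, peptide_locs$end)))

# Number of distinct residues covered by at least one peptide
n_covered <- length(covered)
print(n_covered)
```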

Benjamin R. Jack

2017-12-09

References