Introduction to ParseMSF

Benjamin R. Jack

2017-12-09

Loading peptide information from a ThermoFisher MSF

The ParseMSF package provides several functions for inspecting ThermoFisher MSF files. The most useful of these functions is make_area_table, which constructs a data frame containing all peptides and their corresponding peak areas. This data frame also includes protein information (protein_desc) for each peptide.

NOTE: Only ThermoFisher MSF files generated by Proteome Discoverer 1.4.x are supported. Using ParseMSF functions with a file produced by any other version of Proteome Discoverer may produce unexpected results.

library(parsemsf)
# Replace `parsemsf_example("test_db.msf")` with the path to a ThermoFisher MSF file
area_table <- make_area_table(parsemsf_example("test_db.msf"))
knitr::kable(head(area_table))
peptide_id  spectrum_id  protein_desc  sequence         area      mass      m_z       charge  intensity  first_scan
27146       15646        NP_041997.1   AALTDQVALGK      55120084  1086.616  544.0622  2       634147.8   17577
27177       15663        NP_041997.1   AALTDQVALGK      55120084  1086.615  544.0622  2       721063.0   17595
35484       20122        NP_041997.1   ANFQADQIIAK      37046635  1218.648  610.5803  2       152654.2   22420
35511       20136        NP_041997.1   ANFQADQIIAK      37046635  1218.648  610.5803  2       169355.0   22436
37869       21360        NP_041997.1   TQAAYLAPGENLDDK  NA        1605.775  NA        2       382864.4   23744
37913       21384        NP_041997.1   TQAAYLAPGENLDDK  NA        1605.775  NA        2       282891.8   23769

See the documentation for make_area_table for a description of each column.
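
Some rows in the table above have no peak area (NA). As a minimal sketch of how such rows might be handled before further analysis, the example below rebuilds a few columns of the printed output as a toy data frame (it is not a call into ParseMSF) and separates spectra with and without an area using base R only.

```r
# Toy stand-in for part of the make_area_table() output, copied from the rows
# printed above. A real table has more columns and rows.
area_table <- data.frame(
  peptide_id   = c(27146, 27177, 35484, 35511, 37869, 37913),
  protein_desc = "NP_041997.1",
  sequence     = c("AALTDQVALGK", "AALTDQVALGK", "ANFQADQIIAK",
                   "ANFQADQIIAK", "TQAAYLAPGENLDDK", "TQAAYLAPGENLDDK"),
  area         = c(55120084, 55120084, 37046635, 37046635, NA, NA),
  stringsAsFactors = FALSE
)

# Count the spectra that have no peak area
n_missing <- sum(is.na(area_table$area))

# Keep only rows that can contribute to an area-based estimate
quantifiable <- area_table[!is.na(area_table$area), ]

# Collapse repeated spectra of the same peptide to one row per sequence/area pair
peptide_areas <- unique(quantifiable[c("sequence", "area")])
peptide_areas
```

Whether rows with a missing area should be dropped or inspected further depends on the analysis; the filtering shown here is only one option.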

Estimating protein abundances

The peak area information stored in one or more ThermoFisher MSF files can be used to estimate protein abundances. The quantitate function estimates these abundances across one or more technical replicates. Technical replicates are typically different mass spectrometry injections of the same biological sample. The quantitate function will produce more accurate protein abundance estimates if it is provided with multiple technical replicates.

# Replace the `parsemsf_example()` calls with paths to your own ThermoFisher MSF files
abundances <- quantitate(c(parsemsf_example("test_db.msf"),
                           parsemsf_example("test_db2.msf")))
## Now processing:  /private/var/folders/vb/tc7jl5s13nl5x1znxrszfg2c0000gn/T/RtmpTJHmJu/Rinst1336c1644b688/parsemsf/extdata/test_db.msf
## Now processing:  /private/var/folders/vb/tc7jl5s13nl5x1znxrszfg2c0000gn/T/RtmpTJHmJu/Rinst1336c1644b688/parsemsf/extdata/test_db2.msf
## Quantitating...
knitr::kable(head(abundances))
protein_desc  area_mean  area_sd    peps_per_rep
NP_041997.1   0.0917469  0.0207773  3

Abundances are estimated by averaging the areas of the three most abundant peptides for each protein (area_mean) (Silva et al. 2006). If provided with multiple technical replicates, quantitate will, by default, estimate protein abundances by matching peptides across technical replicates. That is, it will only average areas from peptides that are present in both technical replicates. The number of unique peptides used to estimate each protein abundance is given by peps_per_rep.
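
The top-three averaging and peptide matching described above can be illustrated with a small base R sketch. This is not the ParseMSF implementation: the peptide names, the area values, and the helper top3_mean are hypothetical, and the way the per-replicate values are pooled into a single number is an assumption made for illustration.

```r
# Hypothetical peptide areas for one protein in two technical replicates.
# PEPE was only detected in the first replicate.
rep1 <- c(PEPA = 50, PEPB = 40, PEPC = 30, PEPD = 20, PEPE = 90)
rep2 <- c(PEPA = 54, PEPB = 36, PEPC = 33, PEPD = 18)

# Hypothetical helper: mean area of the n most abundant peptides
top3_mean <- function(areas, n = 3) {
  mean(head(sort(areas, decreasing = TRUE), n))
}

# Matching step: keep only peptides observed in both replicates
shared <- intersect(names(rep1), names(rep2))

# Top-three mean per replicate, restricted to the shared peptides
matched_means <- c(rep1 = top3_mean(rep1[shared]),
                   rep2 = top3_mean(rep2[shared]))

# Assumption: pool the replicates by averaging the per-replicate means
pooled_mean <- mean(matched_means)

# For comparison: replicate 1 without matching, where PEPE inflates the estimate
unmatched_rep1 <- top3_mean(rep1)
```

In this toy example the unmatched estimate for the first replicate is driven by a peptide that the second replicate never detected, which is the kind of discrepancy that matching is meant to avoid.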

Protein abundances can also be estimated from a single ThermoFisher MSF file.

# Replace `parsemsf_example("test_db.msf")` with the path to a ThermoFisher MSF file
abundances <- quantitate(parsemsf_example("test_db.msf"))
## Now processing:  /private/var/folders/vb/tc7jl5s13nl5x1znxrszfg2c0000gn/T/RtmpTJHmJu/Rinst1336c1644b688/parsemsf/extdata/test_db.msf
## Quantitating...
knitr::kable(head(abundances))
protein_desc  area_mean  area_sd    peps_per_rep
NP_041997.1   0.0963672  0.0250473  3

Inspecting distribution of peptides within a protein

The ParseMSF package includes a function for inspecting the distribution of peptides within a single protein. The map_peptides function produces a data frame of peptides with their respective locations within the protein sequence.

peptide_locs <- map_peptides(parsemsf_example("test_db.msf"))
# Select columns with start and end locations
peptide_locs <- peptide_locs[c("peptide_id", "protein_desc",
                               "peptide_sequence", "start", "end")]
knitr::kable(head(peptide_locs))
peptide_id  protein_desc  peptide_sequence  start  end
27146       NP_041997.1   AALTDQVALGK       172    182
27177       NP_041997.1   AALTDQVALGK       172    182
35484       NP_041997.1   ANFQADQIIAK       314    324
35511       NP_041997.1   ANFQADQIIAK       314    324
37869       NP_041997.1   TQAAYLAPGENLDDK   69     83
37913       NP_041997.1   TQAAYLAPGENLDDK   69     83
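
The start and end values above are 1-based and inclusive: AALTDQVALGK has 11 residues and spans positions 172 to 182. As a rough illustration of how such coordinates can be derived, the sketch below locates peptides in a protein sequence with base R string matching. It is not how map_peptides is implemented; the protein sequence, the peptides, and the helper locate_peptide are all hypothetical, and only the first occurrence of each peptide is reported.

```r
# Hypothetical protein sequence and peptides, for illustration only
protein  <- "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
peptides <- c("AYIAK", "SHFSR", "GLIEVQ", "WWWW")

# Hypothetical helper: 1-based, inclusive start/end of the first exact match,
# or NA when the peptide does not occur in the protein
locate_peptide <- function(peptide, protein) {
  start <- as.integer(regexpr(peptide, protein, fixed = TRUE))
  if (start == -1L) {
    return(c(start = NA_integer_, end = NA_integer_))
  }
  c(start = start, end = start + nchar(peptide) - 1L)
}

# One row per peptide, with columns "start" and "end"
locs <- t(vapply(peptides, locate_peptide,
                 FUN.VALUE = c(start = 0L, end = 0L),
                 protein = protein))
locs
```

A peptide that occurs more than once in a protein would need gregexpr instead of regexpr to recover every location.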

We can plot these peptide locations with the ggplot2 and dplyr packages.

library(ggplot2)
library(dplyr)
peptide_summary <- peptide_locs %>%
  group_by(start, end) %>%
  summarize(spectral_count = n()) # Count peptides at each location
pep_plot <- ggplot(peptide_summary,
                   aes(x = start, xend = end,
                       y = spectral_count, yend = spectral_count)) +
  # ggplot2 >= 3.4.0 prefers `linewidth = 1` over `size = 1` for line geoms
  geom_segment(size = 1) +
  ylim(0, 5) +
  xlab("peptide position within protein") +
  ylab("peptide count")
pep_plot
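
Besides plotting, the start and end columns can be used to count how many residues are covered by at least one peptide. The sketch below is a base R illustration that uses the three distinct peptide locations printed earlier; the data frame is typed in by hand rather than taken from map_peptides. The protein length is not part of that output, so no coverage fraction is computed.

```r
# Distinct peptide locations copied from the printed map_peptides() output
locs <- data.frame(
  sequence = c("AALTDQVALGK", "ANFQADQIIAK", "TQAAYLAPGENLDDK"),
  start    = c(172L, 314L, 69L),
  end      = c(182L, 324L, 83L),
  stringsAsFactors = FALSE
)

# Expand each peptide into the residue positions it spans (inclusive)
covered <- unlist(Map(seq, locs$start, locs$end))

# Number of distinct peptides covering each position
depth <- table(covered)

# Number of residues covered by at least one peptide
n_covered <- length(unique(covered))
n_covered
```

Because these three peptides do not overlap, every covered position has a depth of one.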

References

Silva, JC, MV Gorenstein, GZ Li, JP Vissers, and SJ Geromanos. 2006. “Absolute Quantification of Proteins by LCMSE: A Virtue of Parallel MS Acquisition.” Mol Cell Proteomics 5 (1):144–56. https://doi.org/10.1074/mcp.M500230-MCP200.