Protein Identification Using Customized Protein Sequence Databases Derived from RNA-Seq Data

Overview

Journal J Proteome Res

Publisher American Chemical Society

Specialty Biochemistry

Date 2011 Nov 23

PMID 22103967

Citations 90

Authors

Xiaojing Wang

Robbert J C Slebos

Dong Wang

Patrick J Halvey

David L Tabb

Daniel C Liebler

Bing Zhang


Abstract

The standard shotgun proteomics data analysis strategy relies on searching MS/MS spectra against a context-independent protein sequence database derived from the complete genome sequence of an organism. Because transcriptome sequence analysis (RNA-Seq) promises an unbiased and comprehensive picture of the transcriptome, we reason that a sample-specific protein database derived from RNA-Seq data can better approximate the real protein pool in the sample and thus improve protein identification. In this study, we have developed a two-step strategy for building sample-specific protein databases from RNA-Seq data. First, the database size is reduced by eliminating unexpressed or lowly expressed genes according to transcript quantification. Second, high-quality nonsynonymous coding single nucleotide variations (SNVs) are identified based on RNA-Seq data, and corresponding protein variants are added to the database. Using RNA-Seq and shotgun proteomics data from two colorectal cancer cell lines, SW480 and RKO, we demonstrated that customized protein sequence databases could significantly increase the sensitivity of peptide identification, reduce ambiguity in protein assembly, and enable the detection of known and novel peptide variants. Thus, sample-specific databases from RNA-Seq data can enable more sensitive and comprehensive protein discovery in shotgun proteomics studies.
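
The two-step strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `build_custom_database`, the `Variant` record, the expression cutoff of 1.0, the per-protein (rather than per-gene) expression lookup, and the variant naming scheme are all assumptions made for illustration. The abstract does not state the actual thresholds or the variant-calling criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A nonsynonymous SNV already translated to the protein level (hypothetical record)."""
    protein_id: str
    position: int  # 1-based residue position
    ref: str       # reference amino acid
    alt: str       # variant amino acid


def build_custom_database(proteins, expression, variants, min_expression=1.0):
    """Return a sample-specific {id: sequence} database.

    proteins:       {protein_id: amino-acid sequence}
    expression:     {protein_id: expression estimate}; the unit and the
                    cutoff value are assumptions, not taken from the paper
    variants:       iterable of Variant records
    """
    # Step 1: drop proteins whose transcript is unexpressed or lowly expressed.
    database = {
        pid: seq
        for pid, seq in proteins.items()
        if expression.get(pid, 0.0) >= min_expression
    }

    # Step 2: add one variant sequence per SNV that hits a retained protein.
    for v in variants:
        seq = database.get(v.protein_id)
        if seq is None:
            continue  # protein was filtered out in step 1
        idx = v.position - 1
        if idx < 0 or idx >= len(seq) or seq[idx] != v.ref:
            continue  # reference residue does not match; skip the call
        variant_seq = seq[:idx] + v.alt + seq[idx + 1:]
        database[f"{v.protein_id}_{v.ref}{v.position}{v.alt}"] = variant_seq

    return database


# Toy input: P2 is lowly expressed, and the P3 variant has a mismatched reference.
proteins = {"P1": "MKTAYIAK", "P2": "MSSGLLEK", "P3": "MATRQWLK"}
expression = {"P1": 12.5, "P2": 0.2, "P3": 3.0}
variants = [
    Variant("P1", 3, "T", "A"),
    Variant("P2", 2, "S", "L"),
    Variant("P3", 4, "K", "E"),
]

custom_db = build_custom_database(proteins, expression, variants)
```

In this toy input, the expression filter removes P2, so its variant is never added, and the P3 variant is skipped because its reference residue does not match the sequence. Keeping the reference sequence alongside each variant entry lets a search engine match either form of the peptide.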

Citing Articles

Chemoproteogenomic stratification of the missense variant cysteinome.

Desai H, Andrews K, Bergersen K, Ofori S, Yu F, Shikwana F Nat Commun. 2024; 15(1):9284.

PMID: 39468056 PMC: 11519605. DOI: 10.1038/s41467-024-53520-x.

Moving Toward Metaproteogenomics: A Computational Perspective on Analyzing Microbial Samples via Proteogenomics.

Singer F, Kuhring M, Renard B, Muth T Methods Mol Biol. 2024; 2859:297-318.

PMID: 39436609 DOI: 10.1007/978-1-0716-4152-1_17.

A practical introduction to holo-omics.

Odriozola I, Rasmussen J, Gilbert M, Limborg M, Alberdi A Cell Rep Methods. 2024; 4(7):100820.

PMID: 38986611 PMC: 11294832. DOI: 10.1016/j.crmeth.2024.100820.

Transcription factors and splice factors - interconnected regulators of stem cell differentiation.

Mehlferber M, Kuyumcu-Martinez M, Miller C, Sheynkman G Curr Stem Cell Rep. 2024; 9(2):31-41.

PMID: 38939410 PMC: 11210451. DOI: 10.1007/s40778-023-00227-2.

moPepGen: Rapid and Comprehensive Identification of Non-canonical Peptides.

Zhu C, Liu L, Ha A, Yamaguchi T, Zhu H, Hugh-White R bioRxiv. 2024; .

PMID: 38585946 PMC: 10996593. DOI: 10.1101/2024.03.28.587261.
