Implementing the Reuse of Public DIA Proteomics Datasets: from the PRIDE Database to Expression Atlas

Overview

Journal: Sci Data

Specialty: Science

Date: 2022 Jun 14

PMID: 35701420

Authors

Mathias Walzer

David Garcia-Seisdedos

Ananth Prakash

Paul Brack

Peter Crowther

Robert L Graham

Nancy George

Suhaib Mohammed

Pablo Moreno

Irene Papatheodorou

Simon J Hubbard

Juan Antonio Vizcaino

Abstract

The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Compared with Data Dependent Acquisition datasets, the reuse of DIA datasets has been rather limited to date, despite its high potential, owing to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results were compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.

Citing Articles

Integrated View of Baseline Protein Expression in Human Tissues Using Public Data Independent Acquisition Data Sets.

Prakash A, Collins A, Vilmovsky L, Fexova S, Jones A, Vizcaino J. J Proteome Res. 2025; 24(2):685-695.

PMID: 39764611 PMC: 11811993. DOI: 10.1021/acs.jproteome.4c00788.

Integrated Proteomics Analysis of Baseline Protein Expression in Pig Tissues.

Wang S, Collins A, Prakash A, Fexova S, Papatheodorou I, Jones A. J Proteome Res. 2024; 23(6):1948-1959.

PMID: 38717300 PMC: 11165573. DOI: 10.1021/acs.jproteome.3c00741.

PM2.5, component cause of severe metabolically abnormal obesity: An in silico, observational and analytical study.

Lobato S, Castillo-Granada A, Bucio-Pacheco M, Salomon-Soto V, Alvarez-Valenzuela R, Meza-Inostroza P. Heliyon. 2024; 10(7):e28936.

PMID: 38601536 PMC: 11004224. DOI: 10.1016/j.heliyon.2024.e28936.

Computational and Systems Biology Advances to Enable Bioagent Agnostic Signatures.

Lin A, Torres C, Hobbs E, Bardhan J, Aley S, Spencer C. Health Secur. 2024; 22(2):130-139.

PMID: 38483337 PMC: 11044874. DOI: 10.1089/hs.2023.0076.

Expression Atlas update: insights from sequencing data at both bulk and single cell level.

George N, Fexova S, Fuentes A, Madrigal P, Bi Y, Iqbal H. Nucleic Acids Res. 2023; 52(D1):D107-D114.

PMID: 37992296 PMC: 10767917. DOI: 10.1093/nar/gkad1021.
