PMID: 35701420

Implementing the Reuse of Public DIA Proteomics Datasets: from the PRIDE Database to Expression Atlas

Abstract

The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. In contrast to Data Dependent Acquisition datasets, DIA datasets have seen rather limited reuse to date, despite their high potential, because of the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets that combines metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, which makes the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results were compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.

Citing Articles

Integrated View of Baseline Protein Expression in Human Tissues Using Public Data Independent Acquisition Data Sets.

Prakash A, Collins A, Vilmovsky L, Fexova S, Jones A, Vizcaino J. J Proteome Res. 2025; 24(2):685-695.

PMID: 39764611 PMC: 11811993. DOI: 10.1021/acs.jproteome.4c00788.


Integrated Proteomics Analysis of Baseline Protein Expression in Pig Tissues.

Wang S, Collins A, Prakash A, Fexova S, Papatheodorou I, Jones A. J Proteome Res. 2024; 23(6):1948-1959.

PMID: 38717300 PMC: 11165573. DOI: 10.1021/acs.jproteome.3c00741.


PM2.5, component cause of severe metabolically abnormal obesity: An in silico, observational and analytical study.

Lobato S, Castillo-Granada A, Bucio-Pacheco M, Salomon-Soto V, Alvarez-Valenzuela R, Meza-Inostroza P. Heliyon. 2024; 10(7):e28936.

PMID: 38601536 PMC: 11004224. DOI: 10.1016/j.heliyon.2024.e28936.


Computational and Systems Biology Advances to Enable Bioagent Agnostic Signatures.

Lin A, Torres C, Hobbs E, Bardhan J, Aley S, Spencer C. Health Secur. 2024; 22(2):130-139.

PMID: 38483337 PMC: 11044874. DOI: 10.1089/hs.2023.0076.


Expression Atlas update: insights from sequencing data at both bulk and single cell level.

George N, Fexova S, Fuentes A, Madrigal P, Bi Y, Iqbal H. Nucleic Acids Res. 2023; 52(D1):D107-D114.

PMID: 37992296 PMC: 10767917. DOI: 10.1093/nar/gkad1021.

