SEAOP: a Statistical Ensemble Approach for Outlier Detection in Quantitative Proteomics Data

Overview

Journal: Brief Bioinform

Publisher: Oxford University Press

Specialty: Biology

Date: 2024 Apr 1

PMID: 38557674

Authors

Jinze Huang

Yang Zhao

Bo Meng

Ao Lu

Yaoguang Wei

Lianhua Dong

Xiang Fang

Dong An

Xinhua Dai

Abstract

Quality control in quantitative proteomics is a persistent challenge, particularly in identifying and managing outliers. Unsupervised learning models, which rely on data structure rather than predefined labels, offer potential solutions. However, without clear labels, their effectiveness might be compromised. Single models are susceptible to the randomness of parameters and initialization, which can result in a high rate of false positives. Ensemble models, on the other hand, have been shown to mitigate the impact of such randomness and to help detect true outliers accurately. Therefore, we introduced SEAOP, a Python toolbox that utilizes an ensemble mechanism by integrating multi-round data management and a statistics-based decision pipeline with multiple models. Specifically, SEAOP uses multi-round resampling to create diverse sub-data spaces and employs outlier detection methods to identify candidate outliers in each space. Candidates are then aggregated into confirmed outliers via a chi-square test at a 95% confidence level, to ensure the precision of the unsupervised approaches. Additionally, SEAOP introduces a visualization strategy designed to display the distribution of both outlier and non-outlier samples intuitively. Optimal hyperparameter models of SEAOP for outlier detection were identified by using a gradient-simulated standard dataset and the Mann-Kendall trend test. The performance of the SEAOP toolbox was evaluated using three experimental datasets, confirming its reliability and accuracy in handling quantitative proteomics data.
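
The resample, detect, and aggregate idea described in the abstract can be illustrated with a minimal sketch. This is not SEAOP's implementation or API: the function names, the median/MAD stand-in detector, the 80% subsampling fraction, and the choice to test each sample's flag rate against the pooled flag rate are all assumptions made for illustration. Only the multi-round resampling, the per-round candidate detection, and the chi-square decision at the 95% level come from the abstract.

```python
import random
import statistics

# Chi-square critical value for 1 degree of freedom at alpha = 0.05 (95% level).
CHI2_CRIT_1DF_95 = 3.841


def candidate_outliers(profiles, cutoff=3.5):
    """Flag samples whose distance to the median profile is extreme.

    Hypothetical stand-in detector (median/MAD rule); SEAOP itself
    combines multiple outlier detection models.
    """
    names = list(profiles)
    n_features = len(profiles[names[0]])
    center = [statistics.median(profiles[s][j] for s in names)
              for j in range(n_features)]
    dist = {s: sum((x - c) ** 2 for x, c in zip(profiles[s], center)) ** 0.5
            for s in names}
    med = statistics.median(dist.values())
    mad = statistics.median(abs(d - med) for d in dist.values())
    if mad == 0:
        return set()  # no spread in distances, nothing can be flagged
    return {s for s in names if 0.6745 * (dist[s] - med) / mad > cutoff}


def ensemble_outliers(profiles, rounds=200, fraction=0.8, seed=0):
    """Resample samples over many rounds, then confirm candidates by chi-square."""
    rng = random.Random(seed)
    names = list(profiles)
    size = max(3, int(fraction * len(names)))
    drawn = dict.fromkeys(names, 0)    # rounds in which a sample was included
    flagged = dict.fromkeys(names, 0)  # rounds in which it was a candidate

    for _ in range(rounds):
        subset = rng.sample(names, size)
        hits = candidate_outliers({s: profiles[s] for s in subset})
        for s in subset:
            drawn[s] += 1
        for s in hits:
            flagged[s] += 1

    # Assumed null hypothesis: every sample is flagged at the pooled rate.
    base_rate = sum(flagged.values()) / sum(drawn.values())
    if not 0 < base_rate < 1:
        return []

    confirmed = []
    for s in names:
        n, k = drawn[s], flagged[s]
        if n == 0:
            continue
        exp_hit, exp_miss = n * base_rate, n * (1 - base_rate)
        chi2 = (k - exp_hit) ** 2 / exp_hit + ((n - k) - exp_miss) ** 2 / exp_miss
        # Confirm only samples flagged significantly MORE often than expected.
        if k > exp_hit and chi2 > CHI2_CRIT_1DF_95:
            confirmed.append(s)
    return confirmed


# Toy data: 11 similar samples plus one injected outlier (5 "proteins" each).
profiles = {f"S{i:02d}": [10 + 0.1 * i] * 5 for i in range(11)}
profiles["S_out"] = [20.0] * 5
confirmed = ensemble_outliers(profiles)
print(confirmed)
```

In this toy dataset only the injected sample `S_out` is confirmed. The chi-square approximation used here is unreliable when expected counts are small, so a real analysis would need enough rounds, or an exact test, before trusting the decision.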

Citing Articles

Enhanced Analysis of Low-Abundance Proteins in Soybean Seeds Using Advanced Mass Spectrometry.

Meng B, Huang Y, Lu A, Liao H, Zhai R, Gong X. Int J Mol Sci. 2025; 26(3).

PMID: 39940716 PMC: 11817203. DOI: 10.3390/ijms26030949.

ProteoNet: A CNN-based framework for analyzing proteomics MS-RGB images.

Huang J, Li Y, Meng B, Zhang Y, Wei Y, Dai X. iScience. 2024; 27(12):111362.

PMID: 39679296 PMC: 11638609. DOI: 10.1016/j.isci.2024.111362.
