MMPatho: Leveraging Multilevel Consensus and Evolutionary Information for Enhanced Missense Mutation Pathogenic Prediction

Overview

Journal J Chem Inf Model

Publisher American Chemical Society

Specialties Chemistry
Medical Informatics

Date 2023 Nov 10

PMID 37947586

Authors

Fang Ge

Muhammad Arif

Zihao Yan

Hanin Alahmadi

Apilak Worachartcheewan

Dong-Jun Yu

Watshara Shoombuatong

Abstract

Understanding the pathogenicity of missense mutations (MMs) is essential for shedding light on genetic diseases, gene functions, and individual variations. In this study, we propose a novel computational approach, called MMPatho, for enhancing the prediction of missense mutation pathogenicity. First, we established a large-scale nonredundant MM benchmark data set based on the entire Ensembl database, complemented by a focused blind test set specifically for pathogenic gain-of-function/loss-of-function (GOF/LOF) MMs. Based on this data set, for each mutation, we utilized Ensembl VEP v104 and dbNSFP v4.1a to extract variant-level, amino acid-level, individuals' outputs, and genome-level features. Additionally, protein sequences were generated using ENSP identifiers with the Ensembl API and then encoded. The ESM-1b and ProtTrans-T5 embeddings of the mutant sites were subsequently extracted. Building on these efforts, we developed our model group (MMPatho), which comprises ConsMM and EvoIndMM. Specifically, ConsMM employs individuals' outputs and XGBoost with SHAP explanation analysis, while EvoIndMM investigates the potential enhancement of predictive capability by incorporating evolutionary information from the embeddings of two large protein language models, ESM-1b and ProtT5-XL-U50. In rigorous comparative experiments, ConsMM and EvoIndMM achieved AUROC values of 0.9836 and 0.9854 and AUPR values of 0.9852 and 0.9902, respectively, on the blind test set, which shares no variations or proteins with the training data, thus highlighting the superiority of our computational approach in the prediction of MM pathogenicity. Our Web server, available at http://csbio.njust.edu.cn/bioinf/mmpatho/, allows researchers to predict the pathogenicity (alongside the reliability index score) of MMs using the ConsMM and EvoIndMM models and provides extensive annotations for user input. Additionally, the newly constructed benchmark data set and blind test set can be accessed via the data page of our Web server.
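For readers less familiar with the two metrics reported above, the following is a minimal sketch of how AUROC and AUPR can be computed from binary labels (1 = pathogenic, 0 = benign) and predicted scores. It is not the authors' evaluation code. The function names and the toy data are hypothetical, and AUPR is approximated here by step-wise average precision, which is an assumption because the abstract does not state which estimator was used.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outscores a random negative.

    Tied scores count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    Assumption: used here as a stand-in for AUPR. Tied scores are ranked
    in input order, so ties are not given special treatment.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank predictions from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    true_pos = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy data, for illustration only.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]

print(round(auroc(toy_labels, toy_scores), 4))              # 0.8889
print(round(average_precision(toy_labels, toy_scores), 4))  # 0.9167
```

The pairwise AUROC loop is quadratic in the number of examples, which is fine for illustration. For a data set of realistic size, a rank-based implementation or a library routine would be the more practical choice.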

Citing Articles

StackAHTPs: An explainable antihypertensive peptides identifier based on heterogeneous features and stacked learning approach.

Ghulam A, Arif M, Unar A, A Thafar M, Albaradei S, Worachartcheewan A IET Syst Biol. 2025; 19(1):e70002.

PMID: 39905861 PMC: 11794993. DOI: 10.1049/syb2.70002.

TCellPredX: A Novel Approach for Accurate Prediction of Hepatitis C Virus Linear T Cell Epitopes.

Ge F, Li H, Zhang M, Arif M, Alam T ACS Omega. 2025; 9(52):51494-51507.

PMID: 39758636 PMC: 11696426. DOI: 10.1021/acsomega.4c08715.

PRITrans: A Transformer-Based Approach for the Prediction of the Effects of Missense Mutation on Protein-RNA Interactions.

Ge F, Li C, Zhang C, Zhang M, Yu D Int J Mol Sci. 2024; 25(22).

PMID: 39596413 PMC: 11594650. DOI: 10.3390/ijms252212348.

Deep-m5U: a deep learning-based approach for RNA 5-methyluridine modification prediction using optimized feature integration.

Noor S, Naseem A, Awan H, Aslam W, Khan S, AlQahtani S BMC Bioinformatics. 2024; 25(1):360.

PMID: 39563239 PMC: 11577875. DOI: 10.1186/s12859-024-05978-1.

MetaCGRP is a high-precision meta-model for large-scale identification of CGRP inhibitors using multi-view information.

Schaduangrat N, Khemawoot P, Jiso A, Charoenkwan P, Shoombuatong W Sci Rep. 2024; 14(1):24764.

PMID: 39433940 PMC: 11494111. DOI: 10.1038/s41598-024-75487-x.
