
Utilizing Biological Experimental Data and Molecular Dynamics for the Classification of Mutational Hotspots Through Machine Learning

Overview
Journal Bioinform Adv
Specialty Biology
Date 2024 Sep 6
PMID 39239360
Abstract

Motivation: Benzo[a]pyrene, a notorious DNA-damaging carcinogen, belongs to the family of polycyclic aromatic hydrocarbons commonly found in tobacco smoke. Surprisingly, the nucleotide excision repair (NER) machinery is inefficient at recognizing specific bulky DNA adducts, including Benzo[a]pyrene Diol-Epoxide (BPDE), a Benzo[a]pyrene metabolite. While sequence context is emerging as the leading factor linking the inadequate NER response to BPDE adducts, the precise structural attributes governing these disparities remain poorly understood. We therefore combined molecular dynamics and machine learning to assess the helical distortion caused by BPDE-Guanine adducts in multiple gene contexts. Specifically, we implemented a dual approach involving random forest classification followed by feature selection to identify precise topological features that may distinguish adduct sites of variable repair capacity. Our models were trained using helical data extracted from duplexes representing both BPDE hotspot and nonhotspot sites within the gene, then applied to sites within , , and genes.
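The classify-then-select idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline (their code is in the repositories listed under Availability and Implementation). It assumes scikit-learn and NumPy are installed, uses synthetic data in place of real helical descriptors, and the feature names are hypothetical labels chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-frame helical descriptors (NOT the study's data).
rng = np.random.default_rng(0)
feature_names = ["twist", "roll", "rise", "slide"]  # hypothetical labels
n_frames = 400
labels = rng.integers(0, 2, size=n_frames)  # 1 = hotspot, 0 = nonhotspot
X = rng.normal(size=(n_frames, len(feature_names)))
X[:, 0] += 2.0 * labels  # make "twist" informative by construction

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

# Step 1: random forest classification of hotspot vs nonhotspot frames.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

scores = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}

# Step 2: rank features by impurity-based importance and keep the top two.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
selected = [name for name, _ in ranking[:2]]
```

Impurity-based importance is only one possible selection criterion; permutation importance or recursive feature elimination would be reasonable alternatives, and the abstract does not say which was used.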

Results: Our optimized model consistently achieved accuracy, precision, and F1 scores exceeding 91%. Our feature selection approach revealed that discernible variance in regional base pair rotation played a pivotal role in informing the decisions of our model. Notably, these disparities were highly conserved among and duplexes and appeared to be influenced by the regional GC content. Our findings therefore suggest that conserved topological features distinguish hotspot and nonhotspot sites, and they highlight regional GC content as a potential biomarker for mutation.
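For readers unfamiliar with the term, regional GC content can be computed as the fraction of G and C bases in a window centred on a site. The sketch below is illustrative only: the function name, the window definition (`flank` bases on each side), and the example sequence are assumptions, not values taken from the study.

```python
def regional_gc_content(sequence: str, site: int, flank: int) -> float:
    """Fraction of G/C bases in the window [site - flank, site + flank].

    The window is clipped at the ends of the sequence.
    """
    seq = sequence.upper()
    if not 0 <= site < len(seq):
        raise ValueError("site must index a base in the sequence")
    start = max(0, site - flank)
    end = min(len(seq), site + flank + 1)
    window = seq[start:end]
    return sum(base in "GC" for base in window) / len(window)


# Made-up sequence, for illustration only.
example = "ATGCGCGGATTA"
gc = regional_gc_content(example, site=6, flank=2)  # window "GCGGA"
```

The appropriate window size depends on how "regional" is defined in a given analysis; the abstract does not specify one.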

Availability And Implementation: Code for comparing machine learning classifiers and evaluating their performance is available at https://github.com/jdavies24/ML-Classifier-Comparison, and code for analysing DNA structure with Curves+ and Canal using Random Forest is available at https://github.com/jdavies24/ML-classification-of-DNA-trajectories.
