
Distribution Rules of 8-mer Spectra and Characterization of Evolution State in Animal Genome Sequences

Overview
Journal BMC Genomics
Publisher BioMed Central
Specialty Genetics
Date 2024 Sep 12
PMID 39266973
Abstract

Background: Studying the composition rules and evolution mechanisms of genome sequences is a core issue in the post-genomic era, and k-mer spectrum analysis of genome sequences is an effective means of addressing it.

Results: We divided all 8-mers of a genome sequence into 16 XY types according to the number of XY dinucleotides each 8-mer contains. Previous work showed that independent unimodal distributions are observed only in the three CG-type 8-mer spectra, whereas the non-CG-type 8-mer spectra show no such universal pattern from prokaryotes to eukaryotes. On this basis, we analyzed how the distributions of the non-CG-type 8-mer spectra vary across 889 animal genome sequences. Following the evolutionary order of animals from primitive to more complex, we found that the spectrum distributions gradually transition from unimodal to tri-modal. The relative distance from the average frequency of each non-CG-type 8-mer set to the center frequency differs both within a species and among species. For the 8-mers that contain CG dinucleotides, we further divided them into 16 subsets in which each 8-mer contains both CG and XY dinucleotides; these are called XY1_CG1 subsets. We found that the separability values of the XY1_CG1 spectra are closely related to the evolution and specificity of animals. Taking into account the constraint of Chargaff's second parity rule, we finally obtained 10 separability values as the feature set for characterizing the evolution state of genome sequences. To verify the rationality of the feature set, we used 14 common classification algorithms to perform binary classification tests. The accuracy (Acc) ranged from 83.88% to 98.70% among birds, other vertebrates and mammals.
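The partitioning described above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that 8-mers are counted with an overlapping sliding window, that an 8-mer is assigned to an XY subset by whether it contains zero, one, or two or more XY dinucleotides (suggested by the "three CG-type" spectra, but not stated in the abstract), and that a spectrum is a histogram of how many distinct 8-mers occur a given number of times. All function names are hypothetical. The last function shows why Chargaff's second parity rule leaves 10 features: pairing each of the 16 dinucleotides with its reverse complement gives 10 distinct classes.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(sequence, k=8):
    """Count overlapping k-mers, skipping windows with non-ACGT symbols."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set(BASES):
            counts[kmer] += 1
    return counts


def dinucleotide_count(kmer, xy):
    """Number of (possibly overlapping) occurrences of dinucleotide xy."""
    return sum(1 for i in range(len(kmer) - 1) if kmer[i:i + 2] == xy)


def xy_subset(kmer, xy):
    """Assumed labelling: 0, 1, or 2 (meaning two or more) XY dinucleotides."""
    return min(dinucleotide_count(kmer, xy), 2)


def subset_spectrum(counts, xy, subset):
    """Histogram for one subset: occurrence count -> number of distinct k-mers.

    Only k-mers observed at least once are included in this sketch.
    """
    return Counter(
        n for kmer, n in counts.items() if xy_subset(kmer, xy) == subset
    )


def parity_classes():
    """Group the 16 dinucleotides with their reverse complements."""
    classes = set()
    for a, b in product(BASES, repeat=2):
        xy = a + b
        classes.add(frozenset((xy, reverse_complement(xy))))
    return classes


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # toy sequence, not real genome data
    counts = count_kmers(toy)
    print(subset_spectrum(counts, "CG", 2))
    print(len(parity_classes()))
```

Four dinucleotides (AT, TA, CG, GC) are their own reverse complements and the other 12 form 6 pairs, which accounts for the 10 classes. How the separability values themselves are computed is not given in the abstract, so the sketch stops at the spectra.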

Conclusion: We proposed a credible feature set to characterize the evolution state of genomes and obtained satisfactory results with it in the large-scale classification of animals.
