Distribution Rules of 8-mer Spectra and Characterization of Evolution State in Animal Genome Sequences

Overview

Journal BMC Genomics

Publisher BioMed Central

Specialty Genetics

Date 2024 Sep 12

PMID 39266973

Authors

Xiaolong Li

Hong Li

Zhenhua Yang

Lu Wang


Abstract

Background: Understanding the composition rules and evolutionary mechanisms of genome sequences is a core issue in the post-genomic era, and k-mer spectrum analysis of genome sequences is an effective way to address it.

Result: We divided all 8-mers of a genome sequence into 16 XY types according to the number of XY dinucleotides each 8-mer contains. Previous work showed that independent unimodal distributions are observed only in the three CG-type 8-mer spectra, whereas non-CG-type 8-mer spectra show no such universal pattern from prokaryotes to eukaryotes. On this basis, we analyzed how the distributions of non-CG-type 8-mer spectra vary across 889 animal genome sequences. Following the evolutionary order of animals from primitive to more complex, we found that the spectrum distributions gradually transition from unimodal to trimodal. The relative distance from the average frequency of the 8-mers of each non-CG type to the center frequency differs both within a species and among species. We further divided the 8-mers that contain CG dinucleotides into 16 subsets, called XY1_CG1 subsets, in which each 8-mer contains both CG and XY dinucleotides. We found that the separability values of the XY1_CG1 spectra are closely related to the evolution and specificity of animals. Taking into account the constraint of Chargaff's second parity rule, we finally obtained 10 separability values as a feature set to characterize the evolution state of genome sequences. To verify the rationality of the feature set, we ran binary classification tests with 14 common classification algorithms. The accuracy (Acc) ranged from 83.88% to 98.70% among birds, other vertebrates, and mammals.

Conclusion: We propose a credible feature set that characterizes the evolution state of genomes, and it yielded satisfactory results in the large-scale classification of animals.
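The grouping steps described in the abstract can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, and binning 8-mers into classes with 0, 1, or 2-or-more XY dinucleotides is an assumption suggested by the phrase "three CG-type 8-mer spectra". The last function shows why 16 dinucleotides reduce to 10 classes under Chargaff's second parity rule: four dinucleotides (AT, TA, CG, GC) are their own reverse complement, and the remaining twelve pair up into six.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_dinucleotide(kmer, xy):
    """Count occurrences of dinucleotide xy in kmer (overlaps allowed)."""
    return sum(1 for i in range(len(kmer) - 1) if kmer[i:i + 2] == xy)


def kmer_counts(seq, k=8):
    """Count every k-mer in seq, skipping windows with non-ACGT symbols."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def split_by_xy(counts, xy):
    """Group k-mer counts by how many xy dinucleotides each k-mer holds.

    Assumption: three bins, keyed 0, 1, and 2 (meaning two or more).
    """
    groups = {0: {}, 1: {}, 2: {}}
    for kmer, n in counts.items():
        groups[min(count_dinucleotide(kmer, xy), 2)][kmer] = n
    return groups


def parity_classes():
    """Merge each dinucleotide with its reverse complement."""
    classes = set()
    for x, y in product("ACGT", repeat=2):
        xy = x + y
        classes.add(min(xy, reverse_complement(xy)))
    return sorted(classes)


if __name__ == "__main__":
    counts = kmer_counts("AAAAAAAACG")
    print(split_by_xy(counts, "CG"))
    print(len(parity_classes()), parity_classes())
```

Each bin returned by `split_by_xy` holds the raw 8-mer frequencies from which a spectrum (the number of distinct 8-mers at each frequency) could be built. The separability values themselves are not defined in the abstract, so they are not sketched here.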
