
Deciphering Microbial Gene Function Using Natural Language Processing

Overview
Journal: Nat Commun
Specialty: Biology
Date: 2022 Sep 29
PMID: 36175448
Abstract

Revealing the function of uncharacterized genes is a fundamental challenge in an era of ever-increasing volumes of sequencing data. Here, we present a concept for tackling this challenge using deep learning methodologies adopted from natural language processing (NLP). We repurpose NLP algorithms to model "gene semantics" based on a biological corpus of more than 360 million microbial genes within their genomic context. We use the language models to predict functional categories for 56,617 genes and find that out of 1369 genes associated with recently discovered defense systems, 98% are inferred correctly. We then systematically evaluate the "discovery potential" of different functional categories, pinpointing those with the most genes yet to be characterized. Finally, we demonstrate our method's ability to discover systems associated with microbial interaction and defense. Our results highlight that combining microbial genomics and language models is a promising avenue for revealing gene functions in microbes.

Citing Articles

Molecular basis of foreign DNA recognition by BREX anti-phage immunity system.

Drobiazko A, Adams M, Skutel M, Potekhina K, Kotovskaya O, Trofimova A. Nat Commun. 2025; 16(1):1825.

PMID: 39979294. PMC: 11842806. DOI: 10.1038/s41467-025-57006-2.


Throw out an oligopeptide to catch a protein: Deep learning and natural language processing-screened tripeptide PSP promotes Osteolectin-mediated vascularized bone regeneration.

Chen Y, Chen L, Wu J, Xu X, Yang C, Zhang Y. Bioact Mater. 2024; 46:37-54.

PMID: 39734571. PMC: 11681832. DOI: 10.1016/j.bioactmat.2024.11.011.


Application of machine learning based genome sequence analysis in pathogen identification.

Gao Y, Liu M. Front Microbiol. 2024; 15:1474078.

PMID: 39417073. PMC: 11480060. DOI: 10.3389/fmicb.2024.1474078.


Quest for Orthologs in the Era of Biodiversity Genomics.

Langschied F, Bordin N, Cosentino S, Fuentes-Palacios D, Glover N, Hiller M. Genome Biol Evol. 2024; 16(10).

PMID: 39404012. PMC: 11523110. DOI: 10.1093/gbe/evae224.


Diverse anti-defence systems are encoded in the leading region of plasmids.

Samuel B, Mittelman K, Croitoru S, Ben Haim M, Burstein D. Nature. 2024; 635(8037):186-192.

PMID: 39385022. PMC: 11541004. DOI: 10.1038/s41586-024-07994-w.

