
Speeding Genomic Island Discovery Through Systematic Design of Reference Database Composition

Overview
Journal PLoS One
Date 2024 Mar 13
PMID 38478526
Abstract

Background: Genomic islands (GIs) are mobile genetic elements that integrate site-specifically into bacterial chromosomes, bearing genes that affect phenotypes such as pathogenicity and metabolism. GIs typically occur sporadically among related bacterial strains, enabling comparative genomic approaches to GI identification. For a candidate GI in a query genome, the number of reference genomes with a precise deletion of the GI serves as a support value for the GI. Our comparative software for GI identification was slowed by our original use of large reference genome databases (DBs). Here we explore smaller species-focused DBs.
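The support value described above is a simple count, and it can be sketched as follows. This is a minimal illustration, not the authors' software: the `ReferenceHit` record, its field names, and the assumption that precise-deletion detection has already been done upstream are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ReferenceHit:
    """Hypothetical record: one reference genome compared against a candidate GI."""
    genome_id: str
    species: str
    has_precise_deletion: bool  # assumed to be computed by an upstream comparison step


def support_value(hits: Iterable[ReferenceHit]) -> int:
    """Count reference genomes that show a precise deletion of the candidate GI."""
    return sum(1 for h in hits if h.has_precise_deletion)


def support_by_species(hits: Iterable[ReferenceHit], query_species: str) -> Tuple[int, int]:
    """Split the support value into (inside-species, outside-species) counts."""
    inside = outside = 0
    for h in hits:
        if not h.has_precise_deletion:
            continue
        if h.species == query_species:
            inside += 1
        else:
            outside += 1
    return inside, outside


# Illustrative data only; genome and species names are made up.
example_hits = [
    ReferenceHit("ref1", "species_A", True),
    ReferenceHit("ref2", "species_A", False),
    ReferenceHit("ref3", "species_B", True),
]
```

Splitting the count by species makes it possible to ask whether a call is supported only from inside or only from outside the query species, which is the comparison the Results section draws between prophages and FPs.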

Results: With increasing DB size, recovery of our reliable prophage GI calls reached a plateau, while recovery of less reliable GI calls (false positives, FPs) increased rapidly once DB size exceeded ~500 genomes; i.e., overlarge DBs can increase FP rates. Paradoxically, relative to prophages, FPs were both more frequently supported only by genomes outside the species and more frequently supported only by genomes inside the species; this may be due to their generally lower support values. Setting a DB size limit for our SMAll Ranked Tailored (SMART) DB design sped up runtime ~65-fold. Strictly intra-species DBs would tend to lower prophage yields for small species (those with few genomes available); simulations with large species showed that this could be partially overcome by reaching outside the species to closely related taxa, without an FP burden. Employing such taxonomic outreach in DB design generated redundancy in the DB set; as few as 2,984 DBs were needed to cover all 47,894 prokaryotic species.
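One way to picture a size-limited, ranked DB with taxonomic outreach is the sketch below. It is an assumption-laden illustration rather than the published SMART procedure: the `Candidate` record, the `distance` field (any genome-to-query distance estimate), the ranking rule, and the default cap of 500 (borrowed from the ~500-genome threshold reported above) are all hypothetical choices.

```python
from typing import List, NamedTuple, Sequence


class Candidate(NamedTuple):
    """Hypothetical candidate reference genome for a species-focused DB."""
    genome_id: str
    species: str
    distance: float  # assumed distance to the target species; smaller means closer


def build_capped_db(
    candidates: Sequence[Candidate],
    target_species: str,
    max_size: int = 500,  # illustrative cap, not a setting confirmed by the source
) -> List[str]:
    """Pick up to max_size genomes: same-species genomes first, then the
    closest genomes from other taxa to fill any remaining slots."""
    ranked = sorted(
        candidates,
        # False sorts before True, so same-species genomes come first;
        # ties within each group are broken by increasing distance.
        key=lambda c: (c.species != target_species, c.distance),
    )
    return [c.genome_id for c in ranked[:max_size]]


# Illustrative data only; a "small" species A with two genomes available.
example_candidates = [
    Candidate("g1", "A", 0.02),
    Candidate("g2", "B", 0.05),
    Candidate("g3", "A", 0.01),
    Candidate("g4", "C", 0.30),
]
example_db = build_capped_db(example_candidates, "A", max_size=3)
```

With a cap of 3, the two species-A genomes are taken first and the remaining slot goes to the closest outside genome, which mirrors the idea of reaching into related taxa when a species has few genomes of its own.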

Conclusions: Runtime decreased dramatically with SMART DB design, with only minor losses of prophages. We also describe potential utility in other comparative genomics projects.
