
GeneMarkS: a Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions

Overview
Specialty Biochemistry
Date 2001 Jun 19
PMID 11410670
Citations 1202
Abstract

Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by combining models of protein-coding and non-coding regions and models of regulatory sites near gene start within an iterative Hidden Markov model based algorithm. The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. We have also observed that GeneMarkS detects prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start. Therefore, sequence motifs related to transcription and translation regulatory sites can be revealed and analyzed with higher precision. These motifs were shown to possess a significant variability, the functional and evolutionary connections of which are discussed.
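The self-training idea described in the abstract (label gene starts with the current model, re-estimate the model of the upstream regulatory site from those labels, and repeat) can be illustrated with a toy sketch. This is not the GeneMarkS implementation: as a simplifying assumption it replaces GeneMark.hmm and the Gibbs sampler with a fixed-width position weight matrix over the bases immediately upstream of each candidate start, uses hard reassignment of starts, and assumes each input sequence ends at its stop codon. All function names and parameters are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
START_CODONS = {"ATG", "GTG", "TTG"}  # common prokaryotic start codons


def candidate_starts(seq, window):
    """In-frame start codons (frame set by the sequence end) with a full upstream window."""
    return [
        i for i in range(window, len(seq) - 2)
        if (len(seq) - i) % 3 == 0 and seq[i:i + 3] in START_CODONS
    ]


def train_pwm(windows, pseudocount=1.0):
    """Per-position log-odds scores against a uniform background (toy assumption)."""
    total = len(windows) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(len(windows[0])):
        counts = Counter(w[pos] for w in windows)
        pwm.append({
            b: math.log(((counts[b] + pseudocount) / total) / 0.25)
            for b in ALPHABET
        })
    return pwm


def score(pwm, window):
    """Sum of log-odds scores of one upstream window under the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))


def self_train(seqs, window=6, max_iter=20):
    """Alternate between re-estimating the upstream model and re-picking starts.

    Assumes every sequence has at least one candidate start.
    """
    cands = [candidate_starts(s, window) for s in seqs]
    # Initial labelling: the most upstream candidate, i.e. the longest ORF.
    chosen = [c[0] for c in cands]
    pwm = None
    for _ in range(max_iter):
        pwm = train_pwm([s[i - window:i] for s, i in zip(seqs, chosen)])
        updated = [
            max(c, key=lambda i, s=s: score(pwm, s[i - window:i]))
            for s, c in zip(seqs, cands)
        ]
        if updated == chosen:  # labels are stable, so the model is too
            break
        chosen = updated
    return chosen, pwm


# Made-up toy data: three sequences with one candidate start each,
# and one with a decoy start upstream of the motif-preceded start.
single = "AGGAGGATGAAATAA"
decoy = "CTCTCTGTGAGGAGGATGAAATAA"
chosen, pwm = self_train([single, single, single, decoy])
# chosen == [6, 6, 6, 15]
```

In this toy input the last sequence is first labelled at the decoy start (position 6) because it gives the longest open reading frame; after one round of re-estimation the upstream model favours the start at position 15, which is preceded by the same hexamer as the other sequences.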

Citing Articles

The Complete Genome Sequences of Bacteriophages ASegato, DejaVu, Judebell, and RicoCaldo isolated using .

Logan R, Biratu M, Busila M, Busto I, Caldwell N, Chestnut P MicroPubl Biol. 2025; 2025.

PMID: 40052136 PMC: 11883469. DOI: 10.17912/micropub.biology.001443.


Tabrizicola caldifontis sp. nov., Isolated from Hot Spring Sediment Sample.

Habib N, Khan I, Saqib M, Hejazi M, Tarhriz V, Jan S Curr Microbiol. 2025; 82(4):172.

PMID: 40050427 DOI: 10.1007/s00284-025-04156-7.


The Genome Sequences of Baculoviruses from the Tufted Apple Bud Moth, , Reveal Recombination Between an Alphabaculovirus and a Betabaculovirus from the Same Host.

Harrison R, Jansen M, Fife A, Rowley D Viruses. 2025; 17(2).

PMID: 40006957 PMC: 11861948. DOI: 10.3390/v17020202.


Exploring Viral Interactions in Species: In Silico Analysis of Prophage Prevalence and Antiviral Defenses.

Rubi-Rangel L, Leon-Felix J, Villicana C Life (Basel). 2025; 15(2).

PMID: 40003596 PMC: 11856565. DOI: 10.3390/life15020187.


Isolation and characterization of 24 phages infecting the plant growth-promoting rhizobacterium Klebsiella sp. M5al.

Gittrich M, Sanderson C, Wainaina J, Noel C, Leopold J, Babusci E PLoS One. 2025; 20(2):e0313947.

PMID: 39982899 PMC: 11845039. DOI: 10.1371/journal.pone.0313947.

