Genome Annotation Errors in Pathway Databases Due to Semantic Ambiguity in Partial EC Numbers

Overview

Journal: Nucleic Acids Res

Publisher: Oxford University Press

Specialty: Biochemistry

Date: 2005 Jul 22

PMID: 16034025

Citations: 40

Authors

M L Green

P D Karp


Abstract

We report on a new type of systematic annotation error in genome and pathway databases that results from the misinterpretation of partial Enzyme Commission (EC) numbers such as '1.1.1.-'. This error results in the assignment of genes annotated with a partial EC number to many or all biochemical reactions that are annotated with the same partial EC number. That inference is faulty because of the ambiguous nature of partial EC numbers. We have observed this type of error in multiple databases, including KEGG, VIMSS and IMG, all of which assign genes to KEGG pathways. The Escherichia coli subset of the KEGG database exhibits this error for 6.8% of its gene-reaction assignments. For example, KEGG contains 17 reactions that are annotated with EC 1.1.1.-. A group of three E. coli genes, b1580 [putative dehydrogenase, NAD(P)-binding, starvation-sensing protein], b3787 (UDP-N-acetyl-D-mannosaminuronic acid dehydrogenase) and b0207 (2,5-diketo-D-gluconate reductase B), is assigned to 15 of those reactions, despite experimental evidence indicating different single functions for two of the three genes. Furthermore, the databases (DBs) are internally inconsistent in that the description of gene functions for genes with partial EC numbers is inconsistent with the activities implied by reactions to which the genes were assigned. We infer that these inconsistencies result from the processing used to match gene products to reactions within KEGG's metabolic pathways. These errors affect scientists who use these DBs as online encyclopedias and they affect bioinformaticists who use these DBs to train and validate newly developed algorithms.
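The faulty inference described in the abstract can be illustrated with a minimal Python sketch. It is not the processing used by KEGG or any other database, which the abstract only infers. The gene and reaction identifiers (`geneA`, `rxn1`, and so on) and both function names are hypothetical, chosen only to contrast matching on identical EC strings with a more cautious rule that declines to assign genes carrying a partial EC number.

```python
def is_partial_ec(ec: str) -> bool:
    """Return True if any of the four EC fields is the '-' placeholder."""
    return "-" in ec.split(".")


def naive_assign(gene_ec: dict, reaction_ec: dict) -> dict:
    """Faulty rule: link a gene to every reaction with an identical EC string.

    A partial EC number such as '1.1.1.-' only says that the exact activity
    is unspecified, so string equality between two partial EC numbers does
    not show that the gene catalyzes the reaction.
    """
    return {
        gene: sorted(r for r, rec in reaction_ec.items() if rec == ec)
        for gene, ec in gene_ec.items()
    }


def cautious_assign(gene_ec: dict, reaction_ec: dict) -> dict:
    """Cautious rule (an assumption of this sketch): match only complete EC numbers."""
    return {
        gene: []
        if is_partial_ec(ec)
        else sorted(r for r, rec in reaction_ec.items() if rec == ec)
        for gene, ec in gene_ec.items()
    }


# Hypothetical toy data: three reactions share the partial EC 1.1.1.-.
reaction_ec = {
    "rxn1": "1.1.1.-",
    "rxn2": "1.1.1.-",
    "rxn3": "1.1.1.-",
    "rxn4": "1.1.1.1",
}
gene_ec = {"geneA": "1.1.1.-", "geneB": "1.1.1.1"}

naive = naive_assign(gene_ec, reaction_ec)
cautious = cautious_assign(gene_ec, reaction_ec)
print("naive:", naive)
print("cautious:", cautious)
```

Under the naive rule, `geneA` is linked to all three reactions annotated with 1.1.1.-, mirroring the over-assignment the abstract reports; under the cautious rule it is linked to none, while the fully specified `geneB` is handled identically by both.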

Citing Articles

Proteomic profiling of zinc homeostasis mechanisms in through data-dependent and data-independent acquisition mass spectrometry.

Meyer A, Meyer A, McIlvin M, Lopez P, Searle B, Saito M bioRxiv. 2025.

PMID: 39868216 PMC: 11761036. DOI: 10.1101/2025.01.13.632865.

Detecting anomalous proteins using deep representations.

Michael-Pitschaze T, Cohen N, Ofer D, Hoshen Y, Linial M NAR Genom Bioinform. 2024; 6(1):lqae021.

PMID: 38486884 PMC: 10939404. DOI: 10.1093/nargab/lqae021.

Evidential deep learning for trustworthy prediction of enzyme commission number.

Han S, Park M, Kosaraju S, Lee J, Lee H, Lee J Brief Bioinform. 2023; 25(1).

PMID: 37991247 PMC: 10664415. DOI: 10.1093/bib/bbad401.

Bactabolize is a tool for high-throughput generation of bacterial strain-specific metabolic models.

Vezina B, Watts S, Hawkey J, Cooper H, Judd L, Jenney A Elife. 2023; 12.

PMID: 37815531 PMC: 10564454. DOI: 10.7554/eLife.87406.

Transcriptional Landscape of Ectomycorrhizal Fungi and Their Host Provides Insight into N Uptake from Forest Soil.

Perez C, Janz D, Schneider D, Daniel R, Polle A mSystems. 2022; 7(1):e0095721.

PMID: 35089084 PMC: 8725588. DOI: 10.1128/mSystems.00957-21.
