Sources of Systematic Error in Functional Annotation of Genomes: Domain Rearrangement, Non-orthologous Gene Displacement and Operon Disruption

Overview

Journal In Silico Biol

Publisher Sage Publications

Specialty Biology

Date 2001 Jul 27

PMID 11471243

Citations 81

Authors

M Y Galperin

E V Koonin

Abstract

Functional annotation of proteins encoded in newly sequenced genomes can be expected to meet two conflicting objectives: (i) provide as much information as possible, and (ii) avoid erroneous functional assignments and over-predictions. The continuing exponential growth of the number of sequenced genomes makes the quality of sequence annotation a critical factor in the efforts to utilize this new information. When dubious functional assignments are used as a basis for subsequent predictions, they tend to proliferate, leading to "database explosion". It is therefore important to identify the common factors that hamper functional annotation. As a first step towards that goal, we have compared the annotations of the Mycoplasma genitalium and Methanococcus jannaschii genomes produced in several independent studies. The most common causes of questionable predictions appear to be: (i) non-critical use of annotations from existing database entries; (ii) taking into account only the annotation of the best database hit; (iii) insufficient masking of low-complexity regions (e.g. non-globular domains) in protein sequences, resulting in spurious database hits obscuring relevant ones; (iv) ignoring the multi-domain organization of the query proteins and/or the database hits; (v) non-critical functional inferences on the basis of the functions of neighboring genes in an operon; (vi) non-orthologous gene displacement, i.e. involvement of structurally unrelated proteins in the same function. These observations suggest that case-by-case validation of functional annotation by expert biologists remains crucial for productive genome analysis.
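As an illustration of the errors that arise from relying on the best database hit and from ignoring multi-domain organization, the following is a minimal Python sketch of an alignment-coverage check. It is not taken from the paper: the `Hit` record, the function names and the 0.7 coverage threshold are all hypothetical, and a real pipeline would combine such a filter with domain databases and manual review. The idea is only that an annotation is a weaker candidate for transfer when the alignment spans a small part of the query or of the database protein, since the similarity may then be confined to a single shared domain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One database hit for a query protein (hypothetical record layout)."""
    subject_id: str
    annotation: str
    q_start: int  # alignment start on the query, 1-based inclusive
    q_end: int    # alignment end on the query, inclusive
    s_start: int  # alignment start on the subject, 1-based inclusive
    s_end: int    # alignment end on the subject, inclusive
    s_len: int    # full length of the subject protein


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence of the given length covered by [start, end]."""
    return (end - start + 1) / length


def is_full_length_match(hit: Hit, query_len: int, min_cov: float = 0.7) -> bool:
    """True if the alignment covers most of BOTH query and subject.

    The 0.7 default is an arbitrary illustrative threshold, not a value
    recommended by the source.
    """
    q_cov = coverage(hit.q_start, hit.q_end, query_len)
    s_cov = coverage(hit.s_start, hit.s_end, hit.s_len)
    return q_cov >= min_cov and s_cov >= min_cov


def transferable_annotations(hits: List[Hit], query_len: int,
                             min_cov: float = 0.7) -> List[str]:
    """Annotations from hits that pass the coverage check, in input order."""
    return [h.annotation for h in hits
            if is_full_length_match(h, query_len, min_cov)]


# Usage: a 300-residue query with one near full-length hit and one hit
# that matches only a single domain of a much longer database protein.
hits = [
    Hit("sp_A", "single-domain enzyme", 1, 290, 5, 295, 300),
    Hit("sp_B", "multi-domain fusion protein", 1, 100, 400, 500, 900),
]
candidates = transferable_annotations(hits, query_len=300)
```

In this sketch only the first hit would be kept as a candidate. The second aligns to a third of the query and a small part of a 900-residue protein, so its annotation would be set aside for closer inspection rather than transferred automatically.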
