Sources of Systematic Error in Functional Annotation of Genomes: Domain Rearrangement, Non-orthologous Gene Displacement and Operon Disruption
Functional annotation of proteins encoded in newly sequenced genomes can be expected to meet two conflicting objectives: (i) provide as much information as possible, and (ii) avoid erroneous functional assignments and over-predictions. The continuing exponential growth of the number of sequenced genomes makes the quality of sequence annotation a critical factor in the efforts to utilize this new information. When dubious functional assignments are used as a basis for subsequent predictions, they tend to proliferate, leading to "database explosion". It is therefore important to identify the common factors that hamper functional annotation. As a first step towards that goal, we have compared the annotations of the Mycoplasma genitalium and Methanococcus jannaschii genomes produced in several independent studies. The most common causes of questionable predictions appear to be: (i) non-critical use of annotations from existing database entries; (ii) taking into account only the annotation of the best database hit; (iii) insufficient masking of low-complexity regions (e.g. non-globular domains) in protein sequences, resulting in spurious database hits that obscure relevant ones; (iv) ignoring the multi-domain organization of the query proteins and/or the database hits; (v) non-critical functional inferences on the basis of the functions of neighboring genes in an operon; (vi) non-orthologous gene displacement, i.e. involvement of structurally unrelated proteins in the same function. These observations suggest that case-by-case validation of functional annotation by expert biologists remains crucial for productive genome analysis.
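To illustrate cause (ii), the following is a minimal sketch of one way to avoid relying on the best database hit alone: the annotation of the top hit is accepted only when a majority of the other high-scoring hits agree with it, and is otherwise flagged for manual review. This is not the procedure used in the study. The function name `consensus_annotation`, the `(annotation, bit_score)` input format, and the `top_n` and `min_agreement` thresholds are all assumptions introduced for illustration.

```python
from collections import Counter


def consensus_annotation(hits, top_n=5, min_agreement=0.6):
    """Return (annotation, status) for a query protein.

    hits: list of (annotation, bit_score) tuples from a database search
          (hypothetical input format, not tied to any specific tool).
    top_n: number of highest-scoring hits to consider (assumed value).
    min_agreement: fraction of those hits that must share the best hit's
                   annotation for it to be accepted (assumed value).
    """
    # Rank hits by score and keep only the strongest ones.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)[:top_n]
    if not ranked:
        return None, "no hits"

    # Count how often each annotation occurs among the top hits.
    counts = Counter(annotation for annotation, _ in ranked)
    majority_label, votes = counts.most_common(1)[0]
    agreement = votes / len(ranked)

    best_label = ranked[0][0]
    # Accept the best hit only if it is also the majority annotation.
    if majority_label == best_label and agreement >= min_agreement:
        return best_label, "supported"
    return best_label, "needs manual review"


if __name__ == "__main__":
    example_hits = [
        ("hypothetical protein", 300.0),
        ("ABC transporter", 290.0),
        ("ABC transporter", 280.0),
        ("ABC transporter", 270.0),
    ]
    print(consensus_annotation(example_hits))
```

In this example the best hit disagrees with the remaining top hits, so its annotation is not transferred automatically. A check of this kind only narrows the set of cases requiring attention; it does not replace validation by an expert.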