Negative Example Selection for Protein Function Prediction: the NoGO Database

Overview
Specialty: Biology
Date: 2014 Jun 13
PMID: 24922051
Citations: 16
Abstract

Negative examples (genes that are known not to carry out a given protein function) are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, by several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for choosing negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, because the Open World Assumption, as applied to gene annotations, rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, using a temporal holdout and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text classification, which are the current state-of-the-art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms produce significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms for general use (the NoGO database, available at bonneaulab.bio.nyu.edu/nogo.html).
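
The abstract does not give the details of either algorithm, so the following is only a minimal sketch of the general idea behind scoring negatives with empirical conditional probabilities, followed by a temporal-holdout check. It is not the NoGO implementation. The function names, the mean-based score, and the toy gene and term identifiers are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations


def conditional_probabilities(annotations):
    """Estimate P(term b | term a) from co-annotation counts.

    annotations: dict mapping gene id -> set of term ids (toy data here).
    """
    term_count = defaultdict(int)
    pair_count = defaultdict(int)
    for terms in annotations.values():
        for t in terms:
            term_count[t] += 1
        for a, b in permutations(terms, 2):
            pair_count[(a, b)] += 1  # genes annotated with both a and b
    return {(a, b): n / term_count[a] for (a, b), n in pair_count.items()}


def rank_negative_candidates(annotations, target):
    """Rank genes lacking `target`, most plausible negative first.

    Assumed score: mean over the gene's known terms of P(target | term).
    A low score means the gene's annotations rarely co-occur with target.
    """
    cond = conditional_probabilities(annotations)
    scores = {}
    for gene, terms in annotations.items():
        if target in terms or not terms:
            continue  # skip known positives and unannotated genes
        total = sum(cond.get((t, target), 0.0) for t in terms)
        scores[gene] = total / len(terms)
    return sorted(scores, key=lambda g: (scores[g], g))


def holdout_errors(negatives, later_annotations, target):
    """Return chosen negatives that gained `target` in a later snapshot."""
    return [g for g in negatives if target in later_annotations.get(g, set())]


# Hypothetical annotation snapshot: gene -> terms.
earlier = {
    "g1": {"A", "T"},
    "g2": {"A", "T"},
    "g3": {"A"},
    "g4": {"B"},
    "g5": {"B", "C"},
}
ranked = rank_negative_candidates(earlier, "T")
print(ranked)  # ['g4', 'g5', 'g3']

# Hypothetical later snapshot in which g5 has acquired term T.
later = {"g5": {"B", "C", "T"}}
print(holdout_errors(ranked[:2], later, "T"))  # ['g5']
```

Under the Open World Assumption a missing annotation is not evidence of absence, so the holdout check can only count negatives that were later contradicted. It cannot confirm that the remaining negatives are correct.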

Citing Articles

Computational Methods for Prediction of Human Protein-Phenotype Associations: A Review.

Liu L, Zhu S. Phenomics. 2023; 1(4):171-185.

PMID: 36939789 PMC: 9590544. DOI: 10.1007/s43657-021-00019-w.


Defining the extent of gene function using ROC curvature.

Fischer S, Gillis J. Bioinformatics. 2022; 38(24):5390-5397.

PMID: 36271855 PMC: 9750128. DOI: 10.1093/bioinformatics/btac692.


Bioinformatic Analyses of Peroxiredoxins and RF-Prx: A Random Forest-Based Predictor and Classifier for Prxs.

Al-Barakati H, Newman R, Kc D, Poole L. Methods Mol Biol. 2022; 2499:155-176.

PMID: 35696080 PMC: 9844236. DOI: 10.1007/978-1-0716-2317-6_8.


A roadmap for metagenomic enzyme discovery.

Robinson S, Piel J, Sunagawa S. Nat Prod Rep. 2021; 38(11):1994-2023.

PMID: 34821235 PMC: 8597712. DOI: 10.1039/d1np00006c.


Automatic Gene Function Prediction in the 2020's.

Makrodimitris S, van Ham R, Reinders M. Genes (Basel). 2020; 11(11).

PMID: 33120976 PMC: 7692357. DOI: 10.3390/genes11111264.

