
Data-poor Categorization and Passage Retrieval for Gene Ontology Annotation in Swiss-Prot

Overview
Publisher: BioMed Central
Specialty: Biology
Date: 2005 Jun 18
PMID: 15960836
Citations: 14
Abstract

Background: In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignment of a set of categories.

Methods: The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The text categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both on annotator judgements, as established by the competition, and on mean average precision measures computed using a curated sample of Swiss-Prot.
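
The abstract does not specify which distance the classifier uses, so the following is only a minimal sketch of sentence-level passage retrieval under an assumed measure: cosine similarity over bag-of-words counts. The function names (`tokenize`, `cosine_similarity`, `rank_sentences`) are hypothetical and are not taken from the authors' system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentences, go_term):
    """Return (score, sentence) pairs, most similar to the GO term first.

    Assumption: similarity is plain cosine over word counts; the original
    system may weight or normalize terms differently.
    """
    term_vec = Counter(tokenize(go_term))
    scored = [
        (cosine_similarity(Counter(tokenize(s)), term_vec), s)
        for s in sentences
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_sentences` with the sentences of an article and the label of the GO term from the triplet yields a ranked list whose first entry would be proposed as the justifying passage. The same scoring function, applied to every GO term against the whole article text, gives a rough analogue of the categorization task.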

Results: Our system achieved the best combination of recall and precision for both passage retrieval and text categorization, as evaluated by the official evaluators. However, text categorization results were far below those reported in other data-poor text categorization experiments. The top proposed term is relevant in less than 20% of cases, whereas with other biomedical controlled vocabularies, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibit effective confidence estimation capabilities.
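
The two figures of merit mentioned in the abstract, the relevance of the top proposed term and mean average precision, can be illustrated as follows. This is a sketch using the standard textbook definitions, with average precision normalized by the number of relevant items; the function names and any term identifiers passed to them are hypothetical, and the exact evaluation protocol of the study may differ.

```python
def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked item is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item occurs.

    Normalized by the total number of relevant items, so relevant items
    that are never retrieved lower the score.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Averaging `precision_at_1` over all test articles gives the proportion of cases in which the top proposed term is relevant, which is the quantity the Results compare across controlled vocabularies.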

Conclusion: From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Although largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists.

Citing Articles

Variomes: a high recall search engine to support the curation of genomic variants.

Pasche E, Mottaz A, Caucheteur D, Gobeill J, Michel P, Ruch P Bioinformatics. 2022; 38(9):2595-2601.

PMID: 35274687 PMC: 9048643. DOI: 10.1093/bioinformatics/btac146.


Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

Gobeill J, Pasche E, Vishnyakova D, Ruch P Database (Oxford). 2013; 2013:bat041.

PMID: 23842461 PMC: 3706742. DOI: 10.1093/database/bat041.


Using binary classification to prioritize and curate articles for the Comparative Toxicogenomics Database.

Vishnyakova D, Pasche E, Ruch P Database (Oxford). 2012; 2012:bas050.

PMID: 23221176 PMC: 3514750. DOI: 10.1093/database/bas050.


Comparing a Rule Based vs. Statistical System for Automatic Categorization of MEDLINE Documents According to Biomedical Specialty.

Humphrey S, Neveol A, Gobeil J, Ruch P, Darmoni S, Browne A J Am Soc Inf Sci Technol. 2009; 60(12):2530-2539.

PMID: 19956557 PMC: 2782854. DOI: 10.1002/asi.21170.


Automatic medical encoding with SNOMED categories.

Ruch P, Gobeill J, Lovis C, Geissbuhler A BMC Med Inform Decis Mak. 2008; 8 Suppl 1:S6.

PMID: 19007443 PMC: 2582793. DOI: 10.1186/1472-6947-8-S1-S6.

