
Ambiguity and Variability of Database and Software Names in Bioinformatics

Overview
Publisher Biomed Central
Date 2015 Jul 2
PMID 26131352
Citations 2
Abstract

Background: There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.

Results: Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46%, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63% (with precision of 74%) and 70% (with precision of 83%) for strict and lenient matching, respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.
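The abstract reports F-score and precision but not recall. As a rough illustration, recall can be recovered from the other two by rearranging F = 2PR / (P + R), and strict versus lenient matching can be sketched as span comparison. The sketch below assumes the balanced F1 measure and assumes that "lenient" means any character overlap between a gold and a predicted span; the abstract does not define either, and the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, assuming the balanced F-measure."""
    return f1 * precision / (2 * precision - f1)


def spans_match(gold: Span, pred: Span, lenient: bool = False) -> bool:
    """Strict: identical boundaries. Lenient (assumed): any overlap."""
    if lenient:
        return pred[0] < gold[1] and gold[0] < pred[1]
    return gold == pred


def prf(gold: List[Span], pred: List[Span], lenient: bool = False):
    """Return (precision, recall, F1) for predicted spans against gold spans."""
    matched_pred = sum(any(spans_match(g, p, lenient) for g in gold) for p in pred)
    matched_gold = sum(any(spans_match(g, p, lenient) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Recall implied by the rounded figures in the abstract (approximate only).
strict_recall = recall_from_f1(0.63, 0.74)
lenient_recall = recall_from_f1(0.70, 0.83)

# Toy spans (not data from the paper): one exact hit, one partial hit, one spurious.
gold_spans = [(0, 5), (10, 20)]
pred_spans = [(0, 5), (12, 20), (30, 35)]
strict_scores = prf(gold_spans, pred_spans)
lenient_scores = prf(gold_spans, pred_spans, lenient=True)
```

Because the published percentages are rounded, the implied recall values (roughly 55% for strict and 61% for lenient matching under these assumptions) are estimates rather than reported results.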

Conclusions: Our analyses show that identifying mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
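A minimal sketch of why a dictionary look-up struggles with ambiguity and variability is given below. It is not the baseline used in the paper; the dictionary entries, the example sentence, and the function name are hypothetical, and treating "Network" as a resource name is an assumption made purely for illustration.

```python
import re
from typing import Iterable, List, Tuple


def dictionary_lookup(text: str, names: Iterable[str],
                      ignore_case: bool = False) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for whole-token dictionary matches."""
    flags = re.IGNORECASE if ignore_case else 0
    hits = []
    for name in names:
        # Lookarounds stop a name from matching inside a longer token.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags):
            hits.append((m.start(), m.end(), m.group(0)))
    return sorted(hits)


# Hypothetical dictionary; "Network" stands in for a name that is also a common word.
names = {"BLAST", "Cytoscape", "Network"}
text = "We ran BLAST and Blast2GO, then drew the network with Cytoscape."

case_sensitive = [h[2] for h in dictionary_lookup(text, names)]
case_insensitive = [h[2] for h in dictionary_lookup(text, names, ignore_case=True)]
```

In this toy case, exact matching finds the two listed names but misses "Blast2GO", which stands in for a resource absent from the dictionary. Relaxing to case-insensitive matching still misses it and additionally flags the ordinary word "network", which is the kind of ambiguity that contextual information would be needed to resolve.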

Citing Articles

A Survey of Bioinformatics Database and Software Usage through Mining the Literature.

Duck G, Nenadic G, Filannino M, Brass A, Robertson D, Stevens R PLoS One. 2016; 11(6):e0157989.

PMID: 27331905 PMC: 4917176. DOI: 10.1371/journal.pone.0157989.


Ambiguity and variability of database and software names in bioinformatics.

Duck G, Kovacevic A, Robertson D, Stevens R, Nenadic G J Biomed Semantics. 2015; 6:29.

PMID: 26131352 PMC: 4485340. DOI: 10.1186/s13326-015-0026-0.
