Automatic Recognition of Conceptualization Zones in Scientific Articles and Two Life Science Applications

Overview

Journal: Bioinformatics

Publisher: Oxford University Press

Specialty: Biology

Date: 2012 Feb 11

PMID: 22321698

Citations: 15

Authors

Maria Liakata

Shyamasree Saha

Simon Dobnik

Colin Batchelor

Dietrich Rebholz-Schuhmann

Abstract

Motivation: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication.

Results: We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard and have achieved promising accuracies, with 'Experiment', 'Background' and 'Model' being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation from both a sentence classification and a sequence labelling perspective, and we present a detailed feature evaluation. The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies, while features encoding the document structure, such as section headings, also play an important role for some of the categories. We discuss the usefulness of automatically generated CoreSCs in two biomedical applications as well as work in progress.
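The feature types and the per-category F1-scores mentioned in the Results can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sentence_features` and `f1_per_category`, the feature-key prefixes and the toy labels are all assumptions made for illustration, and grammatical dependencies are left out because they would require a parser.

```python
from collections import Counter


def sentence_features(tokens, section_heading):
    """Build a sparse feature bag from unigrams, bigrams and the section heading.

    Hypothetical helper: the feature-key prefixes are illustrative only.
    """
    lowered = [t.lower() for t in tokens]
    feats = Counter()
    for tok in lowered:
        feats["uni=" + tok] += 1
    for left, right in zip(lowered, lowered[1:]):
        feats["bi=" + left + "_" + right] += 1
    # Document-structure feature: the heading of the enclosing section.
    feats["heading=" + section_heading.lower()] = 1
    return feats


def f1_per_category(gold, predicted):
    """Per-category F1 = 2PR / (P + R), computed from parallel label lists."""
    scores = {}
    for cat in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == cat and p == cat)
        fp = sum(1 for g, p in pairs if g != cat and p == cat)
        fn = sum(1 for g, p in pairs if g == cat and p != cat)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[cat] = 2 * precision * recall / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Toy data, invented for illustration; not drawn from the CoreSC corpus.
    print(sentence_features(["We", "measured", "the", "rate"], "Methods"))
    gold = ["Experiment", "Experiment", "Background", "Model"]
    pred = ["Experiment", "Background", "Background", "Model"]
    print(f1_per_category(gold, pred))
```

A feature bag of this kind could be passed to any sentence classifier, while a sequence labeller would additionally take into account the labels of neighbouring sentences.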

Availability: A web-based tool for the automatic annotation of articles with CoreSCs and corresponding documentation is available online at http://www.sapientaproject.com/software. The site http://www.sapientaproject.com also contains detailed information pertaining to CoreSC annotation and links to annotation guidelines as well as a corpus of manually annotated articles, which served as our training data.

Contact: liakata@ebi.ac.uk

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

Contexts and contradictions: a roadmap for computational drug repurposing with knowledge inference.

Sosa D, Altman R. Brief Bioinform. 2022; 23(4).

PMID: 35817308 PMC: 9294417. DOI: 10.1093/bib/bbac268.

Accelerating annotation of articles via automated approaches: evaluation of the neXtA5 curation-support tool by neXtProt.

Britan A, Cusin I, Hinard V, Mottin L, Pasche E, Gobeill J. Database (Oxford). 2018; 2018.

PMID: 30576492 PMC: 6301339. DOI: 10.1093/database/bay129.

Identification of research hypotheses and new knowledge from scientific literature.

Shardlow M, Batista-Navarro R, Thompson P, Nawaz R, McNaught J, Ananiadou S. BMC Med Inform Decis Mak. 2018; 18(1):46.

PMID: 29940927 PMC: 6019216. DOI: 10.1186/s12911-018-0639-1.

Authorship identification of documents with high content similarity.

Rexha A, Kroll M, Ziak H, Kern R. Scientometrics. 2018; 115(1):223-237.

PMID: 29527072 PMC: 5838116. DOI: 10.1007/s11192-018-2661-6.

Biomedical text mining for research rigor and integrity: tasks, challenges, directions.

Kilicoglu H. Brief Bioinform. 2017; 19(6):1400-1414.

PMID: 28633401 PMC: 6291799. DOI: 10.1093/bib/bbx057.

References

Kilicoglu H, Bergler S. Recognizing speculative language in biomedical research articles: a linguistically motivated perspective. BMC Bioinformatics. 2008; 9 Suppl 11:S10. PMC: 2586760. DOI: 10.1186/1471-2105-9-S11-S10.

Cohen A, Hersh W. A survey of current work in biomedical text mining. Brief Bioinform. 2005; 6(1):57-71. DOI: 10.1093/bib/6.1.57.

Shatkay H, Pan F, Rzhetsky A, Wilbur W. Multi-dimensional classification of biomedical text: toward automated, practical provision of high-utility text to diverse users. Bioinformatics. 2008; 24(18):2086-93. PMC: 2530883. DOI: 10.1093/bioinformatics/btn381.

Mizuta Y, Korhonen A, Mullen T, Collier N. Zone analysis in biology articles as a basis for information extraction. Int J Med Inform. 2005; 75(6):468-87. DOI: 10.1016/j.ijmedinf.2005.06.013.

Soldatova L, King R. An ontology of scientific experiments. J R Soc Interface. 2006; 3(11):795-803. PMC: 1885356. DOI: 10.1098/rsif.2006.0134.

McKnight L, Srinivasan P. Categorization of sentence types in medical abstracts. AMIA Annu Symp Proc. 2004; :440-4. PMC: 1479904.

Guo Y, Korhonen A, Liakata M, Silins I, Hogberg J, Stenius U. A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment. BMC Bioinformatics. 2011; 12:69. PMC: 3060841. DOI: 10.1186/1471-2105-12-69.

Ananiadou S, Pyysalo S, Tsujii J, Kell D. Event extraction for systems biology by text mining the literature. Trends Biotechnol. 2010; 28(7):381-90. DOI: 10.1016/j.tibtech.2010.04.005.

Thompson P, Nawaz R, McNaught J, Ananiadou S. Enriching a biomedical event corpus with meta-knowledge annotation. BMC Bioinformatics. 2011; 12:393. PMC: 3222636. DOI: 10.1186/1471-2105-12-393.

Rimell L, Clark S. Porting a lexicalized-grammar parser to the biomedical domain. J Biomed Inform. 2009; 42(5):852-65. DOI: 10.1016/j.jbi.2008.12.004.

Ciccarese P, Wu E, Wong G, Ocana M, Kinoshita J, Ruttenberg A. The SWAN biomedical discourse ontology. J Biomed Inform. 2008; 41(5):739-51. PMC: 4536833. DOI: 10.1016/j.jbi.2008.04.010.

Wilbur W, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics. 2006; 7:356. PMC: 1559725. DOI: 10.1186/1471-2105-7-356.

Ruch P, Boyer C, Chichester C, Tbahriti I, Geissbuhler A, Fabry P. Using argumentation to extract key sentences from biomedical abstracts. Int J Med Inform. 2006; 76(2-3):195-200. DOI: 10.1016/j.ijmedinf.2006.05.002.