Pre-trained Models, Data Augmentation, and Ensemble Learning for Biomedical Information Extraction and Document Classification

Overview

Journal: Database (Oxford)

Specialty: Biology

Date: 2022 Aug 13

PMID: 35962559

Authors

Arslan Erdengasileng

Qing Han

Tingting Zhao

Shubo Tian

Xin Sui

Keqiao Li

Wanjing Wang

Jian Wang

Ting Hu

Feng Pan

Yuan Zhang

Jinfeng Zhang

Abstract

Biomedical publications are being produced at an ever-increasing rate. To deal with this large volume of unstructured text, effective natural language processing (NLP) methods need to be developed for tasks such as document classification and information extraction. The BioCreative Challenge was established to evaluate the effectiveness of information extraction methods in the biomedical domain and to facilitate their development as a community-wide effort. In this paper, we summarize our work and what we learned from the latest round, BioCreative Challenge VII, in which we participated in all five tracks. Overall, we found three key components for achieving high performance across a variety of NLP tasks: (1) pre-trained NLP models, (2) data augmentation strategies and (3) ensemble modelling. These three strategies need to be tailored to the specific task at hand to achieve high-performing baseline models, which are usually good enough for practical applications. When further combined with task-specific methods, additional improvements (usually rather small) can be achieved, which might be critical for winning competitions. Database URL: https://doi.org/10.1093/database/baac066.
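The abstract does not say how the ensembles were built, so the following is only a minimal sketch of one common form of ensemble modelling, soft voting, in which the class probabilities from several classifiers are averaged before a label is chosen. The function name `soft_vote` and the three sets of model outputs are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def soft_vote(prob_sets: Sequence[Sequence[Sequence[float]]]) -> list[int]:
    """Average per-class probabilities across models; return the top label per document.

    prob_sets[m][d][c] is the probability model m assigns to class c for document d.
    This is an illustrative assumption, not the authors' documented method.
    """
    n_models = len(prob_sets)
    n_docs = len(prob_sets[0])
    labels = []
    for d in range(n_docs):
        n_classes = len(prob_sets[0][d])
        # Mean probability of each class over all models for this document.
        avg = [sum(m[d][c] for m in prob_sets) / n_models for c in range(n_classes)]
        # Pick the class index with the highest averaged probability.
        labels.append(max(range(n_classes), key=avg.__getitem__))
    return labels


# Hypothetical outputs of three fine-tuned classifiers on two documents.
model_a = [[0.9, 0.1], [0.4, 0.6]]
model_b = [[0.6, 0.4], [0.55, 0.45]]
model_c = [[0.7, 0.3], [0.3, 0.7]]

predictions = soft_vote([model_a, model_b, model_c])
print(predictions)
```

In the second document, `model_b` alone would pick class 0, but the averaged probabilities favour class 1. This shows how an ensemble can override a single model's prediction.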
