Improving Natural Language Information Extraction from Cancer Pathology Reports Using Transfer Learning and Zero-shot String Similarity

Overview

Journal JAMIA Open

Date 2021 Oct 4

PMID 34604711

Citations 1

Authors

Briton Park

Nicholas Altieri

John DeNero

Anobel Y Odisho

Bin Yu


Abstract

Objective: We develop natural language processing (NLP) methods that accurately classify tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer-to-cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods exploit shared information between cancers and auxiliary class features, respectively, to boost performance. Both use enriched annotations, which give location-based information as well as document-level labels for each pathology report.

Materials and Methods: Our data consist of 250 pathology reports each for kidney, colon, and lung cancer, dated 2002 to 2019, from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity, trained on enriched annotations. We compare the HCTC and ZSS methods with state-of-the-art approaches, including conventional machine learning and deep learning methods.
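To make the idea of zero-shot classification by string similarity concrete, here is a minimal sketch. It is not the authors' ZSS method: the abstract does not specify the similarity measure, so this example assumes a character-level ratio from Python's `difflib`, and the function names, the sliding-window matching, and the sample report text and labels are all hypothetical. The point it illustrates is that a class name alone can act as a prior, so a class can be predicted without any labeled example of it.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (difflib ratio; an assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def best_window_score(text: str, label: str) -> float:
    """Best similarity between the label and any word window of the same length."""
    words = tokenize(text)
    label_words = tokenize(label)
    target = " ".join(label_words)
    n = len(label_words)
    if len(words) <= n:
        return similarity(" ".join(words), target)
    windows = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(similarity(w, target) for w in windows)


def zero_shot_classify(text: str, labels: list[str]) -> tuple[str, dict[str, float]]:
    """Pick the label whose name best matches some span of the text."""
    scores = {label: best_window_score(text, label) for label in labels}
    return max(scores, key=scores.get), scores


# Hypothetical report line and candidate histology labels.
report = "Kidney, left, radical nephrectomy: clear cell renal cell carcinoma, grade 2."
histologies = [
    "clear cell renal cell carcinoma",
    "papillary renal cell carcinoma",
    "chromophobe renal cell carcinoma",
]
prediction, scores = zero_shot_classify(report, histologies)
print(prediction)
```

In a fuller system, such similarity scores would more plausibly serve as features or priors combined with a trained classifier rather than as the final decision rule.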

Results: For our HCTC method, we see an improvement of up to 0.1 in micro-F1 and 0.04 in macro-F1, averaged across cancers and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 in micro-F1 and 0.23 in macro-F1, averaged across cancers and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations.
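Because the results are reported in both micro-F1 and macro-F1, a short sketch of how the two differ may help. The code below computes both for a single-label classification task using the standard definitions; the label values are made-up toy data, not figures from the study. Micro-F1 pools true positives, false positives, and false negatives over all classes, while macro-F1 averages the per-class F1 scores, so rare classes weigh more heavily in macro-F1.

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: list[str], y_pred: list[str]) -> tuple[float, float]:
    """Return (micro-F1, macro-F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example: a classifier that always predicts the common class.
y_true = ["low", "low", "low", "low", "high"]
y_pred = ["low", "low", "low", "low", "low"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
# prints: micro-F1 = 0.800, macro-F1 = 0.444
```

In this toy case the missed rare class barely moves micro-F1 but halves macro-F1, which is why reporting both gives a fuller picture when some attribute values are uncommon.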

Conclusions: Methods based on transfer learning across cancers and on augmenting information extraction methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.

Citing Articles

TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models.

Kefeli J, Tatonetti N. Patterns (N Y). 2024; 5(3):100933.

PMID: 38487800 PMC: 10935496. DOI: 10.1016/j.patter.2024.100933.
