LLM-AIx: An Open Source Pipeline for Information Extraction from Unstructured Medical Text Based on Privacy Preserving Large Language Models

Overview

Journal medRxiv

Date 2024 Sep 16

PMID 39281753

Authors

Isabella Catharina Wiest

Fabian Wolf

Marie-Elisabeth Lessmann

Marko van Treeck

Dyke Ferber

Jiefu Zhu

Heiko Boehme

Keno K Bressem

Hannes Ulrich

Matthias P Ebert

Jakob Nikolas Kather

Abstract

In clinical science and practice, text data such as clinical letters or procedure reports are stored in an unstructured way. Such data cannot be used directly for quantitative investigations, and manual review or structured information retrieval is time-consuming and costly. The capabilities of Large Language Models (LLMs) mark a paradigm shift in natural language processing and offer new possibilities for structured Information Extraction (IE) from medical free text. This protocol describes a workflow for LLM-based information extraction (LLM-AIx), enabling extraction of predefined entities from unstructured text using privacy-preserving LLMs. By converting unstructured clinical text into structured data, LLM-AIx addresses a critical barrier in clinical research and practice, where efficient extraction of information is essential for improving clinical decision-making, enhancing patient outcomes, and facilitating large-scale data analysis. The protocol consists of four main processing steps: 1) problem definition and data preparation, 2) data preprocessing, 3) LLM-based IE, and 4) output evaluation. LLM-AIx can be integrated on local hospital hardware without the need to transfer any patient data to external servers. As example tasks, we applied LLM-AIx to the anonymization of fictitious clinical letters from patients with pulmonary embolism. Additionally, we extracted symptoms and the laterality of the pulmonary embolism from these fictitious letters. We demonstrate troubleshooting for potential problems within the pipeline with an IE task on a real-world dataset, 100 pathology reports from The Cancer Genome Atlas Program (TCGA), for TNM stage extraction. LLM-AIx can be executed without any programming knowledge via an easy-to-use interface and in no more than a few minutes or hours, depending on the LLM selected.
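To make the four processing steps named in the abstract concrete, the following minimal Python sketch walks through them for the laterality example. It is not the LLM-AIx implementation: the schema, the prompt wording, and every function name (`preprocess`, `build_prompt`, `extract`, `accuracy`, `fake_llm`) are assumptions made for illustration, and `fake_llm` is a keyword-based stand-in for whatever locally hosted model would be used in practice.

```python
import json
from typing import Callable

# Step 1: problem definition. Hypothetical schema: one entity with a closed value set.
ALLOWED_LATERALITY = {"left", "right", "bilateral", "unknown"}
FALLBACK = {"laterality": "unknown"}


def preprocess(text: str) -> str:
    """Step 2: collapse whitespace and line breaks left over from document conversion."""
    return " ".join(text.split())


def build_prompt(report: str) -> str:
    """Assumed prompt wording; instructions and report are separated by a blank line."""
    options = ", ".join(sorted(ALLOWED_LATERALITY))
    return (
        "Extract the laterality of the pulmonary embolism from the letter below. "
        f'Answer only with JSON of the form {{"laterality": <one of: {options}>}}.'
        "\n\n" + report
    )


def extract(report: str, llm: Callable[[str], str]) -> dict:
    """Step 3: query the model, then validate its answer against the schema."""
    raw = llm(build_prompt(preprocess(report)))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK)  # model did not return valid JSON
    value = parsed.get("laterality") if isinstance(parsed, dict) else None
    if not isinstance(value, str) or value not in ALLOWED_LATERALITY:
        return dict(FALLBACK)  # valid JSON, but outside the schema
    return {"laterality": value}


def accuracy(predictions: list, ground_truth: list) -> float:
    """Step 4: fraction of predictions that match the manually annotated labels."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be non-empty and equal in length")
    hits = sum(p == g for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def fake_llm(prompt: str) -> str:
    """Stand-in for a local model (assumption): keyword rules on the report part only."""
    report = prompt.split("\n\n", 1)[1].lower()
    if "bilateral" in report or ("left" in report and "right" in report):
        label = "bilateral"
    elif "left" in report:
        label = "left"
    elif "right" in report:
        label = "right"
    else:
        return "I could not determine the laterality."  # deliberately not JSON
    return json.dumps({"laterality": label})


if __name__ == "__main__":
    letters = [
        "Pulmonary embolism in the   right\nlower lobe artery.",
        "Bilateral central pulmonary embolism.",
    ]
    predictions = [extract(letter, fake_llm)["laterality"] for letter in letters]
    print(predictions, accuracy(predictions, ["right", "bilateral"]))
```

Validating the model's answer against a closed value set, and falling back to "unknown" when the answer is malformed, keeps the structured output usable for the evaluation step even when the model does not follow the requested format.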
