LLM-AIx: An Open Source Pipeline for Information Extraction from Unstructured Medical Text Based on Privacy Preserving Large Language Models

Overview
Journal medRxiv
Date 2024 Sep 16
PMID 39281753
Abstract

In clinical science and practice, text data such as clinical letters and procedure reports are stored in an unstructured way. Such data cannot be used directly for quantitative investigations, and manual review or structured information retrieval is time-consuming and costly. The capabilities of Large Language Models (LLMs) mark a paradigm shift in natural language processing and offer new possibilities for structured Information Extraction (IE) from medical free text. This protocol describes a workflow for LLM-based information extraction (LLM-AIx), which enables extraction of predefined entities from unstructured text using privacy-preserving LLMs. By converting unstructured clinical text into structured data, LLM-AIx addresses a critical barrier in clinical research and practice, where efficient extraction of information is essential for improving clinical decision-making, enhancing patient outcomes, and facilitating large-scale data analysis. The protocol consists of four main processing steps: 1) problem definition and data preparation, 2) data preprocessing, 3) LLM-based IE, and 4) output evaluation. LLM-AIx can be integrated on local hospital hardware without transferring any patient data to external servers. As example tasks, we applied LLM-AIx to the anonymization of fictitious clinical letters from patients with pulmonary embolism. Additionally, we extracted symptoms and laterality of the pulmonary embolism from these fictitious letters. We demonstrate troubleshooting for potential problems within the pipeline with an IE task on a real-world dataset, 100 pathology reports from The Cancer Genome Atlas (TCGA) program, for TNM stage extraction. LLM-AIx can be executed without any programming knowledge via an easy-to-use interface and runs in no more than a few minutes or hours, depending on the LLM selected.
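
The four steps named in the abstract can be illustrated with a minimal Python sketch. This is not the LLM-AIx implementation or its API: the schema, the function names (`preprocess`, `build_prompt`, `extract`, `accuracy`) and the canned `fake_llm` stand-in are all hypothetical, chosen only to show how predefined entities might be requested as JSON from a locally hosted model, validated, and compared against annotations.

```python
import json
from typing import Callable

# Step 1: problem definition. Hypothetical schema, not the one used by LLM-AIx.
SCHEMA = {"symptoms": list, "laterality": str}
ALLOWED_LATERALITY = {"left", "right", "bilateral", "unknown"}


def preprocess(text: str) -> str:
    """Step 2: collapse runs of whitespace so the prompt is compact."""
    return " ".join(text.split())


def build_prompt(text: str) -> str:
    """Ask for a JSON object with exactly the keys named in SCHEMA."""
    keys = ", ".join(SCHEMA)
    return f"Return a JSON object with the keys {keys}.\nReport:\n{text}"


def extract(text: str, llm: Callable[[str], str]) -> dict:
    """Step 3: query the model and validate its answer against SCHEMA."""
    data = json.loads(llm(build_prompt(preprocess(text))))
    if not isinstance(data, dict):
        raise ValueError("model did not return a JSON object")
    for key, expected_type in SCHEMA.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"field {key!r} missing or of wrong type")
    if data["laterality"] not in ALLOWED_LATERALITY:
        raise ValueError(f"unexpected laterality: {data['laterality']!r}")
    return data


def accuracy(predictions: list[dict], gold: list[dict], field: str) -> float:
    """Step 4: fraction of documents where the field matches the annotation."""
    hits = sum(p[field] == g[field] for p, g in zip(predictions, gold))
    return hits / len(gold)


def fake_llm(prompt: str) -> str:
    """Stand-in for a locally hosted model; always returns a canned answer."""
    return json.dumps({"symptoms": ["dyspnea"], "laterality": "right"})


# Fictitious letter and annotation, invented for this example.
letter = "Patient presents with dyspnea.\n  CT shows right-sided pulmonary embolism."
gold = [{"symptoms": ["dyspnea"], "laterality": "right"}]

result = extract(letter, fake_llm)
print(result)
print(accuracy([result], gold, "laterality"))
```

In a real deployment, `fake_llm` would be replaced by a call to a model running on local hardware, so that no patient text leaves the hospital network. Validating the parsed JSON before evaluation keeps malformed model output from silently entering the structured dataset.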
