Extracting and Integrating Data from Entire Electronic Health Records for Detecting Colorectal Cancer Cases

Overview

Journal AMIA Annu Symp Proc

Specialty Medical Informatics

Date 2011 Dec 24

PMID 22195222

Citations 45

Authors

Hua Xu

Zhenming Fu

Anushi Shah

Yukun Chen

Neeraja B Peterson

Qingxia Chen

Subramani Mani

Mia A Levy

Qi Dai

Josh C Denny

Abstract

Identifying a cohort of patients with a specific disease is an important step in clinical research based on electronic health records (EHRs). Informatics approaches that combine structured EHR data, such as billing records, with narrative text have demonstrated utility for this task. This paper describes an algorithm that combines machine learning and natural language processing to detect patients with colorectal cancer (CRC) from entire EHRs at Vanderbilt University Hospital. We developed a general case detection method that consists of two steps: 1) extraction of positive CRC concepts from all clinical notes (document-level concept identification); and 2) determination of CRC cases using aggregated information from both clinical narratives and structured billing data (patient-level case determination). For each step, we compared the performance of rule-based and machine-learning-based approaches. Using a manually reviewed data set of 300 possible CRC patients (150 for training and 150 for testing), we showed that our method achieved F-measures of 0.996 for document-level concept identification and 0.93 for patient-level case detection.
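The two-step design described in the abstract can be illustrated with a minimal rule-based sketch. It is not the authors' implementation: the term pattern, the negation cues, the ICD-9 code prefixes (153 for colon, 154.0 and 154.1 for rectosigmoid junction and rectum), the `min_notes` threshold, and all function names are assumptions chosen for illustration. The `f_measure` helper shows the balanced F-measure (harmonic mean of precision and recall), which is presumably the form of the metric reported above.

```python
import re

# Hypothetical CRC term pattern and negation cues; NOT taken from the paper.
CRC_TERMS = re.compile(
    r"\b(colorectal|colon|rectal)\s+(cancer|carcinoma|adenocarcinoma)\b", re.I
)
NEGATION_CUES = re.compile(r"\b(no|denies|negative for|without|rule out)\b", re.I)

# Assumed ICD-9 prefixes for colon and rectal malignancies (illustrative only).
CRC_CODE_PREFIXES = ("153", "154.0", "154.1")


def note_has_positive_crc(text):
    """Step 1 (document level): True if some sentence mentions CRC
    with no negation cue appearing before the mention."""
    for sentence in re.split(r"[.\n]", text):
        match = CRC_TERMS.search(sentence)
        if match and not NEGATION_CUES.search(sentence[: match.start()]):
            return True
    return False


def is_crc_case(notes, billing_codes, min_notes=2):
    """Step 2 (patient level): aggregate note-level evidence with billing codes.
    The decision rule here is an assumed example, not the published one."""
    positive_notes = sum(note_has_positive_crc(n) for n in notes)
    has_crc_code = any(c.startswith(CRC_CODE_PREFIXES) for c in billing_codes)
    return positive_notes >= min_notes or (positive_notes >= 1 and has_crc_code)


def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A machine-learning variant of either step would replace the hand-written rule with a classifier trained on the same kind of features (concept counts per note, presence of billing codes), which is the comparison the abstract describes.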

Citing Articles

Developing an Inpatient Electronic Medical Record Phenotype for Hospital-Acquired Pressure Injuries: Case Study Using Natural Language Processing Models.

Nurmambetova E, Pan J, Zhang Z, Wu G, Lee S, Southern D. JMIR AI. 2024; 2:e41264.

PMID: 38875552 PMC: 11041460. DOI: 10.2196/41264.

Opportunities and challenges of 5G network technology toward precision medicine.

Kang C, Lee T, Lim W, Yeo W. Clin Transl Sci. 2023; 16(11):2078-2094.

PMID: 37702288 PMC: 10651640. DOI: 10.1111/cts.13640.

Validation and Improvement of a Convolutional Neural Network to Predict the Involved Pathology in a Head and Neck Surgery Cohort.

Culie D, Schiappa R, Contu S, Scheller B, Villarme A, Dassonville O. Int J Environ Res Public Health. 2022; 19(19).

PMID: 36231500 PMC: 9564535. DOI: 10.3390/ijerph191912200.

Assessment of Electronic Health Record for Cancer Research and Patient Care Through a Scoping Review of Cancer Natural Language Processing.

Wang L, Fu S, Wen A, Ruan X, He H, Liu S. JCO Clin Cancer Inform. 2022; 6:e2200006.

PMID: 35917480 PMC: 9470142. DOI: 10.1200/CCI.22.00006.

Research and Application of Artificial Intelligence Based on Electronic Health Records of Patients With Cancer: Systematic Review.

Yang X, Mu D, Peng H, Li H, Wang Y, Wang P. JMIR Med Inform. 2022; 10(4):e33799.

PMID: 35442195 PMC: 9069295. DOI: 10.2196/33799.
