
Document Clustering of Clinical Narratives: a Systematic Study of Clinical Sublanguages

Overview
Date 2011 Dec 24
PMID 22195171
Citations 20
Abstract

It is widely believed that different clinical domains use their own sublanguage in clinical notes, complicating natural language processing, but this has never been demonstrated on a broad selection of note types. Starting from formal sublanguage theory, we constructed a feature space based on vocabulary and semantic types used in 17 different clinical domains by three author types (physicians, nurses, and social workers) in both the inpatient and outpatient settings. We supplied the resulting vectors to CLUTO, a robust clustering tool suitable for this high-dimensional space. Our results confirm that note types with a broad clinical scope, e.g., History & Physicals and Discharge Summaries, cluster together, while note types with a narrow clinical scope form surprisingly pure, disjoint sublanguages. A reasonable conclusion from this study is that any tool relying on term statistics or semantics trained on one clinical note type may not work well on any other.
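The general idea of turning notes into term vectors and clustering them can be illustrated with a minimal sketch. It is not the authors' pipeline: it does not use CLUTO, it omits the semantic-type features, and it substitutes a plain average-link agglomerative clustering over cosine similarity of TF-IDF vectors. The four toy notes, their labels, and all function names below are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (note type, text). Not data from the study.
notes = [
    ("radiology", "chest xray shows no acute infiltrate or effusion"),
    ("radiology", "ct abdomen shows no acute abnormality or effusion"),
    ("social_work", "patient lives with daughter and needs housing support"),
    ("social_work", "family requests housing support and transportation assistance"),
]
labels = [label for label, _ in notes]
docs = [text.split() for _, text in notes]

def tfidf_vectors(docs):
    """Return one sparse, L2-normalized TF-IDF vector (a dict) per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def agglomerate(vectors, k):
    """Average-link agglomerative clustering down to k clusters of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected
    return clusters

def purity(clusters, labels):
    """Fraction of documents that carry their cluster's majority label."""
    hits = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
               for c in clusters)
    return hits / len(labels)

vectors = tfidf_vectors(docs)
clusters = agglomerate(vectors, k=2)
score = purity(clusters, labels)
print(clusters, score)
```

Purity here is one simple way to quantify how cleanly clusters line up with note types; whether it matches the measure used in the study is not stated in the abstract.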

Citing Articles

Large Language Model Approach for Zero-Shot Information Extraction and Clustering of Japanese Radiology Reports: Algorithm Development and Validation.

Yamagishi Y, Nakamura Y, Hanaoka S, Abe O. JMIR Cancer. 2025; 11:e57275.

PMID: 39864093 PMC: 11867198. DOI: 10.2196/57275.


Contextual Variation of Clinical Notes induced by EHR Migration.

Miller K, Moon S, Fu S, Liu H. AMIA Annu Symp Proc. 2024; 2023:1155-1164.

PMID: 38222426 PMC: 10785835.


Generalization of finetuned transformer language models to new clinical contexts.

Xie K, Terman S, Gallagher R, Hill C, Davis K, Litt B. JAMIA Open. 2023; 6(3):ooad070.

PMID: 37600072 PMC: 10432353. DOI: 10.1093/jamiaopen/ooad070.


Development and evaluation of an interoperable natural language processing system for identifying pneumonia across clinical settings of care and institutions.

Chapman A, Peterson K, Rutter E, Nevers M, Zhang M, Ying J. JAMIA Open. 2023; 5(4):ooac114.

PMID: 36601365 PMC: 9801965. DOI: 10.1093/jamiaopen/ooac114.


Collecting specialty-related medical terms: Development and evaluation of a resource for Spanish.

Lopez-Ubeda P, Pomares-Quimbaya A, Diaz-Galiano M, Schulz S. BMC Med Inform Decis Mak. 2021; 21(1):145.

PMID: 33947365 PMC: 8094531. DOI: 10.1186/s12911-021-01495-w.

