Identifying Multi-resolution Clusters of Diseases in Ten Million Patients with Multimorbidity in Primary Care in England
Background: Identifying clusters of diseases may aid understanding of shared aetiology, management of co-morbidities, and the discovery of new disease associations. Our study aims to identify disease clusters using a large set of long-term conditions, and to compare methods that use the co-occurrence of diseases with methods that use the sequence in which diseases develop in a person over time.
Methods: We use electronic health records from over ten million people with multimorbidity registered to primary care in England. First, we extract data-driven representations of 212 diseases from patient records employing (i) co-occurrence-based methods and (ii) sequence-based natural language processing methods. Second, we apply the graph-based Markov Multiscale Community Detection (MMCD) to identify clusters based on disease similarity at multiple resolutions. We evaluate the representations and clusters using a clinically curated set of 253 known disease association pairs, and qualitatively assess the interpretability of the clusters.
Results: Both co-occurrence and sequence-based algorithms generate interpretable disease representations, with the best performance from the skip-gram algorithm. MMCD outperforms k-means and hierarchical clustering in explaining known disease associations. We find that diseases display an almost-hierarchical structure across resolutions, ranging from closely similar to more loosely similar co-occurrence patterns, and we identify interpretable clusters corresponding to both established and novel patterns.
Conclusions: Our method provides a tool for clustering diseases at different levels of resolution from co-occurrence patterns in high-dimensional electronic health records, which could be used to facilitate discovery of associations between diseases in the future.
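The co-occurrence step described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the patient records, disease names, and helper functions below are hypothetical, and plain co-occurrence counts with cosine similarity stand in for the representation methods compared in the study. The clustering step (MMCD) is not shown.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy records: each patient is a set of long-term conditions.
patients = [
    {"hypertension", "type 2 diabetes", "chronic kidney disease"},
    {"hypertension", "type 2 diabetes"},
    {"asthma", "eczema"},
    {"asthma", "eczema", "hay fever"},
    {"hypertension", "chronic kidney disease"},
]


def cooccurrence_vectors(records):
    """Map each disease to a Counter of the diseases it co-occurs with."""
    vectors = {}
    for conditions in records:
        # Count every unordered pair of diseases within one patient once.
        for a, b in combinations(sorted(conditions), 2):
            vectors.setdefault(a, Counter())[b] += 1
            vectors.setdefault(b, Counter())[a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(patients)

# Diseases that share co-occurrence partners score higher than those that do not.
sim_related = cosine(vectors["type 2 diabetes"], vectors["chronic kidney disease"])
sim_unrelated = cosine(vectors["type 2 diabetes"], vectors["asthma"])
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

A pairwise similarity matrix built this way is the kind of input a graph-based clustering method could take. Sequence-based representations such as skip-gram would instead be learned from the order in which diagnoses appear in each record.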