Precision Information Extraction for Rare Disease Epidemiology at Scale

Overview
Journal J Transl Med
Publisher Biomed Central
Date 2023 Mar 1
PMID 36855134
Abstract

Background: The United Nations recently made a call to address the challenges of an estimated 300 million persons worldwide living with a rare disease through the collection, analysis, and dissemination of disaggregated data. Epidemiologic Information (EI) on the prevalence and incidence of rare diseases is sparse, and current paradigms for identifying, extracting, and curating EI rely upon time-intensive, error-prone manual processes. These limitations hamper a clear understanding of the variation in epidemiology and outcomes for rare disease patients. This in turn challenges the public health of rare disease patients, because the information needed to prioritize research, policy decisions, therapeutic development, and health system allocations is lacking.

Methods: In this study, we developed a newly curated epidemiology corpus for Named Entity Recognition (NER), a deep learning framework, and a novel rare disease epidemiologic information pipeline named EpiPipeline4RD, consisting of a web interface and a RESTful API. For the corpus creation, we programmatically gathered a representative sample of rare disease epidemiologic abstracts, used weakly-supervised machine learning techniques to label the dataset, and manually validated the labeled dataset. For the deep learning framework development, we fine-tuned the BioBERT model on our dataset, adapting it for NER. We measured the performance of our BioBERT model for epidemiology entity recognition quantitatively with precision, recall, and F1, and qualitatively through a comparison with Orphanet. We demonstrated the ability of our pipeline to gather, identify, and extract epidemiology information from rare disease abstracts through three case studies.

Results: We developed a deep learning model to extract EI with overall F1 scores of 0.817 and 0.878, evaluated at the entity level and token level respectively, achieving qualitative results comparable to Orphanet's collection paradigm. Additionally, case studies of the rare diseases classic homocystinuria, GRACILE syndrome, and phenylketonuria demonstrated adequate recall of abstracts with epidemiology information, high precision of epidemiology information extraction through our deep learning model, and the increased efficiency of EpiPipeline4RD compared to a manual curation paradigm.
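
The gap between the entity-level and token-level F1 scores reported above comes from how partial matches are counted. The following is a minimal sketch, not the authors' evaluation code, of one common way to compute both metrics from BIO-tagged sequences. The label names (`STAT`, `LOC`), the strict exact-span matching rule, and the choice to compare token labels with the `B-`/`I-` prefix stripped are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end) with end exclusive


def label_of(tag: str) -> str:
    """Strip the B-/I- prefix; 'O' stays 'O'."""
    return tag if tag == "O" else tag.split("-", 1)[1]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO sequence.

    A span starts at a B- tag and continues over I- tags of the same label.
    A stray I- tag with no matching open span is ignored (strict reading).
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((label, start, i))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def prf(tp: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def entity_level(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """A predicted entity counts only if label and both boundaries match."""
    g, p = extract_spans(gold), extract_spans(pred)
    return prf(len(g & p), len(p), len(g))


def token_level(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Each non-O token is scored on its own, so partial overlaps earn credit."""
    gl = [label_of(t) for t in gold]
    pl = [label_of(t) for t in pred]
    tp = sum(1 for a, b in zip(gl, pl) if a != "O" and a == b)
    return prf(tp, sum(b != "O" for b in pl), sum(a != "O" for a in gl))


# Hypothetical example: the model truncates the first entity by one token.
gold = ["O", "B-STAT", "I-STAT", "I-STAT", "O", "B-LOC", "I-LOC"]
pred = ["O", "B-STAT", "I-STAT", "O", "O", "B-LOC", "I-LOC"]
print("entity-level P/R/F1:", entity_level(gold, pred))
print("token-level  P/R/F1:", token_level(gold, pred))
```

In this toy example the truncated entity counts as a complete miss at the entity level but still earns credit for two of its three tokens at the token level, which is why token-level scores typically run higher than entity-level scores for the same predictions.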

Conclusions: EpiPipeline4RD demonstrated high performance of EI extraction from rare disease literature to augment manual curation processes. This automated information curation paradigm will not only effectively empower development of the NIH Genetic and Rare Diseases Information Center (GARD), but also support the public health of the rare disease community.

Citing Articles

Natural language processing in dermatology: A systematic literature review and state of the art.

Paganelli A, Spadafora M, Navarrete-Dechent C, Guida S, Pellacani G, Longo C. J Eur Acad Dermatol Venereol. 2024; 38(12):2225-2234.

PMID: 39150311 PMC: 11587683. DOI: 10.1111/jdv.20286.


Strengths and limitations of new artificial intelligence tool for rare disease epidemiology.

Lapidus D. J Transl Med. 2023; 21(1):292.

PMID: 37122037 PMC: 10149020. DOI: 10.1186/s12967-023-04152-0.


Correction: Precision information extraction for rare disease epidemiology at scale.

Kariampuzha W, Alyea G, Qu S, Sanjak J, Mathe E, Sid E. J Transl Med. 2023; 21(1):291.

PMID: 37120603 PMC: 10149018. DOI: 10.1186/s12967-023-04127-1.
