Predicting Biomedical Metadata in CEDAR: A Study of Gene Expression Omnibus (GEO)

Overview
Journal J Biomed Inform
Publisher Elsevier
Date 2017 Jun 20
PMID 28625880
Citations 8
Abstract

A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework that learns associations from existing metadata and uses them to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the dimensionality of these elements, measured as their number of unique values, decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
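The general idea of predicting one metadata element from the others, and comparing the result with a majority vote baseline, can be illustrated with a minimal sketch. It is not the PART, Apriori, Predictive Apriori, or Decision Table implementation evaluated in the study; it is a simplified single-antecedent rule miner written for illustration. The records, field names, thresholds, and function names below are all assumptions and are not taken from GEO.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; the values are illustrative, not real GEO data.
RECORDS = [
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "genomic DNA", "label": "Cy5"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "Cy3"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "Cy3"},
    {"organism": "Mus musculus", "molecule": "genomic DNA", "label": "Cy5"},
]


def mine_rules(records, target, min_support=2, min_confidence=0.6):
    """Mine rules of the form (field = value) -> (target = predicted).

    Returns {(field, value): (predicted, support, confidence)}, keeping only
    rules that meet the (assumed) support and confidence thresholds.
    """
    counts = defaultdict(Counter)
    for rec in records:
        for field, value in rec.items():
            if field != target:
                counts[(field, value)][rec[target]] += 1

    rules = {}
    for antecedent, targets in counts.items():
        predicted, support = targets.most_common(1)[0]
        confidence = support / sum(targets.values())
        if support >= min_support and confidence >= min_confidence:
            rules[antecedent] = (predicted, support, confidence)
    return rules


def majority_class(records, target):
    """Most frequent target value: the majority vote baseline."""
    return Counter(rec[target] for rec in records).most_common(1)[0][0]


def predict(rules, record, target, default):
    """Apply the highest-confidence matching rule, else fall back to default."""
    matches = [
        rules[(field, value)]
        for field, value in record.items()
        if field != target and (field, value) in rules
    ]
    if not matches:
        return default
    # Prefer higher confidence, then higher support.
    return max(matches, key=lambda rule: (rule[2], rule[1]))[0]


def accuracy(rules, records, target, default):
    """Fraction of records whose target value is predicted correctly."""
    hits = sum(predict(rules, rec, target, default) == rec[target] for rec in records)
    return hits / len(records)


rules = mine_rules(RECORDS, target="label")
baseline = majority_class(RECORDS, target="label")
print(accuracy(rules, RECORDS, "label", baseline))  # rule-based prediction
print(accuracy({}, RECORDS, "label", baseline))     # majority vote only
```

For brevity the sketch scores predictions on the same records used to mine the rules and reports accuracy only. A realistic evaluation would hold out test data and also report per-class precision, recall, and F-measure, as the abstract describes.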

Citing Articles

The Effect of Vitamin D Deficiency on Immune-Related Hub Genes: A Network Analysis Associated With Type 1 Diabetes.

Hussein S, Bandarian F, Salehi N, Mosadegh Khah A, Motevaseli E, Azizi Z Cureus. 2024; 16(9):e68611.

PMID: 39371824 PMC: 11452324. DOI: 10.7759/cureus.68611.


Unraveling the roles of gene and immune-metabolic pathways in psoriasis: a bioinformatics exploration for diagnostic markers and therapeutic targets.

Chen G, Chen X, Duan X, Zhang R, Bai C Front Mol Biosci. 2024; 11:1439837.

PMID: 39239353 PMC: 11374644. DOI: 10.3389/fmolb.2024.1439837.


Systematic tissue annotations of genomics samples by modeling unstructured metadata.

Hawkins N, Maldaver M, Yannakopoulos A, Guare L, Krishnan A Nat Commun. 2022; 13(1):6736.

PMID: 36347858 PMC: 9643451. DOI: 10.1038/s41467-022-34435-x.


Maximizing the reusability of gene expression data by predicting missing metadata.

Lung P, Zhong D, Pang X, Li Y, Zhang J PLoS Comput Biol. 2020; 16(11):e1007450.

PMID: 33156882 PMC: 7673503. DOI: 10.1371/journal.pcbi.1007450.


Applying citizen science to gene, drug and disease relationship extraction from biomedical abstracts.

Tsueng G, Nanis M, Fouquier J, Mayers M, Good B, Su A Bioinformatics. 2019; 36(4):1226-1233.

PMID: 31504205 PMC: 8104067. DOI: 10.1093/bioinformatics/btz678.

