Maximizing the Reusability of Gene Expression Data by Predicting Missing Metadata

Overview

Journal PLoS Comput Biol

Specialty Biology

Date 2020 Nov 6

PMID 33156882

Citations 3

Authors

Pei-Yau Lung

Dongrui Zhong

Xiaodong Pang

Yan Li

Jinfeng Zhang

Abstract

Reusability is one of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data focuses on including high-quality metadata with the data. When necessary metadata are missing, most researchers consider the data useless. In this study, we developed a framework that predicts the missing metadata of gene expression datasets to maximize their reusability. We found that, when predicted data are used in other analyses, using all of the predicted data is not optimal; instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric, Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach maximized the reusability of data with missing values better than pipelines using commonly used metrics such as F1-score. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
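
The abstract names PCAP but does not give its formula, so the following is only a rough sketch under an assumed definition: rank cases by the model's confidence, then report the largest fraction of top-ranked cases whose accuracy still meets a target. The function name `pcap`, the `confidence` input, and the `target_accuracy` default are all hypothetical and may differ from what the paper actually uses.

```python
def pcap(y_true, y_pred, confidence, target_accuracy=0.95):
    """Assumed PCAP: largest fraction of cases, taken in order of
    decreasing confidence, whose accuracy is >= target_accuracy.
    This is an illustrative definition, not the paper's formula."""
    n = len(y_true)
    if n == 0:
        return 0.0
    # Rank case indices from most to least confident prediction.
    order = sorted(range(n), key=lambda i: -confidence[i])
    best_k = 0      # size of the largest prefix that meets the target
    n_correct = 0   # correct predictions within the current prefix
    for k, i in enumerate(order, start=1):
        n_correct += int(y_true[i] == y_pred[i])
        if n_correct / k >= target_accuracy:
            best_k = k
    return best_k / n


# Toy example: the fourth most confident prediction is wrong.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1]
conf = [0.9, 0.8, 0.7, 0.6, 0.5]
print(pcap(y_true, y_pred, conf, target_accuracy=0.9))
```

Under this assumed definition, a metric such as F1-score grades all predictions together, whereas this score rewards a model for having a large, confidently predicted subset that can be reused downstream.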

Citing Articles

The pursuit of genetic gain in agricultural crops through the application of machine-learning to genomic prediction.

Jones D, Fornarelli R, Derbyshire M, Gibberd M, Barker K, Hane J. Front Genet. 2023; 14:1186782.

PMID: 37614817 PMC: 10443705. DOI: 10.3389/fgene.2023.1186782.

Metadata retrieval from sequence databases with ffq.

Galvez-Merchan A, Min K, Pachter L, Booeshaghi A. Bioinformatics. 2023; 39(1).

PMID: 36610997 PMC: 9883619. DOI: 10.1093/bioinformatics/btac667.

Impact of Clinical Data Veracity on Cancer Genomic Research.

Mehta S, Wright D, Black M, Merrie A, Anjomshoaa A, Munro F. JNCI Cancer Spectr. 2022; 6(6).

PMID: 36255250 PMC: 9648686. DOI: 10.1093/jncics/pkac070.
