Combining Phenotypic and Genomic Data to Improve Prediction of Binary Traits

Overview

Journal J Appl Stat

Specialty Public Health

Date 2024 Jun 12

PMID 38863802

Authors

D Jarquin

A Roy

B Clarke

S Ghosal

Abstract

Plant breeders want to develop cultivars that outperform existing genotypes. Some characteristics of these cultivars (here, 'main traits') are categorical and difficult to measure directly, so it is important to predict the main trait of newly developed genotypes accurately. In addition to marker data, breeding programs often have information on secondary traits (or 'phenotypes') that are easy to measure. Our goal is to improve prediction of main traits, with interpretable relations, by combining the two data types using variable selection techniques. However, the genomic variables can overwhelm the set of secondary traits by sheer number, so a standard technique may fail to select any phenotypic variables. We develop a new statistical technique that ensures appropriate representation from both the secondary traits and the genotypic variables for optimal prediction. When both data types (markers and secondary traits) are available, we achieve improved prediction of a binary trait in two steps, which are designed to ensure that a significant intrinsic effect of a phenotype is incorporated in the relation before accounting for extra effects of genotypes. First, we sparsely regress the secondary traits on the markers and replace the secondary traits with their residuals, which gives the effects of the phenotypic variables as adjusted by the genotypic variables. Then, we develop a sparse logistic classifier using the markers and residuals, so that the adjusted phenotypes may be selected first and are not overwhelmed by the numerically dominant genotypic variables. This classifier uses forward selection aided by a penalty term and can be computed effectively by a technique called the one-pass method. It compares favorably with other classifiers on simulated and real data.
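The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. In particular, the one-pass method is not reproduced here: a plain greedy forward selection with a fixed per-variable penalty on the negative log-likelihood stands in for it. All function names, the tuning values `lam` and `penalty`, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent: min (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()              # running residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (b[j] - new)
            b[j] = new
    return b

def residualize(P, M, lam):
    """Step 1: sparsely regress each secondary trait on the markers, keep residuals."""
    Mc = M - M.mean(axis=0)
    R = np.empty(P.shape, dtype=float)
    for k in range(P.shape[1]):
        yk = P[:, k] - P[:, k].mean()
        R[:, k] = yk - Mc @ lasso_cd(Mc, yk, lam)
    return R

def fit_logistic(X, y, n_iter=25, ridge=1e-3):
    """Logistic regression by Newton steps; returns weights and negative log-likelihood."""
    Xb = np.column_stack([np.ones(len(y)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        grad = Xb.T @ (y - p) - ridge * w
        hess = (Xb * (p * (1 - p))[:, None]).T @ Xb + ridge * np.eye(len(w))
        w += np.linalg.solve(hess, grad)
    p = np.clip(1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30))), 1e-12, 1 - 1e-12)
    return w, -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def forward_select(X, y, candidates, selected, penalty):
    """Greedily add the candidate that most reduces the negative log-likelihood,
    stopping when the best reduction does not exceed the penalty."""
    cols = lambda idx: X[:, np.array(idx, dtype=int)]
    _, best = fit_logistic(cols(selected), y)
    remaining = [j for j in candidates if j not in selected]
    while remaining:
        scores = [fit_logistic(cols(selected + [j]), y)[1] for j in remaining]
        k = int(np.argmin(scores))
        if best - scores[k] <= penalty:
            break
        best = scores[k]
        selected = selected + [remaining.pop(k)]
    return selected

def two_step_classifier(P, M, y, lam=0.1, penalty=2.0):
    """Step 2: adjusted phenotypes get the first chance to enter, then markers."""
    R = residualize(P, M, lam)
    X = np.column_stack([R, M - M.mean(axis=0)])
    q = P.shape[1]
    sel = forward_select(X, y, range(q), [], penalty)               # phenotypes first
    sel = forward_select(X, y, range(q, X.shape[1]), sel, penalty)  # then markers
    w, _ = fit_logistic(X[:, np.array(sel, dtype=int)], y)
    return sel, w, R

# Simulated example (hypothetical data): 200 lines, 30 markers, 2 secondary traits.
rng = np.random.default_rng(0)
n, p = 200, 30
M = rng.integers(0, 3, size=(n, p)).astype(float)   # marker genotypes coded 0/1/2
e0 = rng.normal(size=n)                             # intrinsic part of trait 0
P = np.column_stack([0.8 * M[:, 0] + e0, rng.normal(size=n)])
logit = 2.0 * e0 + 1.5 * (M[:, 1] - 1.0)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

selected, weights, R = two_step_classifier(P, M, y)
print("selected columns (0-1 are adjusted phenotypes, 2+ are markers):", selected)
```

Searching the phenotype block before the marker block is one simple way to give the adjusted phenotypes priority; the penalty then decides whether any marker adds enough to be worth including.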

Citing Articles

Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification.

Manthena V, Jarquin D, Howard R. Front Genet. 2023; 13:1032691.

PMID: 37065625 PMC: 10090538. DOI: 10.3389/fgene.2022.1032691.
