Combining Phenotypic and Genomic Data to Improve Prediction of Binary Traits

Overview
Journal: J Appl Stat
Specialty: Public Health
Date: 2024 Jun 12
PMID: 38863802
Abstract

Plant breeders want to develop cultivars that outperform existing genotypes. Some characteristics of these cultivars (here 'main traits') are categorical and difficult to measure directly, so it is important to predict the main trait of newly developed genotypes accurately. In addition to marker data, breeding programs often have information on secondary traits (or 'phenotypes') that are easy to measure. Our goal is to improve prediction of main traits, while keeping the fitted relations interpretable, by combining the two data types using variable selection techniques. However, the genomic variables can overwhelm the much smaller set of secondary traits, so a standard technique may fail to select any phenotypic variables. We develop a new statistical technique that ensures appropriate representation from both the secondary traits and the genotypic variables for optimal prediction. When two data types (markers and secondary traits) are available, we achieve improved prediction of a binary trait in two steps. The steps are designed so that a significant intrinsic effect of a phenotype enters the relation before the extra effects of genotypes are accounted for. First, we sparsely regress the secondary traits on the markers and replace the secondary traits by their residuals, which gives the effects of the phenotypic variables as adjusted by the genotypic variables. Then, we develop a sparse logistic classifier using the markers and residuals, in which the adjusted phenotypes may be selected first so that they are not overwhelmed by the genotypic variables due to their numerical advantage. This classifier uses forward selection aided by a penalty term and can be computed effectively by a technique called the one-pass method. It compares favorably with other classifiers on simulated and real data.
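The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the details of the forward selection or of the one-pass method, so the sketch substitutes a plain lasso (coordinate descent) for step 1 and an L1-penalized logistic regression (proximal gradient) for step 2, with a lighter penalty on the residual columns as a rough stand-in for letting the adjusted phenotypes "be selected first". The simulated data, the penalty values, and the function names `lasso_cd` and `sparse_logistic` are all assumptions made for illustration.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent: min (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                              # running residual y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (b[j] - new)       # keep the residual in sync
            b[j] = new
    return b

def sparse_logistic(Z, y, lam, n_iter=2000):
    """L1 logistic regression by proximal gradient; lam is a per-column penalty vector."""
    n, d = Z.shape
    w, b0 = np.zeros(d), 0.0
    # Step size 1/L, with L an upper bound on the curvature of the mean log-loss.
    step = 4.0 * n / (np.linalg.norm(Z, 2) ** 2 + n)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Z @ w + b0)))
        grad = Z.T @ (prob - y) / n
        b0 -= step * np.mean(prob - y)        # unpenalized intercept
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w, b0

# Simulated data (hypothetical): p markers, q secondary traits, binary main trait y.
rng = np.random.default_rng(0)
n, p, q = 300, 40, 2
M = rng.standard_normal((n, p))                        # markers
S = np.empty((n, q))
S[:, 0] = 0.8 * M[:, 0] + rng.standard_normal(n)       # trait partly driven by marker 0
S[:, 1] = rng.standard_normal(n)                       # trait unrelated to the markers
eta = 1.5 * S[:, 0] + 1.0 * M[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

# Step 1: sparsely regress each secondary trait on the markers, keep the residuals.
Sc = S - S.mean(axis=0)
B = np.column_stack([lasso_cd(M, Sc[:, k], lam=0.1) for k in range(q)])
R = Sc - M @ B                                         # marker-adjusted phenotypes

# Step 2: sparse logistic classifier on [residuals, markers].
# Assumption: a smaller penalty on the residual columns favors the adjusted phenotypes.
Z = np.hstack([R, M])
penalty = np.r_[np.full(q, 0.01), np.full(p, 0.05)]
w, b0 = sparse_logistic(Z, y, penalty)

print("selected adjusted traits:", np.flatnonzero(w[:q]))
print("selected markers:", np.flatnonzero(w[q:]))
```

The differential penalty is only one way to give the small set of phenotypic variables a chance against the much larger set of markers; the paper's forward selection with a penalty term may behave differently, particularly in how it orders variable entry.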

Citing Articles

Manthena V, Jarquin D, Howard R. Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification. Front Genet. 2023; 13:1032691. PMID: 37065625. PMC: 10090538. DOI: 10.3389/fgene.2022.1032691.
