
Biologically Meaningful Genome Interpretation Models to Address Data Underdetermination for the Leaf and Seed Ionome Prediction in Arabidopsis thaliana

Overview
Journal Sci Rep
Specialty Science
Date 2024 Jun 8
PMID 38851759
Abstract

Genome interpretation (GI) encompasses the computational attempts to model the relationship between genotype and phenotype, with the goal of understanding how the former leads to the latter. While traditional approaches have focused on sub-problems such as predicting the effect of single-nucleotide variants or finding genetic associations, recent advances in neural networks (NNs) have made it possible to develop end-to-end GI models that take genomic data as input and predict phenotypes as output. However, technical and modeling issues still need to be resolved for these models to be effective, including the widespread underdetermination of genomic datasets, which makes them unsuitable for training large, overfitting-prone NNs. Here we propose novel GI models to address this issue, exploring the use of two types of transfer learning approaches and proposing a novel Biologically Meaningful Sparse NN layer specifically designed for end-to-end GI. Our models predict the leaf and seed ionome in A. thaliana, obtaining results comparable to those of our previous over-parameterized model while reducing the number of parameters 8.8-fold. We also investigate how population stratification influences the evaluation of model performance, highlighting how it leads to (1) an instance of Simpson's paradox and (2) limitations in model generalization.
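
The stratification effect mentioned in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' data or evaluation procedure: the three subpopulations, their means, the noise levels, and the function names (`pearson`, `stratified_correlations`) are all assumptions made up for illustration. It builds a hypothetical predictor that recovers each subpopulation's mean phenotype but gets the individual-level deviations wrong, and then compares the correlation computed on the pooled samples with the correlation computed inside each subpopulation.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def stratified_correlations(n_per_group=200, seed=0):
    """Simulate three subpopulations; return pooled and within-group correlations.

    All numbers here are invented for illustration; nothing is taken from the paper.
    """
    rng = np.random.default_rng(seed)
    group_means = {"pop_A": 0.0, "pop_B": 3.0, "pop_C": 6.0}  # hypothetical strata

    observed, predicted, within = [], [], {}
    for name, mean in group_means.items():
        individual = rng.normal(0.0, 1.0, n_per_group)  # within-group variation
        y_true = mean + individual
        # The toy "model" recovers the subpopulation mean but gets the
        # individual-level deviation wrong (opposite sign, plus noise).
        y_pred = mean - 0.5 * individual + rng.normal(0.0, 0.5, n_per_group)
        within[name] = pearson(y_true, y_pred)
        observed.append(y_true)
        predicted.append(y_pred)

    pooled = pearson(np.concatenate(observed), np.concatenate(predicted))
    return pooled, within


if __name__ == "__main__":
    pooled, within = stratified_correlations()
    print(f"pooled r = {pooled:.2f}")
    for name, r in within.items():
        print(f"{name}: within-group r = {r:.2f}")
```

Under these assumed settings the pooled correlation is strongly positive while every within-group correlation is negative, which is the sign reversal that characterizes Simpson's paradox. A score computed on pooled, stratified samples can therefore reward a model for separating subpopulations rather than for predicting individual phenotypes.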

Citing Articles

DAGIP: alleviating cell-free DNA sequencing biases with optimal transport.

Passemiers A, Tuveri S, Jatsenko T, Vanderstichele A, Busschaert P, Coosemans A. Genome Biol. 2025; 26(1):49.

PMID: 40055826 PMC: 11887355. DOI: 10.1186/s13059-025-03511-y.
