
Genotype Prediction of 336,463 Samples from Public Expression Data

Overview
Journal bioRxiv
Date 2024 Apr 1
PMID 38559266
Abstract

Tens of thousands of RNA-sequencing experiments comprising hundreds of thousands of individual samples have now been performed. These data represent a broad range of experimental conditions, sequencing technologies, and hypotheses under study. The Recount project has aggregated and uniformly processed hundreds of thousands of publicly available RNA-seq samples. Most of these samples include only RNA expression measurements; genotype data for the same samples would enable a wide range of analyses, including variant prioritization, eQTL analysis, and studies of allele-specific expression. Here, we developed a statistical model based on the existing reference and alternative read counts from the RNA-seq experiments available through Recount3 to predict genotypes at autosomal biallelic loci in coding regions. We demonstrate the accuracy of our model using large-scale studies that measured both gene expression and genotypes genome-wide. We show that our predictive model is highly accurate, with 99.5% overall accuracy, 99.6% major allele accuracy, and 90.4% minor allele accuracy. Our model is robust to tissue and study effects, provided the coverage is high enough. We applied this model to genotype all the samples in Recount3 and provide the largest ready-to-use expression repository containing genotype information. We illustrate that genotypes predicted from RNA-seq data are sufficient to reveal the underlying population structure of samples in Recount3 using principal component analysis.
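
The abstract does not specify the form of the statistical model, so the following is only a minimal sketch of the general idea of calling a genotype at a biallelic locus from reference and alternative read counts. It uses a plain binomial likelihood, which is a stand-in and not the authors' model. The names `ERROR_RATE`, `ALT_PROB`, `genotype_log_likelihoods`, `call_genotype`, and the `min_coverage` threshold are all hypothetical, and the assumption that heterozygous sites yield alternative reads half the time ignores allele-specific expression.

```python
from math import comb, log

# Assumed per-read error rate; illustrative value, not taken from the paper.
ERROR_RATE = 0.01

# Assumed probability of observing an alternative read under each genotype:
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternative.
ALT_PROB = {0: ERROR_RATE, 1: 0.5, 2: 1.0 - ERROR_RATE}


def genotype_log_likelihoods(ref_count, alt_count):
    """Binomial log-likelihood of the observed counts under each genotype."""
    total = ref_count + alt_count
    log_binom = log(comb(total, alt_count))
    return {
        genotype: log_binom + alt_count * log(p) + ref_count * log(1.0 - p)
        for genotype, p in ALT_PROB.items()
    }


def call_genotype(ref_count, alt_count, min_coverage=4):
    """Return the most likely genotype, or None when coverage is too low.

    min_coverage is a hypothetical cutoff reflecting the abstract's caveat
    that accuracy depends on sufficient coverage.
    """
    if ref_count + alt_count < min_coverage:
        return None
    log_likelihoods = genotype_log_likelihoods(ref_count, alt_count)
    return max(log_likelihoods, key=log_likelihoods.get)


if __name__ == "__main__":
    for ref, alt in [(20, 0), (9, 11), (0, 15), (1, 1)]:
        print(ref, alt, call_genotype(ref, alt))
```

A maximum-likelihood call like this treats every locus identically. A more realistic model would likely add genotype priors and account for coverage-dependent uncertainty, which this sketch leaves out.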
