
STICI: Split-Transformer with Integrated Convolutions for Genotype Imputation

Overview
Journal: Nat Commun
Date: 2025 Jan 31
PMID: 39890780
Abstract

Despite advances in sequencing technologies, genome-scale datasets often contain missing bases and genomic segments, hindering downstream analyses. Genotype imputation addresses this issue and has been a cornerstone pre-processing step in genetic and genomic studies. Although various methods have been widely adopted for genotype imputation, it remains challenging to impute certain genomic regions and large structural variants. Here, we present a transformer-based framework, named STICI, for accurate genotype imputation. STICI models automatically learn genome-wide patterns of linkage disequilibrium, evidenced by much higher imputation accuracy in regions with highly linked variants. Our imputation results on the human 1000 Genomes Project and non-human genomes show that STICI can achieve high imputation accuracy comparable to the state-of-the-art genotype imputation methods, with the additional capability to impute multi-allelic variants and various types of genetic variants. STICI can be trained for any collection of genomes automatically using self-supervision. Moreover, STICI shows excellent performance without needing any special presuppositions about the underlying patterns in collections of non-human genomes, pointing to adaptability and applications of STICI to impute missing genotypes in any species.
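The abstract refers to linkage disequilibrium (LD) between variants. As a minimal illustration of that quantity only, and not of how STICI learns or uses it, the sketch below computes the standard pairwise r² statistic from phased haplotypes coded as 0 (reference allele) and 1 (alternate allele). The function name `ld_r2` and the toy haplotypes are hypothetical and not taken from the paper.

```python
def ld_r2(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic variants.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype. Illustrative helper; not part of STICI.
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and equal length")

    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # alt-allele frequency at variant A
    p_b = sum(hap_b) / n                                   # alt-allele frequency at variant B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the alt-alt haplotype

    d = p_ab - p_a * p_b                                   # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                                # undefined if either variant is monomorphic
    return d * d / denom


# Toy data: two perfectly linked variants and two independent ones.
linked = ld_r2([0, 0, 1, 1], [0, 0, 1, 1])
unlinked = ld_r2([0, 1, 0, 1], [0, 0, 1, 1])
```

Variants with r² near 1 carry nearly redundant information, which is why a missing genotype at one can often be inferred from the other.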

Citing Articles

STICI: Split-Transformer with integrated convolutions for genotype imputation.

Mowlaei M, Li C, Jamialahmadi O, Dias R, Chen J, Jamialahmadi B. Nat Commun. 2025; 16(1):1218.

PMID: 39890780 PMC: 11785734. DOI: 10.1038/s41467-025-56273-3.


GENA-LM: a family of open-source foundational DNA language models for long sequences.

Fishman V, Kuratov Y, Shmelev A, Petrov M, Penzar D, Shepelin D. Nucleic Acids Res. 2025; 53(2).

PMID: 39817513 PMC: 11734698. DOI: 10.1093/nar/gkae1310.

References
1. Weir B. Linkage disequilibrium and association mapping. Annu Rev Genomics Hum Genet. 2008; 9:129-42. DOI: 10.1146/annurev.genom.9.081307.164347.

2. Sudmant P, Rausch T, Gardner E, Handsaker R, Abyzov A, Huddleston J. An integrated map of structural variation in 2,504 human genomes. Nature. 2015; 526(7571):75-81. PMC: 4617611. DOI: 10.1038/nature15394.

3. Song M, Greenbaum J, Luttrell 4th J, Zhou W, Wu C, Luo Z. An autoencoder-based deep learning method for genotype imputation. Front Artif Intell. 2022; 5:1028978. PMC: 9671213. DOI: 10.3389/frai.2022.1028978.

4. Chang C, Chow C, Tellier L, Vattikuti S, Purcell S, Lee J. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015; 4:7. PMC: 4342193. DOI: 10.1186/s13742-015-0047-8.

5. Song M, Greenbaum J, Luttrell 4th J, Zhou W, Wu C, Shen H. A Review of Integrative Imputation for Multi-Omics Datasets. Front Genet. 2020; 11:570255. PMC: 7594632. DOI: 10.3389/fgene.2020.570255.