
EASTR: Identifying and Eliminating Systematic Alignment Errors in Multi-exon Genes

Overview
Journal: Nat Commun
Specialty: Biology
Date: 2023 Nov 8
PMID: 37940654
Abstract

Accurate alignment of transcribed RNA to reference genomes is a critical step in the analysis of gene expression, which in turn has broad applications in biomedical research and in the basic sciences. We reveal that widely used splice-aware aligners, such as STAR and HISAT2, can introduce erroneous spliced alignments between repeated sequences, leading to the inclusion of falsely spliced transcripts in RNA-seq experiments. In some cases, the 'phantom' introns resulting from these errors make their way into widely used genome annotation databases. To address this issue, we present EASTR (Emending Alignments of Spliced Transcript Reads), a software tool that detects and removes falsely spliced alignments or transcripts from alignment and annotation files. EASTR improves the accuracy of spliced alignments across diverse species, including human, maize, and Arabidopsis thaliana, by detecting sequence similarity between intron-flanking regions. We demonstrate that applying EASTR before transcript assembly substantially reduces false positive introns, exons, and transcripts, improving the overall accuracy of assembled transcripts. Additionally, we show that EASTR's application to reference annotation databases can detect and correct likely cases of mis-annotated transcripts.
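
The idea of flagging an intron when the sequences around its two splice sites resemble each other can be illustrated with a small sketch. This is not EASTR's implementation, and the abstract does not specify how similarity is measured. As a stand-in, the example below compares fixed-size windows centered on the two splice sites using the Jaccard index of their k-mer sets. The function names, the window size `flank`, the k-mer length `k`, and the `threshold` are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: NOT EASTR's algorithm. A k-mer Jaccard index
# stands in for whatever similarity measure the real tool uses.

def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def window(genome, center, flank):
    """Return the sequence within `flank` bases on either side of `center`."""
    start = max(0, center - flank)
    end = min(len(genome), center + flank)
    return genome[start:end]


def flank_similarity(genome, intron_start, intron_end, flank=20, k=8):
    """Jaccard index of k-mer sets around the two splice sites (0.0 to 1.0)."""
    a = kmers(window(genome, intron_start, flank), k)
    b = kmers(window(genome, intron_end, flank), k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_suspicious(genome, intron_start, intron_end,
                  flank=20, k=8, threshold=0.5):
    """Flag an intron whose splice-site windows are highly similar.

    The threshold is an arbitrary value chosen for this example.
    """
    score = flank_similarity(genome, intron_start, intron_end, flank, k)
    return score >= threshold


# Toy genome (hypothetical): the same 40-base element appears twice,
# separated by a 60-base poly-A spacer.
repeat = "ACGTTGCAAGGCTTAGCCATGGATCCGATTACGGCTAAGT"
genome = "C" * 30 + repeat + "A" * 60 + repeat + "G" * 30

# An "intron" joining the centers of the two repeat copies is flagged.
print(is_suspicious(genome, 50, 150))
# An "intron" from the first repeat into the unrelated spacer is not.
print(is_suspicious(genome, 50, 100))
```

In this toy setting, a spliced alignment that jumps between the two repeat copies is exactly the kind of candidate that would deserve a closer look, while a junction between unrelated sequences is left alone. A real filter would work on alignment and annotation files and would need a more sensitive comparison than exact k-mer sharing.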

Citing Articles

Single-Cell RNA-Sequencing Analysis Provides Insights into IgA Nephropathy.

Xia M, Li Y, Liu Y, Dong Z, Liu H. Biomolecules. 2025; 15(2).

PMID: 40001494. PMC: 11853383. DOI: 10.3390/biom15020191.
