DeepCorr: a Novel Error Correction Method for 3GS Long Reads Based on Deep Learning

Overview
Date 2024 Dec 16
PMID 39678285
Abstract

Long reads generated by third-generation sequencing (3GS) technologies are used in many biological analyses, where their ultra-long read length plays a vital role. However, their high error rate impairs downstream analysis. This work proposes DeepCorr, a novel deep learning-based error correction algorithm for data from both PacBio and ONT platforms. The core algorithm uses a recurrent neural network to capture long-term dependencies in the long reads, recasting long-read error correction as a multi-class classification task. It first aligns high-precision short reads to the long reads to generate feature vectors and labels, then feeds these vectors to the neural network, and finally trains the model for prediction and error correction. DeepCorr produces untrimmed corrected long reads and improves alignment identity while preserving the length advantage. By exploiting these dependencies, it can also polish bases that are not covered by any short-read alignment. On real-world PacBio and ONT benchmark data sets, DeepCorr outperforms state-of-the-art error correction methods while consuming fewer computing resources. It is a comprehensive deep learning-based tool for accurate long-read correction.
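
The abstract does not specify how the feature vectors and labels are encoded, so the following Python sketch is only one plausible reading of the "align short reads, then derive per-base features and labels" step, not DeepCorr's actual implementation. The names `CLASSES`, `pileup_features` and `apply_labels`, the five-class label set, and the simplified alignment format (a start offset plus a sequence already laid out in long-read coordinates, with `-` for a deleted base) are all assumptions made for illustration. Insertions are ignored, and no neural network is included: positions with no short-read coverage simply get no label here, whereas the abstract says DeepCorr predicts those from the learned dependencies.

```python
from collections import Counter

# Hypothetical label set: four bases plus '-' for "delete this long-read base".
CLASSES = ["A", "C", "G", "T", "-"]


def pileup_features(long_read, alignments):
    """Build one feature vector and one label per long-read position.

    alignments: list of (start, seq) pairs, where seq is a short read already
    expressed in long-read coordinates (assumed format, not a real aligner's).
    """
    columns = [Counter() for _ in long_read]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(long_read):
                columns[pos][base] += 1

    features, labels = [], []
    for base, col in zip(long_read, columns):
        depth = sum(col[c] for c in CLASSES)
        # One-hot encoding of the long-read base, then short-read frequencies.
        one_hot = [1.0 if base == c else 0.0 for c in CLASSES]
        freqs = [col[c] / depth if depth else 0.0 for c in CLASSES]
        features.append(one_hot + freqs)
        # Majority vote over the short reads; None where nothing aligned.
        labels.append(max(CLASSES, key=lambda c: col[c]) if depth else None)
    return features, labels


def apply_labels(long_read, labels):
    """Rewrite the long read from per-position class labels."""
    out = []
    for base, label in zip(long_read, labels):
        if label is None:
            out.append(base)   # uncovered: keep the original base
        elif label != "-":
            out.append(label)  # substitution or confirmation
        # label == '-': drop the base (deletion)
    return "".join(out)


long_read = "ACGTTACG"
alignments = [(0, "ACCT"), (1, "CCTT"), (2, "CTTA")]
features, labels = pileup_features(long_read, alignments)
print(apply_labels(long_read, labels))  # prints ACCTTACG
```

In a classifier-based corrector, vectors like `features` would be the per-position inputs to the sequence model and `labels` the training targets, so that the trained model can then assign a class to every position, including the uncovered ones that this sketch leaves unchanged.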
