CARE 2.0: Reducing False-positive Sequencing Error Corrections Using Machine Learning

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2022 Jun 13

PMID 35698033

Authors

Felix Kallenborn

Julian Cascitti

Bertil Schmidt

Abstract

Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These miscorrections can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.
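To make the k-mer effect concrete, here is a minimal sketch of how a single false-positive correction changes every k-mer that overlaps the altered base. The two sequences, the value k = 4, and the helper name `kmers` are illustrative assumptions and are not taken from the paper.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

K = 4  # illustrative value; real analyses typically use a much larger k

# Hypothetical read that is already error-free.
true_read = "ACGTACGGTA"
# The same read after a hypothetical false-positive correction
# at index 5 (C changed to T).
miscorrected = "ACGTATGGTA"

before = kmers(true_read, K)
after = kmers(miscorrected, K)

# k-mers that exist only because of the wrong correction.
spurious = [km for km in after if km not in before]
# Genuine k-mers whose occurrence in this read was lost.
lost = [km for km in before if km not in after]

print(len(spurious), "spurious k-mers:", spurious)
print(len(lost), "lost k-mers:", lost)
```

In this toy case the changed base lies far enough from both read ends that exactly K k-mers are replaced. In general, one wrong base alters up to k k-mers, which is why even a small number of miscorrections can distort k-mer statistics.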

Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.
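The counts reported for the human dataset can be turned into a precision figure, TP / (TP + FP). The calculation below is a sketch that uses only the numbers quoted in the abstract. The abstract gives the competing tools as bounds (at least 3.9M FPs, at most 814.5M TPs), so combining them yields an upper bound on any competitor's precision rather than an exact value. The function and variable names are illustrative choices.

```python
def precision(tp, fp):
    """Fraction of performed corrections that were correct."""
    return tp / (tp + fp)

# Counts in millions, as quoted in the abstract.
CARE2_TP, CARE2_FP = 801.4, 1.2
OTHERS_MAX_TP, OTHERS_MIN_FP = 814.5, 3.9

care2_precision = precision(CARE2_TP, CARE2_FP)

# Precision grows with TP and shrinks with FP, so pairing the largest
# reported TP count with the smallest reported FP count bounds the
# precision of every competing tool from above.
others_precision_upper = precision(OTHERS_MAX_TP, OTHERS_MIN_FP)

# Lower bound on how many times more FPs the other tools produce.
fp_ratio_lower = OTHERS_MIN_FP / CARE2_FP

print(f"CARE 2.0 precision:         {care2_precision:.4%}")
print(f"Other tools, at most:       {others_precision_upper:.4%}")
print(f"FP ratio (others / CARE 2): at least {fp_ratio_lower:.2f}x")
```

On this particular dataset the other tools therefore make at least 3.25 times as many false-positive corrections as CARE 2.0, while CARE 2.0's true-positive count is about 98% of the highest count reported for them.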

Conclusion: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs, including BFC, Karect, Musket, Bcool, SGA, and Lighter. The resulting higher-quality datasets improve k-mer analysis and de-novo assembly on real-world data, which demonstrates the applicability of machine learning techniques to sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.

Citing Articles

Enhancing Clinical Applications by Evaluation of Sensitivity and Specificity in Whole Exome Sequencing.

Moon Y, Hong C, Kim Y, Kim J, Ye S, Kang E Int J Mol Sci. 2025; 25(24.

PMID: 39769013 PMC: 11678496. DOI: 10.3390/ijms252413250.

A survey of k-mer methods and applications in bioinformatics.

Moeckel C, Mareboina M, Konnaris M, Chan C, Mouratidis I, Montgomery A Comput Struct Biotechnol J. 2024; 23:2289-2303.

PMID: 38840832 PMC: 11152613. DOI: 10.1016/j.csbj.2024.05.025.

CAREx: context-aware read extension of paired-end sequencing data.

Kallenborn F, Schmidt B BMC Bioinformatics. 2024; 25(1):186.

PMID: 38730374 PMC: 11088031. DOI: 10.1186/s12859-024-05802-w.

MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads.

Sami A, El-Metwally S, Rashad M BMC Bioinformatics. 2024; 25(1):61.

PMID: 38321434 PMC: 10848413. DOI: 10.1186/s12859-024-05681-1.

Illumina reads correction: evaluation and improvements.

Dlugosz M, Deorowicz S Sci Rep. 2024; 14(1):2232.

PMID: 38278837 PMC: 11222498. DOI: 10.1038/s41598-024-52386-9.
