CARE 2.0: Reducing False-positive Sequencing Error Corrections Using Machine Learning

Overview
Publisher Biomed Central
Specialty Biology
Date 2022 Jun 13
PMID 35698033
Abstract

Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analyses such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.

Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment that targets Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves high numbers of true-positive corrections comparable to those of its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.
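
To make the reported counts easier to compare, the following minimal sketch turns them into precision values, assuming precision is defined as TP / (TP + FP). The function and variable names are illustrative and are not taken from CARE. Because the "at most 814.5M TPs" and "at least 3.9M FPs" figures for the other tools need not come from the same tool, the second value is only an upper bound on what any of them could reach.

```python
def precision(tp: float, fp: float) -> float:
    """Fraction of applied corrections that were correct: TP / (TP + FP)."""
    return tp / (tp + fp)


# Counts in millions of corrections, as reported for the simulated human dataset.
care2_tp, care2_fp = 801.4, 1.2

# Bounds reported for the other tools: at most 814.5M TPs, at least 3.9M FPs.
# Precision grows with TP and shrinks with FP, so combining the largest TP
# with the smallest FP gives an upper bound on their precision.
other_tp_max, other_fp_min = 814.5, 3.9

care2_precision = precision(care2_tp, care2_fp)
other_precision_upper = precision(other_tp_max, other_fp_min)

# How many times more false positives the other tools produce, at minimum.
fp_ratio = other_fp_min / care2_fp

print(f"CARE 2.0 precision:               {care2_precision:.4f}")
print(f"Upper bound for the other tools:  {other_precision_upper:.4f}")
print(f"Other tools produce at least {fp_ratio:.2f}x as many FPs")
```

On this dataset the gap is about a factor of three in false positives; the "up to two orders of magnitude" figure quoted above refers to the best case across the evaluated datasets, which this sketch does not cover.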

Conclusion: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of such corrections compared to other state-of-the-art programs, including BFC, Karect, Musket, Bcool, SGA, and Lighter. It thus produces higher-quality datasets, which improve k-mer analysis and de-novo assembly on real-world data and demonstrate the applicability of machine learning techniques to sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.

Citing Articles

Enhancing Clinical Applications by Evaluation of Sensitivity and Specificity in Whole Exome Sequencing.

Moon Y, Hong C, Kim Y, Kim J, Ye S, Kang E Int J Mol Sci. 2025; 25(24.

PMID: 39769013 PMC: 11678496. DOI: 10.3390/ijms252413250.


A survey of k-mer methods and applications in bioinformatics.

Moeckel C, Mareboina M, Konnaris M, Chan C, Mouratidis I, Montgomery A Comput Struct Biotechnol J. 2024; 23:2289-2303.

PMID: 38840832 PMC: 11152613. DOI: 10.1016/j.csbj.2024.05.025.


CAREx: context-aware read extension of paired-end sequencing data.

Kallenborn F, Schmidt B BMC Bioinformatics. 2024; 25(1):186.

PMID: 38730374 PMC: 11088031. DOI: 10.1186/s12859-024-05802-w.


MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads.

Sami A, El-Metwally S, Rashad M BMC Bioinformatics. 2024; 25(1):61.

PMID: 38321434 PMC: 10848413. DOI: 10.1186/s12859-024-05681-1.


Illumina reads correction: evaluation and improvements.

Dlugosz M, Deorowicz S Sci Rep. 2024; 14(1):2232.

PMID: 38278837 PMC: 11222498. DOI: 10.1038/s41598-024-52386-9.
