A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses

Overview

Journal Front Genet

Date 2021 Jul 30

PMID 34326863

Citations 1

Authors

Dariusz Mrozek

Krzysztof Stepien

Piotr Grzesik

Bozena Malysiak-Mrozek

Abstract

Many of the analyses performed on multi-omics data today are driven by next-generation sequencing (NGS) techniques, which produce large volumes of DNA/RNA sequences. Although many tools allow parallel processing of NGS data in a distributed Big Data environment, they do not make it easy to improve the quality of NGS data at a large scale in a simple, declarative manner. Meanwhile, large sequencing projects and routine DNA/RNA sequencing for the molecular profiling of diseases in personalized treatment require both good-quality data and appropriate infrastructure for storing and processing the data efficiently. To address these problems, we adapt the Data Lake concept to storing and processing big NGS data. We also propose a dedicated library for cleaning DNA/RNA sequences obtained with single-read and paired-end sequencing techniques. To accommodate the growth of NGS data, our solution is highly scalable in the Cloud and can rapidly and flexibly adjust to the amount of data to be processed. Moreover, to simplify the use of the data cleaning methods and the implementation of other phases of data analysis workflows, our library extends the declarative U-SQL query language with a set of capabilities for data extraction, processing, and storage. The results of our experiments show that the whole solution meets the requirements for ample storage and highly parallel, scalable processing that accompany NGS-based multi-omics data analyses.
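To make "cleaning" of single-read and paired-end data more concrete, the following is a minimal Python sketch of one common cleaning step: trimming low-quality bases from the 3' end of a read by Phred score. It is not the authors' library and does not use U-SQL; the function names, the Phred+33 encoding, and the thresholds (`min_q`, `min_len`) are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Read = Tuple[str, str]  # (sequence, quality string); hypothetical representation


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence: str, quality: str, min_q: int = 20) -> Read:
    """Drop trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def clean_pair(read1: Read, read2: Read,
               min_q: int = 20, min_len: int = 4) -> Optional[Tuple[Read, Read]]:
    """Trim both mates; keep the pair only if both stay at least min_len long."""
    t1 = trim_3prime(read1[0], read1[1], min_q)
    t2 = trim_3prime(read2[0], read2[1], min_q)
    if len(t1[0]) < min_len or len(t2[0]) < min_len:
        return None  # discard the whole pair so the mates stay in sync
    return t1, t2
```

In this sketch a paired-end read is dropped as a unit when either mate becomes too short, which keeps the two files of a paired-end run aligned record by record. A single-read dataset would use only `trim_3prime`.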

Citing Articles

Improved meta-analysis pipeline ameliorates distinctive gene regulators of diabetic vasculopathy in human endothelial cell (hECs) RNA-Seq data.

Pandey D, Perumal P O. PLoS One. 2023; 18(11):e0293939.

PMID: 37943808 PMC: 10635490. DOI: 10.1371/journal.pone.0293939.
