Halvade: Scalable Sequence Analysis with MapReduce

Overview
Journal Bioinformatics
Specialty Biology
Date 2015 Mar 31
PMID 25819078
Citations 25
Abstract

Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
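The abstract does not say how the work is divided across nodes, so the following is only a toy sketch of the general MapReduce pattern, not Halvade's implementation. It assumes map tasks emit reads keyed by genomic region and reduce tasks process each region independently. The names `REGION_SIZE`, `map_chunk`, `shuffle`, `reduce_region` and `run_pipeline` are hypothetical, threads stand in for cluster nodes, and counting reads stands in for variant calling.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

REGION_SIZE = 1000  # hypothetical region width in base pairs


def map_chunk(chunk):
    """Map task: emit (region key, read) pairs for one chunk of reads.

    Toy reads already carry a position; a real map task would run a
    read mapper here to find it.
    """
    pairs = []
    for chrom, pos, seq in chunk:
        key = (chrom, pos // REGION_SIZE)
        pairs.append((key, (pos, seq)))
    return pairs


def shuffle(mapped):
    """Shuffle step: group all emitted reads by their region key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_region(item):
    """Reduce task: process one region on its own.

    Sorting and counting reads is a placeholder for variant calling.
    """
    key, reads = item
    reads = sorted(reads)
    return key, len(reads)


def run_pipeline(chunks, workers=4):
    """Run map tasks in parallel, shuffle, then run reduce tasks in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
        groups = shuffle(mapped)
        reduced = dict(pool.map(reduce_region, sorted(groups.items())))
    return reduced


# Two input chunks of (chromosome, position, sequence) toy reads.
chunks = [
    [("chr1", 150, "ACGT"), ("chr1", 1200, "TTGA")],
    [("chr1", 900, "GGCA"), ("chr2", 50, "CATG")],
]
result = run_pipeline(chunks)
```

Because no map or reduce task in this sketch depends on another task of the same phase, each phase can be spread over as many workers as there are chunks or regions, which is the property a MapReduce framework relies on to scale.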

Citing Articles

BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Chen J, Li F, Wang M, Li J, Marquez-Lago T, Leier A. Front Big Data. 2022; 4:727216.

PMID: 35118375 PMC: 8805145. DOI: 10.3389/fdata.2021.727216.


Halvade somatic: Somatic variant calling with Apache Spark.

Decap D, de Schaetzen van Brienen L, Larmuseau M, Costanza P, Herzeel C, Wuyts R. Gigascience. 2022; 11(1).

PMID: 35022699 PMC: 8756192. DOI: 10.1093/gigascience/giab094.


VC@Scale: Scalable and high-performance variant calling on cluster environments.

Ahmad T, Al Ars Z, Hofstee H. Gigascience. 2021; 10(9).

PMID: 34494101 PMC: 8424057. DOI: 10.1093/gigascience/giab057.


Cloud Computing Enabled Big Multi-Omics Data Analytics.

Koppad S, B A, Gkoutos G, Acharjee A. Bioinform Biol Insights. 2021; 15:11779322211035921.

PMID: 34376975 PMC: 8323418. DOI: 10.1177/11779322211035921.


Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment.

Maarala A, Arasalo O, Valenzuela D, Makinen V, Heljanko K. PLoS One. 2021; 16(8):e0255260.

PMID: 34343181 PMC: 8330939. DOI: 10.1371/journal.pone.0255260.

