Halvade: Scalable Sequence Analysis with MapReduce

Overview

Journal: Bioinformatics

Publisher: Oxford University Press

Specialty: Biology

Date: 2015 Mar 31

PMID: 25819078

Citations: 25

Authors

Dries Decap

Joke Reumers

Charlotte Herzeel

Pascal Costanza

Jan Fostier

Abstract

Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
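As a rough illustration of how a read-mapping and variant-calling pipeline can be split into independent tasks, the following Python sketch assumes that alignment runs in the map step and that variant calling runs per genomic region in the reduce step. It is a toy model under those assumptions, not Halvade's implementation: `toy_align`, `map_task`, `reduce_task`, `run_pipeline` and `REGION_SIZE` are hypothetical names, and the aligner and caller are simple stand-ins for real tools.

```python
from collections import Counter, defaultdict

REGION_SIZE = 8  # hypothetical toy region width, in bases


def toy_align(read, reference):
    """Stand-in for a read mapper: return the offset with the fewest mismatches."""
    best_pos, best_mm = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = sum(a != b for a, b in zip(read, reference[pos:pos + len(read)]))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos  # None if the read is longer than the reference


def map_task(chunk, reference):
    """Map: align one chunk of reads; emit each alignment once per region it overlaps."""
    pairs = []
    for read in chunk:
        pos = toy_align(read, reference)
        if pos is None:
            continue
        first = pos // REGION_SIZE
        last = (pos + len(read) - 1) // REGION_SIZE
        for region in range(first, last + 1):
            pairs.append((region, (pos, read)))
    return pairs


def reduce_task(region, alignments, reference):
    """Reduce: stand-in variant caller; majority base per position inside this region."""
    start = region * REGION_SIZE
    end = min(start + REGION_SIZE, len(reference))
    pileup = defaultdict(Counter)
    for pos, read in alignments:
        for offset, base in enumerate(read):
            p = pos + offset
            if start <= p < end:
                pileup[p][base] += 1
    variants = []
    for p in sorted(pileup):
        base = pileup[p].most_common(1)[0][0]
        if base != reference[p]:
            variants.append((p, reference[p], base))
    return variants


def run_pipeline(chunks, reference):
    """Sequential driver; map tasks and reduce tasks are each independent of one another."""
    groups = defaultdict(list)
    for chunk in chunks:  # map phase
        for region, alignment in map_task(chunk, reference):
            groups[region].append(alignment)  # shuffle: group by region key
    variants = []
    for region in sorted(groups):  # reduce phase
        variants.extend(reduce_task(region, groups[region], reference))
    return variants


reference = "ACGTACGTTTGACCAGTACG"
# Reads drawn from a sample that carries a G>A substitution at 0-based position 10.
chunks = [["ACGTACGT", "ACGTTTAA"], ["GTTTAACC", "TTAACCAG", "CCAGTACG"]]
print(run_pipeline(chunks, reference))  # [(10, 'G', 'A')]
```

Because each map task sees only its own chunk of reads and each reduce task sees only the alignments for its own region, both phases could in principle be distributed over many cores or nodes. Emitting a read to every region it overlaps lets each reducer call positions in its own region without consulting its neighbours.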

Citing Articles

BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Chen J, Li F, Wang M, Li J, Marquez-Lago T, Leier A. Front Big Data. 2022; 4:727216.

PMID: 35118375 PMC: 8805145. DOI: 10.3389/fdata.2021.727216.

Halvade somatic: Somatic variant calling with Apache Spark.

Decap D, de Schaetzen van Brienen L, Larmuseau M, Costanza P, Herzeel C, Wuyts R. Gigascience. 2022; 11(1).

PMID: 35022699 PMC: 8756192. DOI: 10.1093/gigascience/giab094.

VC@Scale: Scalable and high-performance variant calling on cluster environments.

Ahmad T, Al Ars Z, Hofstee H. Gigascience. 2021; 10(9).

PMID: 34494101 PMC: 8424057. DOI: 10.1093/gigascience/giab057.

Cloud Computing Enabled Big Multi-Omics Data Analytics.

Koppad S, B A, Gkoutos G, Acharjee A. Bioinform Biol Insights. 2021; 15:11779322211035921.

PMID: 34376975 PMC: 8323418. DOI: 10.1177/11779322211035921.

Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment.

Maarala A, Arasalo O, Valenzuela D, Makinen V, Heljanko K. PLoS One. 2021; 16(8):e0255260.

PMID: 34343181 PMC: 8330939. DOI: 10.1371/journal.pone.0255260.
