Hadoop-BAM: Directly Manipulating Next Generation Sequencing Data in the Cloud

Overview

Journal: Bioinformatics

Publisher: Oxford University Press

Specialty: Biology

Date: 2012 Feb 4

PMID: 22302568

Citations: 39

Authors

Matti Niemenmaa

Aleksi Kallio

Andre Schumacher

Petri Klemela

Eija Korpelainen

Keijo Heljanko

Abstract

Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can operate directly on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article, we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability and that one should avoid moving data in and out of Hadoop between analysis steps.
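
As a rough illustration of the map and reduce pattern the abstract describes, the sketch below summarizes read coverage per fixed-size genomic bin. It is not the Hadoop-BAM API and not the authors' Chipster tool: it runs in a single process without Hadoop, and `AlignedRead`, `BIN_SIZE`, `map_read`, `reduce_bin`, and `summarize_coverage` are hypothetical names chosen for this example. Each read is assumed to cover one contiguous interval, which ignores gaps and clipping in real alignments.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple

BIN_SIZE = 100  # assumed summary resolution in bases (illustrative only)


@dataclass(frozen=True)
class AlignedRead:
    """Hypothetical stand-in for one aligned BAM record."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


Key = Tuple[str, int]  # (chromosome, bin index)


def map_read(read: AlignedRead) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((chrom, bin), bases covered in that bin) for one read."""
    pos = read.start
    while pos < read.end:
        bin_index = pos // BIN_SIZE
        bin_end = (bin_index + 1) * BIN_SIZE
        covered = min(read.end, bin_end) - pos
        yield (read.chrom, bin_index), covered
        pos += covered


def reduce_bin(key: Key, values: Iterable[int]) -> Tuple[Key, float]:
    """Reduce step: turn covered-base counts into mean depth over the bin."""
    return key, sum(values) / BIN_SIZE


def summarize_coverage(reads: Iterable[AlignedRead]) -> Dict[Key, float]:
    """Run map, group by key (the shuffle), then reduce, all in-process."""
    grouped: Dict[Key, List[int]] = defaultdict(list)
    for read in reads:
        for key, covered in map_read(read):
            grouped[key].append(covered)
    return dict(reduce_bin(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    example_reads = [AlignedRead("chr1", 50, 150), AlignedRead("chr1", 90, 120)]
    for key, depth in sorted(summarize_coverage(example_reads).items()):
        print(key, depth)
```

In a real Hadoop job the grouping step would be performed by the framework's shuffle rather than by an in-memory dictionary, and the map function would receive BAM records from the input format instead of a Python list.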

Citing Articles

Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment.

Chicco D, Ferraro Petrillo U, Cattaneo G PLoS Comput Biol. 2023; 19(7):e1011272.

PMID: 37471333 PMC: 10358940. DOI: 10.1371/journal.pcbi.1011272.

Cloud-native distributed genomic pileup operations.

Wiewiorka M, Szmurlo A, Stankiewicz P, Gambin T Bioinformatics. 2022; 39(1).

PMID: 36515465 PMC: 9848050. DOI: 10.1093/bioinformatics/btac804.

SaAlign: Multiple DNA/RNA sequence alignment and phylogenetic tree construction tool for ultra-large datasets and ultra-long sequences based on suffix array.

Wang Z, Tan J, Long Y, Liu Y, Lei W, Cai J Comput Struct Biotechnol J. 2022; 20:1487-1493.

PMID: 35422971 PMC: 8976100. DOI: 10.1016/j.csbj.2022.03.018.

BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Chen J, Li F, Wang M, Li J, Marquez-Lago T, Leier A Front Big Data. 2022; 4:727216.

PMID: 35118375 PMC: 8805145. DOI: 10.3389/fdata.2021.727216.

Halvade somatic: Somatic variant calling with Apache Spark.

Decap D, de Schaetzen van Brienen L, Larmuseau M, Costanza P, Herzeel C, Wuyts R Gigascience. 2022; 11(1).

PMID: 35022699 PMC: 8756192. DOI: 10.1093/gigascience/giab094.
