BioPig: a Hadoop-based Analytic Toolkit for Large-scale Sequence Data

Overview
Journal: Bioinformatics
Specialty: Biology
Date: 2013 Sep 12
PMID: 24021384
Citations: 24
Abstract

Motivation: The recent revolution in sequencing technologies has led to exponential growth in sequence data. As a result, most current bioinformatics tools are becoming obsolete because they fail to scale with the data. To tackle this 'data deluge', we introduce the BioPig sequence analysis toolkit as one solution that scales with both data and computation.

Results: We built BioPig on Apache Hadoop's MapReduce system and the Pig data flow language. Compared with traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb of sequences demonstrates that it scales automatically with the size of the data; and finally, BioPig can be ported without modification to many Hadoop infrastructures, as tested with the Magellan system at the National Energy Research Scientific Computing Center and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel programming framework with the potential to greatly accelerate data-intensive bioinformatics analysis.

Citing Articles

RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor.

Pallotta S, Cascianelli S, Masseroli M. BMC Bioinformatics. 2022; 23(1):123.

PMID: 35392801 PMC: 8991469. DOI: 10.1186/s12859-022-04648-4.


BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Chen J, Li F, Wang M, Li J, Marquez-Lago T, Leier A. Front Big Data. 2022; 4:727216.

PMID: 35118375 PMC: 8805145. DOI: 10.3389/fdata.2021.727216.


Cloud Computing Enabled Big Multi-Omics Data Analytics.

Koppad S, B A, Gkoutos G, Acharjee A. Bioinform Biol Insights. 2021; 15:11779322211035921.

PMID: 34376975 PMC: 8323418. DOI: 10.1177/11779322211035921.


Computational Strategies for Scalable Genomics Analysis.

Shi L, Wang Z. Genes (Basel). 2019; 10(12).

PMID: 31817630 PMC: 6947637. DOI: 10.3390/genes10121017.


PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead.

Zhang L, Liu C, Dong S. Genes (Basel). 2019; 10(11).

PMID: 31689965 PMC: 6896194. DOI: 10.3390/genes10110886.