Bioinformatics Applications on Apache Spark

Overview

Journal GigaScience

Publisher Oxford University Press

Specialties Biology, Genetics

Date 2018 Aug 14

PMID 30101283

Citations 30

Authors

Runxin Guo

Yi Zhao

Quan Zou

Xiaodong Fang

Shaoliang Peng

Abstract

With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing, creating an urgent need for highly scalable and powerful computational systems. Among state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that achieves high fault tolerance and scalability through the resilient distributed dataset (RDD) abstraction. Spark can run up to 100 times faster than Hadoop when data are held in memory and up to 10 times faster when data are read from disk. It provides high-level application programming interfaces in Java, Scala, Python, and R, and it includes several higher-level components: Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and in other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey provide a comprehensive guideline to help bioinformatics researchers apply Spark in their own fields.

Citing Articles

Mechanisms and technologies in cancer epigenetics.

Sherif Z, Ogunwobi O, Ressom H. Front Oncol. 2025; 14:1513654.

PMID: 39839798 PMC: 11746123. DOI: 10.3389/fonc.2024.1513654.

Biomedical Big Data Technologies, Applications, and Challenges for Precision Medicine: A Review.

Yang X, Huang K, Yang D, Zhao W, Zhou X. Glob Chall. 2024; 8(1):2300163.

PMID: 38223896 PMC: 10784210. DOI: 10.1002/gch2.202300163.

Negation recognition in clinical natural language processing using a combination of the NegEx algorithm and a convolutional neural network.

Arguello-Gonzalez G, Aquino-Esperanza J, Salvador D, Breton-Romero R, Del Rio-Bermudez C, Tello J. BMC Med Inform Decis Mak. 2023; 23(1):216.

PMID: 37833661 PMC: 10576331. DOI: 10.1186/s12911-023-02301-5.

Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment.

Chicco D, Ferraro Petrillo U, Cattaneo G. PLoS Comput Biol. 2023; 19(7):e1011272.

PMID: 37471333 PMC: 10358940. DOI: 10.1371/journal.pcbi.1011272.

Fog-Based Smart Cardiovascular Disease Prediction System Powered by Modified Gated Recurrent Unit.

Nancy A, Ravindran D, Vincent D, Srinivasan K, Chang C. Diagnostics (Basel). 2023; 13(12).

PMID: 37370966 PMC: 10297507. DOI: 10.3390/diagnostics13122071.
