iSeg: an Efficient Algorithm for Segmentation of Genomic and Epigenomic Data

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2018 Apr 13

PMID 29642840

Citations 8

Authors

Senthil B Girimurugan

Yuhang Liu

Pei-Yau Lung

Daniel L Vera

Jonathan H Dennis

Hank W Bass

Jinfeng Zhang

Abstract

Background: Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems.

Results: We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences.
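The idea of scoring candidate segments by p-value can be illustrated with a minimal sketch. This is not the iSeg implementation: it omits the coupled balanced binary trees and the refinement and merging stages, and it assumes a zero baseline mean, a normal approximation to the t distribution, and fixed window sizes. The function names, `window_sizes`, and `alpha` are hypothetical choices made for illustration only.

```python
import math

def prefix_sums(values):
    """Cumulative sums of x and x^2, so any window's mean and variance cost O(1)."""
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)
    return s1, s2

def window_t(s1, s2, start, end, mu0=0.0):
    """One-sample t-statistic of values[start:end] against baseline mean mu0.

    Requires at least two points in the window.
    """
    n = end - start
    mean = (s1[end] - s1[start]) / n
    # Unbiased sample variance recovered from the running sums.
    var = (s2[end] - s2[start] - n * mean * mean) / (n - 1)
    var = max(var, 1e-12)  # guard against zero or slightly negative rounding
    return (mean - mu0) / math.sqrt(var / n)

def approx_p_value(t):
    """Two-sided p-value, ASSUMING a normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2.0))

def candidate_segments(values, window_sizes=(4, 8, 16), alpha=1e-3):
    """Scan fixed-size windows, keep significant ones, resolve overlaps greedily."""
    s1, s2 = prefix_sums(values)
    hits = []
    for w in window_sizes:
        for start in range(len(values) - w + 1):
            t = window_t(s1, s2, start, start + w)
            p = approx_p_value(t)
            if p < alpha:
                hits.append((abs(t), p, start, start + w))
    # Most significant first; drop any window overlapping one already kept.
    hits.sort(reverse=True)
    kept = []
    for _, p, a, b in hits:
        if all(b <= ka or a >= kb for ka, kb, _ in kept):
            kept.append((a, b, p))
    return sorted(kept)

# Toy profile: low-amplitude baseline with an elevated block at positions 20..27.
demo = [0.1 if i % 2 == 0 else -0.1 for i in range(48)]
for i in range(20, 28):
    demo[i] += 2.0

print(candidate_segments(demo))
```

Lowering `alpha` makes the scan more stringent, which mirrors in a simplified way the role of the tunable significance parameters the abstract mentions. A real analysis would replace the normal approximation with an exact t distribution and the greedy overlap filter with the paper's own overlap-detection and refinement steps.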

Conclusions: We have developed an efficient general-purpose segmentation tool and showed that its results are comparable to, or more accurate than, those of many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.

Citing Articles

Evolutionary Dynamics of Chromatin Structure and Duplicate Gene Expression in Diploid and Allopolyploid Cotton.

Hu G, Grover C, Vera D, Lung P, Girimurugan S, Miller E. Mol Biol Evol. 2024; 41(5).

PMID: 38758089 PMC: 11140268. DOI: 10.1093/molbev/msae095.

DeepRegFinder: deep learning-based regulatory elements finder.

Ramakrishnan A, Wangensteen G, Kim S, Nestler E, Shen L. Bioinform Adv. 2024; 4(1):vbae007.

PMID: 38343388 PMC: 10858349. DOI: 10.1093/bioadv/vbae007.

Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns.

Libbrecht M, Chan R, Hoffman M. PLoS Comput Biol. 2021; 17(10):e1009423.

PMID: 34648491 PMC: 8516206. DOI: 10.1371/journal.pcbi.1009423.

NucHMM: a method for quantitative modeling of nucleosome organization identifying functional nucleosome states distinctly associated with splicing potentiality.

Fang K, Li T, Huang Y, Jin V. Genome Biol. 2021; 22(1):250.

PMID: 34446075 PMC: 8390234. DOI: 10.1186/s13059-021-02465-1.

The native cistrome and sequence motif families of the maize ear.

Savadel S, Hartwig T, Turpin Z, Vera D, Lung P, Sui X. PLoS Genet. 2021; 17(8):e1009689.

PMID: 34383745 PMC: 8360572. DOI: 10.1371/journal.pgen.1009689.
