A Novel Method for Predicting Activity of Cis-regulatory Modules, Based on a Diverse Training Set

Overview

Journal Bioinformatics

Publisher Oxford University Press

Specialty Biology

Date 2016 Sep 10

PMID 27609510

Citations 2

Authors

Wei Yang

Saurabh Sinha

Abstract

Motivation: With the rapid emergence of technologies for locating cis-regulatory modules (CRMs) genome-wide, the next pressing challenge is to assign precise functions to each CRM, i.e. to determine the spatiotemporal domains or cell types where it drives expression. A popular approach to this task is to model the typical k-mer composition of a set of CRMs known to drive a common expression pattern, and assign that pattern to other CRMs exhibiting a similar k-mer composition. This approach does not rely on prior knowledge of transcription factors relevant to the CRM or their binding motifs, and is thus more widely applicable than motif-based methods for predicting CRM activity, but it is also prone to false-positive predictions.
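As a rough illustration of the k-mer composition idea described above, the following minimal sketch pools k-mer frequencies over a set of training CRMs and compares a candidate sequence against that profile. The function names (`kmer_profile`, `cosine_similarity`), the choice of k = 3, the use of cosine similarity, and the toy sequences are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def kmer_profile(seqs, k=3):
    """Pooled, normalized k-mer frequencies over one or more DNA sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse k-mer frequency profiles."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Toy sequences, made up for illustration (not data from the paper).
training_crms = ["ACGTACGTACGT", "ACGTACGAACGT"]
reference = kmer_profile(training_crms)

# A candidate sharing k-mers with the training set scores higher
# than one with an unrelated composition.
similar = cosine_similarity(kmer_profile(["ACGTACGTAC"]), reference)
dissimilar = cosine_similarity(kmer_profile(["GGGGCCCCGGGG"]), reference)
```

A candidate would be assigned the training set's expression pattern when its similarity exceeds some threshold; choosing that threshold is where false positives can arise.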

Results: We present a novel strategy to improve the above-mentioned approach: to predict whether a CRM drives a specific gene expression pattern, assess the CRM's similarity not only to other CRMs with similar activity but also to CRMs with distinct activities. We use a state-of-the-art statistical method to quantify a CRM's sequence similarity to many different training sets of CRMs, and employ a classification algorithm to integrate these similarity scores into a single prediction of the CRM's activity. This strategy is shown to significantly improve CRM activity prediction over current approaches.
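The two-stage idea in the Results can be sketched as follows: score a sequence against several training sets, then let a classifier combine the scores. This is a toy sketch under stated assumptions, not the IMMBoost implementation. A fixed-order Markov model stands in for the statistical similarity method, a small logistic regression stands in for the classification algorithm (the abstract names neither), and every function name and sequence below is hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumes upper-case sequences containing only A, C, G, T

def train_markov(seqs, order=2, pseudo=1.0):
    """Fit a fixed-order Markov model: context -> {base: probability}."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, pseudo))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {ctx: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for ctx, nxt in counts.items()}

def avg_log_likelihood(model, seq, order=2):
    """Mean per-base log-probability; unseen contexts fall back to uniform."""
    uniform = 1.0 / len(ALPHABET)
    total, n = 0.0, 0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        total += math.log(probs[seq[i]] if probs else uniform)
        n += 1
    return total / n if n else 0.0

def feature_vector(seq, models):
    """One similarity score per training set (own activity and others)."""
    return [avg_log_likelihood(m, seq) for m in models]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain stochastic gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Probability that the sequence behind feature vector x is active."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy training sets for two hypothetical expression patterns (made-up data).
set_a = ["ATATATATATAT", "TATATATATATA"]   # pattern of interest
set_b = ["GCGCGCGCGCGC", "CGCGCGCGCGCG"]   # a distinct pattern
models = [train_markov(set_a), train_markov(set_b)]

# Label sequences by whether they drive pattern A, then train the classifier
# on their similarity scores to both sets.
X = [feature_vector(s, models) for s in set_a + set_b]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)

p_active = predict(w, b, feature_vector("TATATATA", models))
p_inactive = predict(w, b, feature_vector("CGCGCGCG", models))
```

The point of the design is that the score against the distinct set (`set_b`) is a feature in its own right: a sequence that resembles both sets is treated differently from one that resembles only the set of interest.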

Availability and Implementation: Our implementation of the new method, called IMMBoost, is freely available as source code at https://github.com/weiyangedward/IMMBoost

Contact: sinhas@illinois.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Citing Articles

Identification of gene specific cis-regulatory elements during differentiation of mouse embryonic stem cells: An integrative approach using high-throughput datasets.

Vijayabaskar M, Goode D, Obier N, Lichtinger M, Emmett A, Zainul Abidin F, et al. PLoS Comput Biol. 2019; 15(11):e1007337.

PMID: 31682597 PMC: 6855567. DOI: 10.1371/journal.pcbi.1007337.

CRM Discovery Beyond Model Insects.

Kazemian M, Halfon M. Methods Mol Biol. 2018; 1858:117-139.

PMID: 30414115 PMC: 6482005. DOI: 10.1007/978-1-4939-8775-7_10.
