
Towards Enhanced and Interpretable Clustering/classification in Integrative Genomics

Overview
Specialty Biochemistry
Date 2017 Oct 5
PMID 28977511
Citations 1
Abstract

High-throughput technologies have produced large collections of diverse biological data, providing unprecedented opportunities to unravel the molecular heterogeneity of biological processes. Nevertheless, integrating data from multiple sources into a holistic, biologically meaningful interpretation remains challenging. In this work, we propose a scalable and tuning-free preprocessing framework, Heterogeneity Rescaling Pursuit (Hetero-RP), which weights important features more heavily than less important ones in accordance with implicitly existing auxiliary knowledge. We demonstrate the effectiveness of Hetero-RP in diverse clustering and classification applications. More importantly, Hetero-RP offers an interpretation of feature importance, shedding light on the driving forces of the underlying biology. In metagenomic contig binning, Hetero-RP automatically weights abundance and composition profiles according to the varying number of samples, markedly improving binning performance. In RNA-binding protein (RBP) binding site prediction, Hetero-RP not only improves prediction performance, measured by the area under the receiver operating characteristic curve (AUC), but also uncovers evidence supported by independent studies, including the distribution of the binding sites of IGF2BP and PUM2, the binding competition between hnRNPC and U2AF2, and the intron-exon boundary of U2AF2. Availability: https://github.com/younglululu/Hetero-RP
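
To make the idea of feature rescaling concrete, the following is a minimal sketch of how per-feature weights change the distances that a clustering method sees. It is a generic illustration only and is not the Hetero-RP algorithm: the abstract does not describe how Hetero-RP derives its weights, so the weights here are hand-picked, and the function names `rescale_features` and `pairwise_sq_dists` as well as the toy data are assumptions made for this example.

```python
import numpy as np


def rescale_features(X, weights):
    """Multiply each feature column of X by its non-negative weight."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    if w.shape != (X.shape[1],):
        raise ValueError("need exactly one weight per feature column")
    if np.any(w < 0):
        raise ValueError("weights must be non-negative")
    return X * w  # broadcasting scales column j by w[j]


def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)


# Toy data (hypothetical): rows 0-1 form one group and rows 2-3 another.
# Feature 0 carries the group signal; feature 1 is large-scale noise.
X = np.array([
    [0.0, 10.0],
    [0.1, -10.0],
    [1.0, 9.0],
    [1.1, -9.0],
])

# Hand-picked weights (an assumption): keep feature 0, shrink feature 1.
weights = np.array([1.0, 0.01])

D_raw = pairwise_sq_dists(X)
D_scaled = pairwise_sq_dists(rescale_features(X, weights))

# Compare how far row 0 is from its group mate (row 1) and from row 2.
print("raw:     ", D_raw[0, 1], D_raw[0, 2])
print("rescaled:", D_scaled[0, 1], D_scaled[0, 2])
```

In the raw data the noisy feature dominates, so row 0 is closer to row 2 than to its own group mate; after rescaling, the ordering is reversed. Because the weights are explicit, they can also be read as a simple statement of which features drive the grouping, which is the kind of interpretability the abstract refers to.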

Citing Articles

SolidBin: improving metagenome binning with semi-supervised normalized cut.

Wang Z, Wang Z, Lu Y, Sun F, Zhu S. Bioinformatics. 2019; 35(21):4229-4238.

PMID: 30977806 PMC: 6821242. DOI: 10.1093/bioinformatics/btz253.
