Impact of Similarity Metrics on Single-cell RNA-seq Data Clustering

Overview
Journal: Brief Bioinform
Specialty: Biology
Date: 2018 Aug 24
PMID: 30137247
Citations: 53
Abstract

Advances in high-throughput sequencing on single-cell gene expressions [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling on individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, currently there is no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics on clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance assessment. Overall, we found that correlation-based metrics (e.g. Pearson's correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test if the use of correlation-based metrics can benefit the recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) using Pearson's correlation as a similarity measure and found significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson's correlation as a favourable choice. Further comparison on different scRNA-seq library preparation protocols suggests that they may also affect clustering performance. 
Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.
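
The contrast between a correlation-based and a distance-based metric can be illustrated with a minimal sketch. This is not the authors' benchmarking framework: the abstract gives no formulas, so the sketch assumes the common convention of turning Pearson's correlation r into a dissimilarity as 1 - r, and the function names and the toy expression values are invented for illustration.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """Dissimilarity 1 - r, where r is Pearson's correlation (assumed convention).

    Perfectly correlated profiles give 0; perfectly anti-correlated give 2.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("Pearson's correlation is undefined for a constant vector")
    return 1.0 - cov / (sd_x * sd_y)


def pairwise(cells, metric):
    """Full cell-by-cell dissimilarity matrix under the given metric."""
    return [[metric(a, b) for b in cells] for a in cells]


# Hypothetical toy data: rows are cells, columns are genes.
cells = [
    [1.0, 2.0, 3.0, 4.0],  # cell A
    [2.0, 4.0, 6.0, 8.0],  # cell B: same profile as A at twice the scale
    [4.0, 3.0, 2.0, 1.0],  # cell C: reversed profile
]

for name, metric in [("euclidean", euclidean_distance), ("pearson", pearson_distance)]:
    print(name)
    for row in pairwise(cells, metric):
        print("  ", [round(v, 3) for v in row])
```

In this toy case the two metrics disagree about which cell is nearest to cell A: Euclidean distance places A closer to the reversed profile C than to the rescaled profile B, while the correlation-based dissimilarity treats A and B as identical. Whether this kind of scale sensitivity explains the benchmark results is not stated in the abstract, so the example should be read only as an illustration of how the two metric families can diverge.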

Citing Articles

Single cell RNA sequencing improves the next generation of approaches to AML treatment: challenges and perspectives.

Khosroabadi Z, Azaryar S, Dianat-Moghadam H, Amoozgar Z, Sharifi M. Mol Med. 2025; 31(1):33.

PMID: 39885388 PMC: 11783831. DOI: 10.1186/s10020-025-01085-w.


scEGG: an exogenous gene-guided clustering method for single-cell transcriptomic data.

Hu D, Guan R, Liang K, Yu H, Quan H, Zhao Y. Brief Bioinform. 2024; 25(6).

PMID: 39344711 PMC: 11440090. DOI: 10.1093/bib/bbae483.


scConfluence: single-cell diagonal integration with regularized Inverse Optimal Transport on weakly connected features.

Samaran J, Peyre G, Cantini L. Nat Commun. 2024; 15(1):7762.

PMID: 39237488 PMC: 11377776. DOI: 10.1038/s41467-024-51382-x.


Improving replicability in single-cell RNA-Seq cell type discovery with Dune.

Roux de Bezieux H, Street K, Fischer S, Van den Berge K, Chance R, Risso D. BMC Bioinformatics. 2024; 25(1):198.

PMID: 38789920 PMC: 11127396. DOI: 10.1186/s12859-024-05814-6.


The effect of data transformation on low-dimensional integration of single-cell RNA-seq.

Park Y, Hauschild A. BMC Bioinformatics. 2024; 25(1):171.

PMID: 38689234 PMC: 11059821. DOI: 10.1186/s12859-024-05788-5.