
Methods for Evaluating Unsupervised Vector Representations of Genomic Regions

Overview
Specialty Biology
Date 2024 Aug 12
PMID 39131817
Abstract

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of biological entities such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions requires no supervision from curated metadata and can condense rich biological knowledge from publicly available data into region embeddings. However, no method exists for evaluating the quality of these embeddings in the absence of metadata, which makes it difficult to assess the reliability of analyses based on the embeddings and to tune model training for optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS are statistical scores: the CTS quantifies how well region embeddings can be clustered, and the RCS quantifies how well the embeddings preserve information in the training data. The GDSS and NPS are biological scores that exploit the tendency of regions close together in genomic space to have similar biological functions; they measure how much of this information is captured by the individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
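The abstract does not give formulas for the four scores, so the following is only an illustrative sketch of the idea behind a neighborhood-preserving measure, not the published NPS. It assumes that such a score can be approximated by comparing, for each region, its K nearest neighbors along the genome with its K nearest neighbors in embedding space and averaging the overlap. The function names (`knn_indices`, `neighborhood_overlap`), the use of region midpoints on a single chromosome, and the Euclidean distance are all assumptions made for illustration; the published score may differ, for example in how it is normalized.

```python
import numpy as np


def knn_indices(dist, k):
    """Indices of the k nearest neighbours in each row, excluding the point itself."""
    d = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(d, np.inf)  # a region is never its own neighbour
    return np.argsort(d, axis=1, kind="stable")[:, :k]


def neighborhood_overlap(midpoints, embeddings, k=2):
    """Mean fraction of genomic k-nearest neighbours that are also
    embedding-space k-nearest neighbours (hypothetical helper).

    midpoints  : 1-D array of region midpoints on ONE chromosome (assumption)
    embeddings : 2-D array, one row per region, same order as midpoints
    """
    midpoints = np.asarray(midpoints, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Pairwise distance along the genome (absolute midpoint difference).
    genome_dist = np.abs(midpoints[:, None] - midpoints[None, :])

    # Pairwise Euclidean distance between embedding vectors.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    embed_dist = np.sqrt((diff ** 2).sum(axis=-1))

    genome_nn = knn_indices(genome_dist, k)
    embed_nn = knn_indices(embed_dist, k)

    # Per-region overlap ratio, then the average over all regions.
    overlaps = [len(set(g) & set(e)) / k for g, e in zip(genome_nn, embed_nn)]
    return float(np.mean(overlaps))
```

Under these assumptions, a value near 1 means that regions close together on the genome are also close together in embedding space, and a value near the level expected from random embeddings means that little of the genomic neighborhood structure is captured.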

Citing Articles

Methods for constructing and evaluating consensus genomic interval sets.

Rymuza J, Sun Y, Zheng G, LeRoy N, Murach M, Phan N. Nucleic Acids Res. 2024; 52(17):10119-10131.

PMID: 39180401 PMC: 11417377. DOI: 10.1093/nar/gkae685.


Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.

Gharavi E, LeRoy N, Zheng G, Zhang A, Brown D, Sheffield N. Bioengineering (Basel). 2024; 11(3).

PMID: 38534537 PMC: 10967841. DOI: 10.3390/bioengineering11030263.
