Methods for Evaluating Unsupervised Vector Representations of Genomic Regions

Overview

Journal: NAR Genom Bioinform

Publisher: Oxford University Press

Specialty: Biology

Date: 2024 Aug 12

PMID: 39131817

Authors

Guangtao Zheng

Julia Rymuza

Erfaneh Gharavi

Nathan J LeRoy

Aidong Zhang

Nathan C Sheffield

Abstract

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions requires no supervision from curated metadata and can condense rich biological knowledge from publicly available data into region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in the training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
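The abstract does not give formulas for the four scores, so the following is only an illustrative sketch of the intuition behind a neighborhood-based score such as the NPS: if regions that are close on the genome tend to be functionally similar, a good embedding should place them near each other as well. The function names `knn_indices` and `neighborhood_overlap`, the use of region midpoints on a single chromosome, the Euclidean distance, and the choice of `k` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def knn_indices(dist, k):
    """Return the indices of the k nearest neighbors for each row of a distance matrix."""
    d = np.array(dist, dtype=float)  # copy so the caller's matrix is not modified
    np.fill_diagonal(d, np.inf)      # a region is never its own neighbor
    return np.argsort(d, axis=1, kind="stable")[:, :k]


def neighborhood_overlap(midpoints, embeddings, k=2):
    """Hypothetical score: mean fraction of each region's k nearest genomic
    neighbors that are also among its k nearest neighbors in embedding space.

    Assumes all regions lie on one chromosome, so genomic distance is the
    absolute difference between region midpoints.
    """
    midpoints = np.asarray(midpoints, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Pairwise distances in genomic space and in embedding space.
    genome_dist = np.abs(midpoints[:, None] - midpoints[None, :])
    embed_dist = np.linalg.norm(
        embeddings[:, None, :] - embeddings[None, :, :], axis=-1
    )

    genome_nn = knn_indices(genome_dist, k)
    embed_nn = knn_indices(embed_dist, k)

    overlaps = [len(set(g) & set(e)) / k for g, e in zip(genome_nn, embed_nn)]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    # Toy data: two groups of three nearby regions on one chromosome.
    mids = [100, 200, 300, 10000, 10100, 10200]
    emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                    [5.0, 5.0], [5.1, 5.0], [5.2, 5.0]])
    print(neighborhood_overlap(mids, emb, k=2))
    # Reassigning embeddings across the two groups should lower the score.
    print(neighborhood_overlap(mids, emb[[0, 3, 1, 4, 2, 5]], k=2))
```

A score near 1 means genomic neighborhoods are largely preserved in the embedding space; a score near the level expected from randomly assigned embeddings suggests that this positional signal was not captured. The published NPS may differ in how it defines neighbors and how it normalizes the result.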

Citing Articles

Methods for constructing and evaluating consensus genomic interval sets.

Rymuza J, Sun Y, Zheng G, LeRoy N, Murach M, Phan N Nucleic Acids Res. 2024; 52(17):10119-10131.

PMID: 39180401 PMC: 11417377. DOI: 10.1093/nar/gkae685.

Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.

Gharavi E, LeRoy N, Zheng G, Zhang A, Brown D, Sheffield N Bioengineering (Basel). 2024; 11(3).

PMID: 38534537 PMC: 10967841. DOI: 10.3390/bioengineering11030263.
