
Dirichlet Diffusion Score Model for Biological Sequence Generation

Overview
Journal ArXiv
Date 2023 Jun 9
PMID 37292476
Abstract

Designing biological sequences is an important challenge that requires satisfying complex constraints, which makes it a natural problem for deep generative modeling. Diffusion generative models have achieved considerable success in many applications. The score-based generative stochastic differential equation (SDE) model is a continuous-time diffusion framework with many benefits, but the originally proposed SDEs are not naturally suited to modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, we introduce a diffusion process defined on the probability simplex whose stationary distribution is the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as the Dirichlet diffusion score model. Using a Sudoku generation task, we demonstrate that this technique can generate samples that satisfy hard constraints. The generative model can also solve Sudoku puzzles, including hard ones, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that the designed sequences share similar properties with natural promoter sequences.
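
The abstract does not spell out the SDE, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the simplest two-category case, where the simplex reduces to the interval [0, 1] and the Dirichlet distribution reduces to a Beta(a, b) distribution. A Jacobi (Wright-Fisher-type) diffusion, dx = (s/2)[a(1 - x) - b x] dt + sqrt(s x(1 - x)) dW, has that Beta distribution as its stationary law. The function name `simulate_jacobi`, the parameters `a`, `b`, `s`, and the Euler-Maruyama discretization with clipping are all choices made for this example.

```python
import numpy as np

def simulate_jacobi(x0, a, b, s=1.0, t_end=4.0, dt=0.01, rng=None):
    """Euler-Maruyama simulation of a Jacobi diffusion on [0, 1].

    Assumed SDE (illustrative, not taken from the abstract):
        dx = (s/2) * (a*(1-x) - b*x) dt + sqrt(s*x*(1-x)) dW
    whose stationary distribution is Beta(a, b).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        drift = 0.5 * s * (a * (1.0 - x) - b * x)
        diffusion = np.sqrt(s * x * (1.0 - x))
        noise = rng.standard_normal(x.shape)
        x = x + drift * dt + diffusion * np.sqrt(dt) * noise
        # The discretized step can overshoot the boundary; clip back to [0, 1].
        x = np.clip(x, 0.0, 1.0)
    return x

# Discrete data sit at the vertices of the simplex (here: 0 or 1).
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=2000)
samples = simulate_jacobi(x0, a=2.0, b=3.0, rng=rng)

# After enough time the samples should resemble Beta(2, 3), with mean a/(a+b) = 0.4.
print(samples.mean())
```

Starting every path at a vertex mirrors how one-hot encoded symbols would enter the forward process: the noise term vanishes at the boundary, and the drift pulls the state into the interior, where it gradually forgets its starting vertex. The multi-category case on the full simplex and the learned score used for generation are outside the scope of this sketch.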
