MolLM: a Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations

Overview
Journal Bioinformatics
Specialty Biology
Date 2024 Jun 28
PMID 38940177
Abstract

Motivation: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain.

Results: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed a dataset of 160K molecule-text pairings. Employing contrastive learning as its supervisory signal, MolLM demonstrates robust molecular representation capabilities across four downstream tasks: cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance on these downstream tasks.
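The contrastive objective mentioned above pulls matched molecule-text pairs together in a shared embedding space while pushing mismatched pairs apart. A minimal, framework-free sketch of such a symmetric InfoNCE loss is shown below; the embedding dimensionality, temperature value, and function names are illustrative assumptions, not MolLM's actual implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so similarities are cosines."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(text_embs, mol_embs, temperature=0.1):
    """Symmetric InfoNCE loss: pair i of text_embs matches pair i of mol_embs.

    Each row/column of the similarity matrix is treated as a softmax
    classification problem whose correct class is the matched index.
    """
    t = [normalize(v) for v in text_embs]
    m = [normalize(v) for v in mol_embs]
    n = len(t)
    # Temperature-scaled cosine-similarity matrix: sims[i][j] = sim(text_i, mol_j)
    sims = [[dot(ti, mj) / temperature for mj in m] for ti in t]

    def cross_entropy(rows):
        # Average of -log softmax probability at the matched (diagonal) index
        loss = 0.0
        for i, row in enumerate(rows):
            log_z = math.log(sum(math.exp(s) for s in row))
            loss += log_z - row[i]
        return loss / len(rows)

    # Text-to-molecule direction uses rows; molecule-to-text uses columns
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))
```

With correctly aligned pairs the loss approaches zero, while permuting one modality's embeddings drives it up, which is what supplies the supervisory signal during pre-training.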

Availability And Implementation: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.

Citing Articles

A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation.

Tang X, Dai H, Knight E, Wu F, Li Y, Li T. Brief Bioinform. 2024; 25(4). PMID: 39007594; PMC: 11247410. DOI: 10.1093/bib/bbae338.
