Knowledge-based BERT: a Method to Extract Molecular Features Like Computational Chemists

Overview

Journal: Brief Bioinform

Publisher: Oxford University Press

Specialty: Biology

Date: 2022 Apr 19

PMID: 35438145

Authors

Zhenxing Wu

Dejun Jiang

Jike Wang

Xujun Zhang

Hongyan Du

Lurong Pan

Chang-Yu Hsieh

Dongsheng Cao

Tingjun Hou

Abstract

Molecular property prediction models based on machine learning have become important tools for triaging unpromising lead molecules in the early stages of drug discovery. Compared with the mainstream descriptor- and graph-based methods for molecular property prediction, SMILES-based methods can extract molecular features directly from SMILES without human expert knowledge, but they require more powerful feature-extraction algorithms and larger amounts of training data, which makes them less popular. Here, we show the great potential of pre-training for improving the prediction of important pharmaceutical properties. Using three pre-training tasks based on atom feature prediction, molecular feature prediction and contrastive learning, we developed K-BERT, a new pre-training method that can extract chemical information from SMILES as chemists do. Results on 15 pharmaceutical datasets show that K-BERT outperforms well-established descriptor-based (XGBoost) and graph-based (Attentive FP and HRGCN+) models. In addition, we found that the contrastive learning pre-training task enables K-BERT to 'understand' SMILES beyond canonical SMILES. Moreover, the general fingerprints generated by K-BERT (K-BERT-FP) exhibit predictive power comparable to that of MACCS on the 15 pharmaceutical datasets and can also capture molecular size and chirality information that traditional binary fingerprints cannot. Our results illustrate the great potential of K-BERT in practical applications of molecular property prediction in drug discovery.
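
To make the point about canonical versus non-canonical SMILES concrete, the sketch below splits SMILES strings into atom-level tokens and shows that three valid SMILES for the same molecule (ethanol) give three different token sequences. This is only an illustration under stated assumptions: the regular expression is a simplified atom-level pattern of the kind commonly used for SMILES, and the names `SMILES_TOKEN` and `tokenize` are hypothetical. None of it is taken from the K-BERT implementation, whose tokenizer may differ.

```python
import re
from typing import List

# Simplified atom-level SMILES pattern (an assumption for illustration,
# not the tokenizer used by K-BERT). Bracket atoms such as [C@@H] are
# kept as single tokens, and two-letter halogens are tried before
# single-letter atoms so that "Cl" is not split into "C" and "l".
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


# Three valid SMILES strings that all describe ethanol.
ethanol_variants = ["CCO", "OCC", "C(O)C"]
for smi in ethanol_variants:
    print(smi, tokenize(smi))

# A stereocentre is written as a bracket atom and stays one token.
print(tokenize("N[C@@H](C)C(=O)O"))
```

Because a sequence model sees only the tokens, the three ethanol strings look like different inputs even though they encode one molecule. That is the gap the abstract attributes to the contrastive learning task, which is described as letting K-BERT handle SMILES beyond the canonical form.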

Citing Articles

Foundation models in bioinformatics.

Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C. Natl Sci Rev. 2025; 12(4):nwaf028.

PMID: 40078374 PMC: 11900445. DOI: 10.1093/nsr/nwaf028.

Large language models and their applications in bioinformatics.

Sarumi O, Heider D. Comput Struct Biotechnol J. 2024; 23:3498-3505.

PMID: 39435343 PMC: 11493188. DOI: 10.1016/j.csbj.2024.09.031.

Machine learning-guided strategies for reaction conditions design and optimization.

Chen L, Li Y. Beilstein J Org Chem. 2024; 20:2476-2492.

PMID: 39376489 PMC: 11457048. DOI: 10.3762/bjoc.20.212.

Screening antimicrobial peptides and probiotics using multiple deep learning and directed evolution strategies.

Zhang Y, Liu L, Xu B, Zhang Z, Yang M, He Y. Acta Pharm Sin B. 2024; 14(8):3476-3492.

PMID: 39234615 PMC: 11372459. DOI: 10.1016/j.apsb.2024.05.003.

Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery.

Aksamit N, Tchagang A, Li Y, Ombuki-Berman B. BMC Bioinformatics. 2024; 25(1):255.

PMID: 39090573 PMC: 11295479. DOI: 10.1186/s12859-024-05861-z.