PLM_Sol: Predicting Protein Solubility by Benchmarking Multiple Protein Language Models with the Updated Escherichia Coli Protein Solubility Dataset

Overview

Journal Brief Bioinform

Publisher Oxford University Press

Specialty Biology

Date 2024 Aug 23

PMID 39179250

Authors

Xuechun Zhang

Xiaoxuan Hu

Tongtong Zhang

Ling Yang

Chunhong Liu

Ning Xu

Haoyi Wang

Wen Sun


Abstract

Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. As sequencing and gene synthesis costs have fallen, high-throughput experimental screening coupled with tailored bioinformatic prediction has been increasingly adopted for developing novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process, yet accurate prediction of solubility remains a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models along with the increasing availability of protein solubility data inferred from structural databases such as the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previously reported models, achieving a 6.4% increase in accuracy, a 9.0% increase in F1_score, and an 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation using our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showed good performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both the independent test set and the experimental set, making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.
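
For readers unfamiliar with the three metrics quoted above, the following is a minimal sketch of how accuracy, F1 score, and Matthews correlation coefficient are computed for a binary soluble/insoluble classifier. It is not the authors' evaluation code: the function names and the toy labels (1 = soluble, 0 = insoluble) are assumptions made purely for illustration, and the values have no connection to the PLM_Sol results.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = soluble)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("F1 score:", f1_score(*counts))
print("MCC:", matthews_corrcoef(*counts))
```

Unlike accuracy, the Matthews correlation coefficient accounts for all four cells of the confusion matrix, which is why it is often reported alongside accuracy and F1 when the soluble and insoluble classes are imbalanced.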
