
Learned Protein Embeddings for Machine Learning

Overview
Journal: Bioinformatics
Specialty: Biology
Date: 2018 Mar 28
PMID: 29584811
Citations: 89
Abstract

Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.
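To make the dimensionality point concrete, the following is a minimal sketch of one common sequence-to-vector conversion, one-hot encoding. It is an illustration only: the abstract does not name the baseline representations, and the function name `one_hot_encode` and the example sequence are hypothetical, not taken from the authors' code.

```python
# Assumption: the standard 20-letter amino-acid alphabet, no gaps or ambiguity codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence):
    """Flatten a protein sequence into a 0/1 vector of length 20 * len(sequence)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vector = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # exactly one position is set per residue
        vector.extend(row)
    return vector


seq = "MKTAYIAKQR"  # made-up 10-residue example
encoded = one_hot_encode(seq)
print(len(encoded))  # 200, i.e. 20 dimensions per residue
```

Under this encoding the vector length grows linearly with sequence length (a 300-residue protein yields 6000 dimensions) and all sequences must share a length or be aligned, whereas a learned embedding maps every sequence to a vector of one fixed, much smaller size.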

Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.
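As a rough illustration of the downstream modeling step, the sketch below fits a Gaussian process regressor to low-dimensional vectors and computes the posterior mean for a new vector. It is not the authors' implementation: the squared-exponential kernel, the zero-mean prior, the noise level, and the 2-dimensional "embeddings" with their measured values are all assumptions invented for this example.

```python
import math


def rbf_kernel(x, y, length_scale=1.0):
    """Squared-exponential kernel between two vectors (an assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def cholesky_solve(lower, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x


def gp_posterior_mean(train_x, train_y, test_x, noise=1e-6):
    """Posterior mean of a zero-mean GP: k(x*, X) (K + noise * I)^-1 y."""
    n = len(train_x)
    gram = [
        [rbf_kernel(train_x[i], train_x[j]) + (noise if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    alpha = cholesky_solve(cholesky(gram), train_y)
    return [sum(rbf_kernel(x, xi) * a for xi, a in zip(train_x, alpha)) for x in test_x]


# Made-up 2-D "embeddings" and measured property values, for illustration only.
train_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_values = [0.1, 0.5, 0.4, 0.9]
predictions = gp_posterior_mean(train_embeddings, train_values, [[0.5, 0.5], [1.0, 1.0]])
print(predictions)
```

Because the kernel only sees distances between input vectors, the same code works unchanged for any fixed-length representation; a lower-dimensional input mainly reduces the cost of each kernel evaluation and the amount of data needed to cover the input space.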

Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

In vivo electrophysiology recordings and computational modeling can predict octopus arm movement.

Gedela N, Radawiec R, Salim S, Richie J, Chestek C, Draelos A. Bioelectron Med. 2025; 11(1):4.

PMID: 39948616 PMC: 11827351. DOI: 10.1186/s42234-025-00166-9.


Current methods for detecting and assessing HIV-1 antibody resistance.

Odidika S, Pirkl M, Lengauer T, Schommers P. Front Immunol. 2025; 15:1443377.

PMID: 39835119 PMC: 11743526. DOI: 10.3389/fimmu.2024.1443377.


AEmiGAP: AutoEncoder-Based miRNA-Gene Association Prediction Using Deep Learning Method.

Yoon S, Yoon H, Cho J, Lee K. Int J Mol Sci. 2024; 25(23).

PMID: 39684787 PMC: 11641653. DOI: 10.3390/ijms252313075.


Single unit electrophysiology recordings and computational modeling can predict octopus arm movement.

Gedela N, Salim S, Radawiec R, Richie J, Chestek C, Draelos A. bioRxiv. 2024.

PMID: 39345497 PMC: 11430158. DOI: 10.1101/2024.09.13.612676.


PEZy-miner: An artificial intelligence driven approach for the discovery of plastic-degrading enzyme candidates.

Jiang R, Yue Z, Shang L, Wang D, Wei N. Metab Eng Commun. 2024; 19:e00248.

PMID: 39310048 PMC: 11414552. DOI: 10.1016/j.mec.2024.e00248.

