
Detecting Anomalous Proteins Using Deep Representations

Overview
Specialty: Biology
Date: 2024 Mar 15
PMID: 38486884
Abstract

Many advances in biomedicine can be attributed to the identification of unusual proteins and genes. Many of these proteins' unique properties were discovered by manual inspection, which is becoming infeasible at the scale of modern protein datasets. Here, we propose to tackle this challenge with anomaly detection methods that automatically identify unexpected properties. We adopt a state-of-the-art anomaly detection paradigm from computer vision to highlight unusual proteins, and we generate meaningful representations without labeled inputs by using pretrained deep neural network models. We apply these protein language models (pLMs) to detect anomalies in function, phylogenetic families, and segmentation tasks. We compute protein anomaly scores to highlight human prion-like proteins, distinguish viral proteins from their host proteome, and mark non-classical ion/metal binding proteins and enzymes. Other tasks concern the segmentation of protein sequences into folded and unstructured regions. We provide candidates for rare functionality (e.g., prion proteins) and show that the anomaly score is useful in 3D folding-related segmentation. Our method outperforms strong baselines and performs well across a variety of tasks. We conclude that combining pLMs with anomaly detection techniques is a valid method for discovering a range of global and local protein characteristics.
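
The abstract does not state which scoring rule is used, so the following is only an illustrative sketch of one common choice in the computer-vision anomaly detection literature: scoring a sample by its mean distance to the k nearest "normal" samples in a pretrained feature space. Everything here is an assumption for illustration: the function name knn_anomaly_scores, the value of k, the Euclidean metric, and the random vectors that stand in for pLM embeddings (no real protein language model is loaded).

```python
import math
import random


def knn_anomaly_scores(reference, queries, k=5):
    """Score each query by its mean Euclidean distance to its k nearest
    reference embeddings. Larger scores mean "less like the reference set".

    Assumption: this kNN rule is a stand-in; the abstract does not name
    the exact scoring function used by the authors.
    """
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the number of reference embeddings")
    scores = []
    for query in queries:
        # Distances from this query to every reference embedding, ascending.
        dists = sorted(math.dist(query, ref) for ref in reference)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical stand-ins for pLM embeddings: random 16-dimensional vectors.
rng = random.Random(0)


def fake_embedding(center, dim=16):
    """Draw a random vector around `center`; a placeholder, not a real embedding."""
    return [rng.gauss(center, 1.0) for _ in range(dim)]


normal_embeddings = [fake_embedding(0.0) for _ in range(200)]  # "typical" proteins
typical_query = fake_embedding(0.0)   # drawn from the same distribution
shifted_query = fake_embedding(6.0)   # drawn far from the reference set

scores = knn_anomaly_scores(normal_embeddings, [typical_query, shifted_query], k=5)
print(scores)  # the second score should be clearly larger than the first
```

Under the same assumptions, the function could be applied to per-residue embeddings instead of whole-protein embeddings, yielding one score per position that could then be thresholded for a segmentation-style task such as separating folded from unstructured regions.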

Citing Articles

Fuzz Testing Molecular Representation Using Deep Variational Anomaly Generation.

Nogueira V, Sharma R, Guido R, Keiser M. J Chem Inf Model. 2025; 65(4):1911-1927.

PMID: 39908426 PMC: 11863373. DOI: 10.1021/acs.jcim.4c01876.
