
A Generalized Protein Identification Method for Novel and Diverse Sequencing Technologies

Overview
Specialty Biology
Date 2024 Sep 19
PMID 39296929
Abstract

Protein sequencing is a rapidly evolving field, with substantial progress towards a new generation of protein sequencers. Early devices, however, may not reliably discriminate all 20 amino acids, and so may yield only a partial, noisy, and error-prone signature of a protein. Rather than achieving full sequencing, these devices may aim to identify target proteins by comparing such signatures against databases of known proteins. However, no broadly applicable methods exist for this identification problem. Here, we devise a hidden Markov model method to study the generalized problem of protein identification from noisy signature data. Using a hypothetical sequencing device that can simulate several novel technologies, we show that on the human protein database (20,181 proteins) our method performs well under many operating conditions, including various levels of signal resolvability, different numbers of discriminated amino acids, sequence fragments, and insertion and deletion error rates. Our results demonstrate that high-accuracy protein identification is possible on many early experimental devices. We anticipate that our method will be applicable to a wide range of protein sequencing devices in the future.
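To make the identification problem concrete, the following is a minimal sketch of likelihood-based matching of a noisy signature against a small database. It is not the authors' model: the abstract does not specify the hidden Markov model's states or parameters, so the three-class reduced alphabet, the error rates (`p_sub`, `p_ins`, `p_del`), the toy sequences, and all function names here are assumptions chosen for illustration. The sketch sums over all alignments of the signature to each reduced template with a forward-style dynamic programme in log space, then reports the best-scoring entry.

```python
import math

# Assumption: the hypothetical device resolves only K, C and Y;
# every other amino acid is read as the generic class "X".
CLASS_OF = {"K": "K", "C": "C", "Y": "Y"}
ALPHABET = ["K", "C", "Y", "X"]


def reduce_sequence(protein):
    """Map a full amino-acid sequence to the reduced signature alphabet."""
    return "".join(CLASS_OF.get(aa, "X") for aa in protein)


def log_sum(values):
    """Numerically stable log(sum(exp(v))) for a list of log-probabilities."""
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    top = max(finite)
    return top + math.log(sum(math.exp(v - top) for v in finite))


def log_likelihood(template, observed, p_sub=0.05, p_ins=0.02, p_del=0.02):
    """Forward-style score of an observed signature given a reduced template.

    f[i][j] is the log-probability of having consumed i template positions
    and emitted j observed symbols, summed over all alignments.
    The error rates are illustrative assumptions, not values from the paper.
    """
    if p_ins + p_del >= 1.0:
        raise ValueError("p_ins + p_del must be below 1")
    k = len(ALPHABET)
    log_match = math.log(1.0 - p_ins - p_del)  # read the next residue
    log_ins = math.log(p_ins / k)              # spurious symbol, uniform over classes
    log_del = math.log(p_del)                  # residue skipped by the device
    log_hit = math.log(1.0 - p_sub)            # class read correctly
    log_miss = math.log(p_sub / (k - 1))       # class misread as one of the others

    n, m = len(template), len(observed)
    f = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:
                same = template[i - 1] == observed[j - 1]
                terms.append(f[i - 1][j - 1] + log_match + (log_hit if same else log_miss))
            if i > 0:
                terms.append(f[i - 1][j] + log_del)
            if j > 0:
                terms.append(f[i][j - 1] + log_ins)
            f[i][j] = log_sum(terms)
    return f[n][m]


def identify(observed, database):
    """Return the best-scoring database entry and all log-likelihoods."""
    scores = {
        name: log_likelihood(reduce_sequence(seq), observed)
        for name, seq in database.items()
    }
    return max(scores, key=scores.get), scores


# Toy sequences invented for this example; they are not real proteins.
database = {
    "toy_A": "MKTCYAKLCY",
    "toy_B": "MAAKYYCLKA",
    "toy_C": "MCCKAYAAKK",
}
# Reduced signature of toy_A with one residue dropped (a deletion error).
observed = "XKXCYKXCY"
best, scores = identify(observed, database)
print(best, scores)
```

Working in log space keeps the score finite for templates hundreds of residues long, where raw probabilities would underflow. A full treatment at the scale of the human protein database would also need to handle sequence fragments and device-specific emission models, which this sketch leaves out.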
