Searching for Hypothetical Proteins: Theory and Practice Based Upon Original Data and Literature
Overview
A large part of mammalian proteomes consists of hypothetical proteins (HP), i.e. proteins predicted from nucleic acid sequences only, and protein sequences of unknown function. Databases are far from complete, and errors are to be expected. The legion of HP awaits experiments that demonstrate their existence at the protein level, and subsequent bioinformatic handling to assign each protein a tentative function is mandatory. Two-dimensional gel electrophoresis with subsequent mass spectrometric identification of protein spots is an appropriate tool to search for HP in high-throughput mode. Spots are identified by MS or MS/MS measurements (MALDI-TOF, MALDI-TOF-TOF) followed by analysis with software such as Mascot or ProFound. In many cases proteins can thus be unambiguously identified and characterised; if not, de novo sequencing or Q-TOF analysis is warranted. If the protein is not identified, the sequence is submitted to databases for BLAST searches to determine identity, similarity or homology to known proteins. If no significant identity to known structures is observed, the protein sequence is examined for functional domains (databases PROSITE, PRINTS, InterPro, ProDom, Pfam and SMART) and searched for motifs (ELM); finally, protein-protein interaction databases (InterWeaver, STRING) are consulted or predictions from conformations are performed. We here provide information on hypothetical proteins in terms of protein chemical analysis, which is independent of antibody availability and specificity, and of bioinformatic handling, in order to contribute to the extension and completion of protein databases. We include original work on HP in the brain to illustrate the processes of HP identification and functional assignment.
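The identification and annotation cascade described above can be read as a simple decision sequence. The following is a minimal sketch of that reading, not the authors' software: the names `Evidence` and `plan_annotation` and the two boolean flags are hypothetical, and the code only lists which steps would be performed; it does not query any database or instrument.

```python
from dataclasses import dataclass
from typing import List

# Database names are taken from the abstract; everything else is illustrative.
DOMAIN_DATABASES = ["PROSITE", "PRINTS", "InterPro", "ProDom", "Pfam", "SMART"]
INTERACTION_DATABASES = ["InterWeaver", "STRING"]


@dataclass
class Evidence:
    """Hypothetical summary of what is known about one gel spot."""
    identified_by_ms: bool               # Mascot/ProFound gave an unambiguous identification
    significant_blast_hit: bool = False  # BLAST found significant identity to a known protein


def plan_annotation(ev: Evidence) -> List[str]:
    """Return the ordered steps the workflow would take for one spot."""
    steps = ["2-DE spot -> MS or MS/MS (MALDI-TOF, MALDI-TOF-TOF) -> Mascot or ProFound"]
    if ev.identified_by_ms:
        return steps  # unambiguous identification, cascade stops here

    # Not identified from the spectra: obtain sequence information, then search.
    steps.append("de novo sequencing or Q-TOF analysis")
    steps.append("BLAST search for identity/similarity/homology to known proteins")
    if ev.significant_blast_hit:
        return steps

    # No significant identity: fall back to domain, motif and interaction evidence.
    steps.append("functional domain search: " + ", ".join(DOMAIN_DATABASES))
    steps.append("motif search: ELM")
    steps.append(
        "interaction databases (" + ", ".join(INTERACTION_DATABASES)
        + ") or conformation-based prediction"
    )
    return steps


if __name__ == "__main__":
    for step in plan_annotation(Evidence(identified_by_ms=False)):
        print(step)
```

Each later step is reached only when the preceding one fails to give an answer, which mirrors the order in which the abstract introduces the tools.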