Functional Coverage of the Human Genome by Existing Structures, Structural Genomics Targets, and Homology Models

Overview

Journal PLoS Comput Biol

Specialty Biology

Date 2005 Aug 25

PMID 16118666

Citations 40

Authors

Lei Xie

Philip E Bourne

Abstract

The bias in protein structure and function space resulting from experimental limitations and targeting of particular functional classes of proteins by structural biologists has long been recognized, but never continuously quantified. Using the Enzyme Commission and the Gene Ontology classifications as a reference frame, and integrating structure data from the Protein Data Bank (PDB), target sequences from the structural genomics projects, structure homology derived from the SUPERFAMILY database, and genome annotations from Ensembl and NCBI, we provide a quantified view, both at the domain and whole-protein levels, of the current and projected coverage of protein structure and function space relative to the human genome. Protein structures currently provide at least one domain that covers 37% of the functional classes identified in the genome; whole structure coverage exists for 25% of the genome. If all the structural genomics targets were solved (twice the current number of structures in the PDB), it is estimated that structures of one domain would cover 69% of the functional classes identified and complete structure coverage would be 44%. Homology models from existing experimental structures extend the 37% coverage to 56% of the genome as single domains and 25% to 31% for complete structures. Coverage from homology models is not evenly distributed by protein family, reflecting differing degrees of sequence and structure divergence within families. While these data provide coverage, conversely, they also systematically highlight functional classes of proteins for which structures should be determined. Current key functional families without structure representation are highlighted here; updated information on the "most wanted list" that should be solved is available on a weekly basis from http://function.rcsb.org:8080/pdb/function_distribution/index.html.
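The distinction the abstract draws between domain-level and whole-protein coverage can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `coverage_by_class`, the record layout, and the toy proteins are hypothetical. It also assumes that a functional class counts as covered at the domain level if any protein in it has at least one domain with a structure, and at the whole-protein level if any protein in it has every domain covered.

```python
def coverage_by_class(proteins):
    """Return (domain_level, whole_protein) coverage as fractions of functional classes.

    Each protein is a dict with hypothetical keys:
      'classes' - set of functional class IDs (e.g. EC numbers or GO terms)
      'domains' - total number of domains in the protein (>= 1)
      'covered' - number of those domains with a structure
    """
    all_classes, domain_hit, whole_hit = set(), set(), set()
    for p in proteins:
        for c in p["classes"]:
            all_classes.add(c)
            if p["covered"] >= 1:             # at least one domain has a structure
                domain_hit.add(c)
            if p["covered"] == p["domains"]:  # every domain has a structure
                whole_hit.add(c)
    n = len(all_classes)
    return len(domain_hit) / n, len(whole_hit) / n


# Toy data, invented for illustration only.
TOY = [
    {"classes": {"EC 2.7.11.1"}, "domains": 2, "covered": 1},
    {"classes": {"EC 3.4.21.77"}, "domains": 1, "covered": 1},
    {"classes": {"GO:0004930"}, "domains": 3, "covered": 0},
    {"classes": {"EC 2.7.11.1"}, "domains": 1, "covered": 0},
]

domain_cov, whole_cov = coverage_by_class(TOY)
print(f"domain-level: {domain_cov:.0%}, whole-protein: {whole_cov:.0%}")
```

In this toy set, two of the three classes have at least one structurally covered domain, but only one has a fully covered protein. That is the same kind of gap the abstract reports between its 37% and 25% figures.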

Citing Articles

Computational approaches for molecular characterization and structure-based functional elucidation of a hypothetical protein from Mycobacterium tuberculosis.

Saikat A. Genomics Inform. 2023; 21(2):e25.

PMID: 37415455 PMC: 10326535. DOI: 10.5808/gi.23001.

Network Pharmacology- and Molecular Docking-Based Identification of Potential Phytocompounds from in the Treatment of Inflammation.

Obaidullah A, Alanazi M, Alsaif N, Alanazi A, Albassam H, Az A. Evid Based Complement Alternat Med. 2022; 2022:8037488.

PMID: 35140801 PMC: 8820870. DOI: 10.1155/2022/8037488.

Heterogeneous Multi-Layered Network Model for Omics Data Integration and Analysis.

Lee B, Zhang S, Poleksic A, Xie L. Front Genet. 2020; 10:1381.

PMID: 32063919 PMC: 6997577. DOI: 10.3389/fgene.2019.01381.

Direct Binding of the Flexible C-Terminal Segment of Periaxin to β4 Integrin Suggests a Molecular Basis for CMT4F.

Raasakka A, Linxweiler H, Brophy P, Sherman D, Kursula P. Front Mol Neurosci. 2019; 12:84.

PMID: 31024253 PMC: 6465933. DOI: 10.3389/fnmol.2019.00084.

STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets.

Szklarczyk D, Gable A, Lyon D, Junge A, Wyder S, Huerta-Cepas J. Nucleic Acids Res. 2018; 47(D1):D607-D613.

PMID: 30476243 PMC: 6323986. DOI: 10.1093/nar/gky1131.
