Has the Yo-yo Stopped? An Assessment of Human Protein-coding Gene Number

Overview

Journal Proteomics

Date 2004 Jun 3

PMID 15174140

Citations 22

Authors

Christopher Southan

Abstract

Since the identification of approximately 25,000 proteins from the draft human genome assembly in 2001, estimates of the total have oscillated between 30,000 and 70,000. The recently announced genome closure has not generated a consensus gene count despite this being a key parameter for many areas of biology, including drug target discovery and characterization of the human proteome. Contrary to earlier predictions of constitutive under-detection for eukaryotic genes, the latest model organism updates have produced minor increases in worm gene numbers, whereas fly and yeast gene numbers have decreased. The postdraft, precompletion interval has produced large increases in human transcript coverage, continuous improvements in genome assembly and refinements in automated genomic annotation. Notably, these enhancements have resulted in an Ensembl human protein-coding gene number of 22,184, a decrease of 1,862 since the first release. Longitudinal database surveys indicate that redundancy-reduced human mRNA and protein collections are flattening out at approximately 28,000, although Ensembl maps approximately 20,000 known sequences. Observations suggest that high-throughput cloning projects are predominantly extending known genes or sampling new splice forms, and that novel protein discovery has slowed to a trickle. The hypothesis that substantial numbers of short proteins remain experimentally and computationally undetected in mammalian genomes is supported neither by sequence data nor by the extensive homology between mouse and human proteins. Aggregating the independent annotations for complete transcripts from seven completed human chromosomes extrapolates to approximately 25,000 genes. The inclusion of partial putative genes would increase this to above 30,000, but recent data suggest these represent predominantly nonprotein-coding transcripts. Mass spectrometry-based proteomics has already verified more than 10% of human genes but has not identified significant numbers of unpredicted proteins. The available data are thus converging to a basal protein-coding gene number well below 30,000, which could even be as low as 25,000.
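
The Ensembl figures quoted in the abstract imply a first-release gene count that the abstract does not state directly. The short sketch below derives it. Only the two input numbers come from the abstract; the variable names and the derived percentage are illustrative additions, not values reported by the author.

```python
# Figures quoted in the abstract.
current_count = 22_184   # Ensembl human protein-coding genes at the time of writing
decrease = 1_862         # reported drop since the first Ensembl release

# Derived values (assumption: the decrease is measured against the first release).
first_release_count = current_count + decrease
relative_decrease = decrease / first_release_count

print(first_release_count)          # implied first-release count
print(f"{relative_decrease:.1%}")   # drop as a share of the first release
```

Under that reading, the first release would have listed 24,046 protein-coding genes, so the reported decrease amounts to a little under 8% of the original total.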

Citing Articles

A deep audit of the PeptideAtlas database uncovers evidence for unannotated coding genes and aberrant translation.

Rodriguez J, Maquedano M, Cerdan-Velez D, Calvo E, Vazquez J, Tress M. bioRxiv. 2024.

PMID: 39605392 PMC: 11601488. DOI: 10.1101/2024.11.14.623419.

Evidence for widespread translation of 5' untranslated regions.

Rodriguez J, Abascal F, Cerdan-Velez D, Gomez L, Vazquez J, Tress M. Nucleic Acids Res. 2024; 52(14):8112-8126.

PMID: 38953162 PMC: 11317171. DOI: 10.1093/nar/gkae571.

Tuberculous Granuloma: Emerging Insights From Proteomics and Metabolomics.

Sholeye A, Williams A, Loots D, van Furth A, van der Kuip M, Mason S. Front Neurol. 2022; 13:804838.

PMID: 35386409 PMC: 8978302. DOI: 10.3389/fneur.2022.804838.

Evolving the Behavior of Machines: From Micro to Macroevolution.

Mouret J. iScience. 2020; 23(11):101731.

PMID: 33225243 PMC: 7662872. DOI: 10.1016/j.isci.2020.101731.

Loose ends: almost one in five human genes still have unresolved coding status.

Abascal F, Juan D, Jungreis I, Kellis M, Martinez L, Rigau M. Nucleic Acids Res. 2018; 46(14):7070-7084.

PMID: 29982784 PMC: 6101605. DOI: 10.1093/nar/gky587.