PMID: 17355171

The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

Abstract

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

Citing Articles

From nets to networks: tools for deciphering phytoplankton metabolic interactions within communities and their global significance.

Nef C, Pierella Karlusich J, Bowler C. Philos Trans R Soc Lond B Biol Sci. 2024; 379(1909):20230172.

PMID: 39034691 PMC: 11293860. DOI: 10.1098/rstb.2023.0172.


Diversity and potential host-interactions of viruses inhabiting deep-sea seamount sediments.

Yu M, Zhang M, Zeng R, Cheng R, Zhang R, Hou Y. Nat Commun. 2024; 15(1):3228.

PMID: 38622147 PMC: 11018836. DOI: 10.1038/s41467-024-47600-1.


Marine picoplankton metagenomes and MAGs from eleven vertical profiles obtained by the Malaspina Expedition.

Sanchez P, Coutinho F, Sebastian M, Pernice M, Rodriguez-Martinez R, Salazar G. Sci Data. 2024; 11(1):154.

PMID: 38302528 PMC: 10834958. DOI: 10.1038/s41597-024-02974-1.


Seasonal patterns in microbial carbon and iron transporter expression in the Southern Ocean.

Debeljak P, Bayer B, Sun Y, Herndl G, Obernosterer I. Microbiome. 2023; 11(1):187.

PMID: 37596690 PMC: 10439609. DOI: 10.1186/s40168-023-01600-3.


Identification of microbial metabolic functional guilds from large genomic datasets.

Reynolds R, Hyun S, Tully B, Bien J, Levine N. Front Microbiol. 2023; 14:1197329.

PMID: 37455725 PMC: 10348482. DOI: 10.3389/fmicb.2023.1197329.

