Non-synonymous Variations in Cancer and Their Effects on the Human Proteome: Workflow for NGS Data Biocuration and Proteome-wide Analysis of TCGA Data

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2014 Jan 29

PMID 24467687

Citations 8

Authors

Charles Cole

Konstantinos Krampis

Konstantinos Karagiannis

Jonas S Almeida

William J Faison

Mona Motwani

Quan Wan

Anton Golikov

Yang Pan

Vahan Simonyan

Raja Mazumder

Abstract

Background: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases, and sometimes isolated hard disks that are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search, and compute on it.

Results: To address this challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine, which is used upstream of the High-performance Integrated Virtual Environment (HIVE); HIVE encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof of concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program, The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR the available clinical information on patients, mapping the reads to the reference, identifying non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).
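To make the term "non-synonymous" concrete, the following minimal Python sketch classifies a single-base substitution in a coding sequence by translating the reference and altered codons with the standard genetic code. It is an illustration only, not the HIVE or SNVDis implementation; the function name `classify_snv`, the 0-based position convention, and the toy sequence are assumptions introduced here.

```python
# Standard genetic code, built from the conventional T, C, A, G base ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(cds, pos, alt):
    """Classify a substitution at 0-based position `pos` of an in-frame CDS.

    Hypothetical helper: returns (kind, ref_aa, alt_aa), where kind is
    "synonymous" or "non-synonymous".
    """
    offset = pos % 3                      # position of the base within its codon
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


# Toy coding sequence (assumed for illustration): ATG GAA GTT = Met, Glu, Val.
toy_cds = "ATGGAAGTT"
silent = classify_snv(toy_cds, 5, "G")    # GAA -> GAG, Glu stays Glu
missense = classify_snv(toy_cds, 4, "T")  # GAA -> GTA, Glu becomes Val
```

A real pipeline would also need to handle strand, transcript coordinates, and stop-codon changes, which this sketch leaves out.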

Conclusions: The availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be used to identify individual-level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
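The idea of finding SNVs "beyond what dbSNP provides" can be sketched as a set lookup: variants called in a sample are compared against a catalog of known variants, and those absent from the catalog are reported as candidate novel SNVs. The sketch below is an assumption-laden illustration, not the authors' workflow; the tuple layout `(chromosome, position, ref, alt)`, the function name `novel_snvs`, and all coordinates are made up for the example.

```python
def novel_snvs(sample_snvs, known_snvs):
    """Return the sample SNVs that do not appear in the known-variant catalog.

    Each SNV is a hypothetical (chromosome, position, ref, alt) tuple.
    Input order is preserved so results stay aligned with the caller's data.
    """
    known = set(known_snvs)  # set membership keeps each lookup O(1) on average
    return [snv for snv in sample_snvs if snv not in known]


# Made-up coordinates for illustration only; not real dbSNP or TCGA records.
catalog = [("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T")]
sample = [
    ("chr1", 1000, "A", "G"),   # present in the catalog
    ("chr1", 1000, "A", "T"),   # same site, different allele: treated as novel
    ("chr3", 42, "G", "C"),     # absent from the catalog
]
candidates = novel_snvs(sample, catalog)
```

Matching on the full tuple rather than on position alone is a design choice: a different alternate allele at a known site is counted as novel here, which may or may not match how a given study defines novelty.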

Citing Articles

Comprehensive Detection of Single Amino Acid Variants and Evaluation of Their Deleterious Potential in a PANC-1 Cell Line.

Tan Z, Zhu J, Stemmer P, Sun L, Yang Z, Schultz K. J Proteome Res. 2020; 19(4):1635-1646.

PMID: 32058723 PMC: 7162681. DOI: 10.1021/acs.jproteome.9b00840.

Single Amino Acid Variant Profiles of Subpopulations in the MCF-7 Breast Cancer Cell Line.

Tan Z, Nie S, McDermott S, Wicha M, Lubman D. J Proteome Res. 2017; 16(2):842-851.

PMID: 28076950 PMC: 5718353. DOI: 10.1021/acs.jproteome.6b00824.

Impact of germline and somatic missense variations on drug binding sites.

Yan C, Pattabiraman N, Goecks J, Lam P, Nayak A, Pan Y. Pharmacogenomics J. 2016; 17(2):128-136.

PMID: 26810135 PMC: 5380835. DOI: 10.1038/tpj.2015.97.

CrossHub: a tool for multi-way analysis of The Cancer Genome Atlas (TCGA) in the context of gene expression regulation mechanisms.

Krasnov G, Dmitriev A, Melnikova N, Zaretsky A, Nasedkina T, Zasedatelev A. Nucleic Acids Res. 2016; 44(7):e62.

PMID: 26773058 PMC: 4838350. DOI: 10.1093/nar/gkv1478.

Nonsynonymous Single-Nucleotide Variations on Some Posttranslational Modifications of Human Proteins and the Association with Diseases.

Sun B, Zhang M, Cui P, Li H, Jia J, Li Y. Comput Math Methods Med. 2015; 2015:124630.

PMID: 26495027 PMC: 4606098. DOI: 10.1155/2015/124630.
