A Framework for Organizing Cancer-related Variations from Existing Databases, Publications and NGS Data Using a High-performance Integrated Virtual Environment (HIVE)

Overview

Journal Database (Oxford)

Specialty Biology

Date 2014 Mar 27

PMID 24667251

Citations 39

Authors

Tsung-Jung Wu

Amirhossein Shamsaddini

Yang Pan

Krista Smith

Daniel J Crichton

Vahan Simonyan

Raja Mazumder

Abstract

Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators have led to a rich repository of information on functional sites of genes and proteins. This information, along with variation-related annotation, can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for the presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform, HIVE (High-performance Integrated Virtual Environment), has been developed for storing, analyzing, computing and curating NGS data and associated metadata. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identification of novel or common nsSNVs that can be prioritized in validation studies.
Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu.
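
The core idea in the abstract, checking whether an nsSNV falls on an annotated functional site, can be illustrated with a minimal Python sketch. This is not the HIVE or BioMuta implementation. The record types, field names, protein identifiers and example data below are all hypothetical, and variants are assumed to be already mapped to 1-based protein residue positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalSite:
    """Hypothetical annotation record; coordinates are 1-based and inclusive."""
    protein: str
    start: int
    end: int
    kind: str


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record, already mapped to a protein residue."""
    protein: str
    position: int
    ref: str  # reference amino acid
    alt: str  # observed amino acid


def is_nonsynonymous(variant):
    """A variant is treated as non-synonymous if the amino acid changes."""
    return variant.ref != variant.alt


def variants_at_functional_sites(variants, sites):
    """Return (variant, site) pairs where an nsSNV lies inside a site."""
    # Group sites by protein so each variant is compared only with
    # annotations on the same protein.
    sites_by_protein = {}
    for site in sites:
        sites_by_protein.setdefault(site.protein, []).append(site)

    hits = []
    for variant in variants:
        if not is_nonsynonymous(variant):
            continue
        for site in sites_by_protein.get(variant.protein, []):
            if site.start <= variant.position <= site.end:
                hits.append((variant, site))
    return hits


# Made-up example data, for illustration only.
sites = [
    FunctionalSite("PROT_A", 100, 100, "active site"),
    FunctionalSite("PROT_A", 200, 210, "binding site"),
]
variants = [
    Variant("PROT_A", 100, "H", "R"),  # nsSNV on the active site
    Variant("PROT_A", 150, "G", "D"),  # nsSNV outside any annotated site
    Variant("PROT_A", 205, "K", "K"),  # synonymous, so it is skipped
    Variant("PROT_B", 100, "S", "T"),  # protein without annotations
]
hits = variants_at_functional_sites(variants, sites)
```

A linear scan per protein is enough for a small illustration. At the data volumes the abstract mentions, an interval index or sorted lookup would likely be needed, but that choice is outside what the abstract describes.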

Citing Articles

Zhou X, Yao L, Zhou X, Cong R, Luan J, Wei X Front Oncol. 2022; 12:837155.

PMID: 35860590 PMC: 9291251. DOI: 10.3389/fonc.2022.837155.

Human CEACAM1 N-domain dimerization is independent from glycan modifications.

Dufrisne M, Swope N, Kieber M, Yang J, Han J, Li J Structure. 2022; 30(5):658-670.e5.

PMID: 35219398 PMC: 9081242. DOI: 10.1016/j.str.2022.02.003.

A Model for the Signal Initiation Complex Between Arrestin-3 and the Src Family Kinase Fgr.

Perez I, Berndt S, Agarwal R, Castro M, Vishnivetskiy S, Smith J J Mol Biol. 2021; 434(2):167400.

PMID: 34902430 PMC: 8752512. DOI: 10.1016/j.jmb.2021.167400.

Glycosylation of Serum Clusterin in Wild-Type Transthyretin-Associated (ATTRwt) Amyloidosis: A Study of Disease-Associated Compositional Features Using Mass Spectrometry Analyses.

Torres-Arancivia C, Chang D, Hackett W, Zaia J, Connors L Biochemistry. 2020; 59(45):4367-4378.

PMID: 33141553 PMC: 8082438. DOI: 10.1021/acs.biochem.0c00590.

Structural characterization of the ICOS/ICOS-L immune complex reveals high molecular mimicry by therapeutic antibodies.

Rujas E, Cui H, Sicard T, Semesi A, Julien J Nat Commun. 2020; 11(1):5066.

PMID: 33033255 PMC: 7545189. DOI: 10.1038/s41467-020-18828-4.
