
Exposing the Cancer Genome Atlas As a SPARQL Endpoint

Overview
Journal: J Biomed Inform
Publisher: Elsevier
Date: 2010 Sep 21
PMID: 20851208
Citations: 12
Abstract

The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating their results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework (RDF) query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to RDF triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.
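
To make the idea of assembling SPARQL queries from domain descriptors more concrete, the following is a minimal Python sketch, not the authors' implementation. It builds one triple pattern per selected descriptor and encodes the result as an HTTP GET request using the `query` parameter defined by the SPARQL protocol. The namespace `http://example.org/tcga/`, the endpoint `http://example.org/sparql`, and the predicate names `gender` and `ageAtDiagnosis` are all hypothetical placeholders and do not come from the paper or from S3DB.

```python
from urllib.parse import urlencode

# Hypothetical namespace and endpoint; neither is taken from the paper.
EX = "http://example.org/tcga/"
ENDPOINT = "http://example.org/sparql"


def assemble_query(descriptors, limit=10):
    """Build a SELECT query with one triple pattern per domain descriptor.

    `descriptors` maps a variable name to a (hypothetical) predicate local
    name; every pattern shares the ?patient subject.
    """
    variables = " ".join(f"?{name}" for name in descriptors)
    patterns = "\n".join(
        f"  ?patient <{EX}{predicate}> ?{name} ."
        for name, predicate in descriptors.items()
    )
    return f"SELECT ?patient {variables}\nWHERE {{\n{patterns}\n}}\nLIMIT {limit}"


def request_url(query, endpoint=ENDPOINT):
    """Encode the query as an HTTP GET request, per the SPARQL protocol."""
    return f"{endpoint}?{urlencode({'query': query})}"


query = assemble_query({"gender": "gender", "age": "ageAtDiagnosis"})
print(query)
print(request_url(query))
```

In this sketch the descriptors (the predicates) are chosen first and the data elements (the values bound to `?gender` and `?age`) are only retrieved when the query runs, which loosely mirrors the separation between domain descriptors and data elements described in the abstract.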

Citing Articles

CrossLink: a novel method for cross-condition classification of cancer subtypes.

Ma C, Sastry K, Flore M, Gehani S, Al-Bozom I, Feng Y. BMC Genomics. 2016; 17 Suppl 7:549.

PMID: 27556419 PMC: 5001207. DOI: 10.1186/s12864-016-2903-z.


kpath: integration of metabolic pathway linked data.

Navas-Delgado I, Garcia-Godoy M, Lopez-Camacho E, Rybinski M, Reyes-Palomares A, Medina M. Database (Oxford). 2015; 2015:bav053.

PMID: 26055101 PMC: 4460419. DOI: 10.1093/database/bav053.


Next generation distributed computing for cancer research.

Agarwal P, Owzar K. Cancer Inform. 2015; 13(Suppl 7):97-109.

PMID: 25983539 PMC: 4412427. DOI: 10.4137/CIN.S16344.


TopFed: TCGA tailored federated query processing and linking to LOD.

Saleem M, Padmanabhuni S, Ngonga Ngomo A, Iqbal A, Almeida J, Decker S. J Biomed Semantics. 2015; 5:47.

PMID: 25937882 PMC: 4417511. DOI: 10.1186/2041-1480-5-47.


QMachine: commodity supercomputing in web browsers.

Wilkinson S, Almeida J. BMC Bioinformatics. 2014; 15:176.

PMID: 24913605 PMC: 4063228. DOI: 10.1186/1471-2105-15-176.

