Implementing a Genomic Data Management System Using iRODS in the Wellcome Trust Sanger Institute

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2011 Sep 13

PMID 21906284

Citations 13

Authors

Gen-Tao Chiang

Peter Clapham

Guoying Qi

Kevin Sale

Guy Coates

Abstract

Background: Increasingly large amounts of DNA sequencing data are being generated within the Wellcome Trust Sanger Institute (WTSI). The traditional file system struggles to handle these increasing amounts of sequence data. A good data management system therefore needs to be implemented and integrated into the current WTSI infrastructure. Such a system enables good management of the IT infrastructure of the sequencing pipeline and allows biologists to track their data.

Results: We have chosen a data grid system, iRODS (the integrated Rule-Oriented Data System), to act as the data management system for the WTSI. iRODS provides a rule-based system management approach which makes data replication much easier and provides extra data protection. Unlike the metadata provided by traditional file systems, the metadata system of iRODS is comprehensive and allows users to customize their own application-level metadata. Users and IT experts in the WTSI can then query the metadata to find and track data. The aim of this paper is to describe how we designed and used (from both system and user viewpoints) iRODS as a data management system. Details are given about the problems faced and the solutions found when iRODS was implemented. A simple use case describing how users within the WTSI use iRODS is also introduced.

Conclusions: iRODS has been implemented and works as the production system for the sequencing pipeline of the WTSI. Both biologists and IT experts can now track and manage data, which could not previously be achieved. This novel approach allows biologists to define their own metadata and query the genomic data using those metadata.

Citing Articles

Current state of data stewardship tools in life science.

Aksenova A, Johny A, Adams T, Gribbon P, Jacobs M, Hofmann-Apitius M Front Big Data. 2024; 7:1428568.

PMID: 39351001 PMC: 11439729. DOI: 10.3389/fdata.2024.1428568.

Journeying towards best practice data management in biodiversity genomics.

Forsdick N, Wold J, Angelo A, Bissey F, Hart J, Head M Mol Ecol Resour. 2023; 25(2):e13880.

PMID: 37873890 PMC: 11696474. DOI: 10.1111/1755-0998.13880.

SODAR: managing multiomics study data and metadata.

Nieminen M, Stolpe O, Kuhring M, Weiner J, Pett P, Beule D Gigascience. 2023; 12.

PMID: 37498129 PMC: 10373112. DOI: 10.1093/gigascience/giad052.

Data Management Plans in the genomics research revolution of Africa: Challenges and recommendations.

Fadlelmola F, Zass L, Chaouch M, Samtal C, Ras V, Kumuthini J J Biomed Inform. 2021; 122:103900.

PMID: 34506960 PMC: 9123155. DOI: 10.1016/j.jbi.2021.103900.

Named Data Networking for Genomics Data Management and Integrated Workflows.

Ogle C, Reddick D, McKnight C, Biggs T, Pauly R, Ficklin S Front Big Data. 2021; 4:582468.

PMID: 33748749 PMC: 7968724. DOI: 10.3389/fdata.2021.582468.
