
Linking Big Biomedical Datasets to Modular Analysis with Portable Encapsulated Projects

Overview
Journal: GigaScience
Specialties: Biology, Genetics
Date: 2021 Dec 10
PMID: 34890448
Citations: 14
Abstract

Background: Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, the metadata formats used by a data provider are often incompatible with the requirements of a processing tool. No broadly accepted standard exists for organizing metadata across biological projects and bioinformatics tools, which restricts the portability and reusability of both annotated datasets and analysis software.

Results: To address this, we present the Portable Encapsulated Project (PEP) specification, a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEPs include a schema validation framework, which allows the metadata attributes required by a given data analysis to be defined formally. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata.
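
The ideas in the Results paragraph (a sample table, project-level modifiers that adjust sample metadata, and validation of required attributes) can be illustrated with a minimal sketch. This is not the PEP Python package's API and not the specification's actual configuration format: the names PROJECT_MODIFIERS, REQUIRED_ATTRIBUTES, apply_modifiers, and validate, along with the "append"/"derive" keys, the sample values, and the data_root parameter, are assumptions made up for illustration. Consult the documentation at http://pep.databio.org/ for the real format and packages.

```python
# Illustrative sketch only: hypothetical names, NOT the PEP packages' API.
import csv
import io

# A small sample table, one row per biological sample (invented values).
SAMPLE_TABLE_CSV = (
    "sample_name,protocol,file_id\n"
    "frog_1,RNA-seq,a1\n"
    "frog_2,ATAC-seq,b2\n"
)

# Hypothetical project-level modifiers applied to every sample:
#   "append": add a constant attribute to each sample
#   "derive": build an attribute from a template over other attributes
PROJECT_MODIFIERS = {
    "append": {"genome": "example_genome"},
    "derive": {"file_path": "{data_root}/{file_id}.fastq.gz"},
}

# Attributes a hypothetical processing tool requires on every sample.
REQUIRED_ATTRIBUTES = ["sample_name", "protocol", "file_path"]


def load_samples(csv_text):
    """Parse the sample table into a list of attribute dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def apply_modifiers(samples, modifiers, data_root):
    """Return new sample dicts with appended and derived attributes.

    data_root is supplied per computing environment, so the same
    sample table can be reused on a different machine unchanged.
    """
    processed = []
    for sample in samples:
        sample = dict(sample)  # copy, leave the input untouched
        sample.update(modifiers.get("append", {}))
        for attr, template in modifiers.get("derive", {}).items():
            sample[attr] = template.format_map({**sample, "data_root": data_root})
        processed.append(sample)
    return processed


def validate(samples, required):
    """Return one error message per missing or empty required attribute."""
    errors = []
    for index, sample in enumerate(samples):
        for attr in required:
            if not sample.get(attr):
                name = sample.get("sample_name", "?")
                errors.append(f"sample {index} ({name}): missing '{attr}'")
    return errors


samples = apply_modifiers(
    load_samples(SAMPLE_TABLE_CSV), PROJECT_MODIFIERS, data_root="/data/project"
)
for sample in samples:
    print(sample["sample_name"], sample["file_path"])
print("validation errors:", validate(samples, REQUIRED_ATTRIBUTES))
```

In this sketch the environment-specific part of each file path lives outside the sample table, which is one way to read the abstract's claim about portability across computing environments; the list of required attributes plays the role that a schema plays in the validation framework.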

Conclusions: The PEP specification is an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Links to tools and documentation are available at http://pep.databio.org/.

Citing Articles

FAIRSCAPE: An Evolving AI-readiness Framework for Biomedical Research.

Al Manir S, Levinson M, Niestroy J, Churas C, Parker J, Clark T. bioRxiv. 2025.

PMID: 39763764 PMC: 11703166. DOI: 10.1101/2024.12.23.629818.


Expanding the genome information on Bacillales for biosynthetic gene cluster discovery.

Song L, Nielsen L, Xu X, Mohite O, Nuhamunada M, Xu Z. Sci Data. 2024; 11(1):1267.

PMID: 39572589 PMC: 11582795. DOI: 10.1038/s41597-024-04118-x.


Adaptive adjustment of significance thresholds produces large gains in microbial gene annotations and metabolic insights.

Kananen K, Veseli I, Quiles Perez C, Miller S, Eren A, Bradley P. bioRxiv. 2024.

PMID: 39005339 PMC: 11245035. DOI: 10.1101/2024.07.03.601779.


PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata.

LeRoy N, Khoroshevskyi O, O'Brien A, Stepien R, Arslan A, Sheffield N. Gigascience. 2024; 13.

PMID: 38991851 PMC: 11238423. DOI: 10.1093/gigascience/giae033.


BGCFlow: systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets.

Nuhamunada M, Mohite O, Phaneuf P, Palsson B, Weber T. Nucleic Acids Res. 2024; 52(10):5478-5495.

PMID: 38686794 PMC: 11162802. DOI: 10.1093/nar/gkae314.

