Linking Big Biomedical Datasets to Modular Analysis with Portable Encapsulated Projects
Background: Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, the metadata format used by a data provider is often incompatible with the requirements of a processing tool. There is no broadly accepted standard for organizing metadata across biological projects and bioinformatics tools, which restricts the portability and reusability of both annotated datasets and analysis software.
Results: To address this, we present the Portable Encapsulated Project (PEP) specification, a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEPs include a schema validator framework, which allows an analysis tool to formally define the metadata attributes it requires. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata.
Conclusions: The PEP specification is an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Links to tools and documentation are available at http://pep.databio.org/.
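As a rough illustration of the ideas in the Results paragraph, the sketch below uses only the Python standard library to mimic a sample table, two project-level modifiers, and a required-attribute check. It is not the PEP packages' API: the names `SAMPLE_TABLE`, `PROJECT`, `REQUIRED`, `load_samples`, and `missing_attributes`, the `append`/`derive` layout, and all sample values are assumptions made for this example only.

```python
import csv
import io

# Hypothetical sample table: one row per biological sample.
SAMPLE_TABLE = """sample_name,protocol,source
s1,RNA-seq,local
s2,ATAC-seq,local
"""

# Hypothetical project-level settings (illustrative layout, not the PEP format).
PROJECT = {
    # Constant attributes added to every sample.
    "append": {"genome": "hg38"},
    # Rewrite one attribute from a template keyed by its current value.
    "derive": {
        "attribute": "source",
        "templates": {"local": "/data/{sample_name}.fastq.gz"},
    },
}

# Attributes a hypothetical analysis tool requires for each sample.
REQUIRED = ("sample_name", "protocol", "source")


def load_samples(table_text, project):
    """Parse the sample table and apply the project-level modifiers."""
    derive = project["derive"]
    samples = []
    for row in csv.DictReader(io.StringIO(table_text)):
        sample = dict(row)
        sample.update(project["append"])
        template = derive["templates"].get(sample[derive["attribute"]])
        if template is not None:
            # Fill the template from the sample's own attributes.
            sample[derive["attribute"]] = template.format(**sample)
        samples.append(sample)
    return samples


def missing_attributes(sample, required=REQUIRED):
    """Return the required attribute names that are absent or empty."""
    return [name for name in required if not sample.get(name)]


samples = load_samples(SAMPLE_TABLE, PROJECT)
```

Keeping the environment-specific path template at the project level, rather than in each table row, is the kind of separation that would let the same sample table be reused on a different file system by editing one setting.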