Beyond Blacklists: A Critical Assessment of Exclusion Set Generation Strategies and Alternative Approaches

Overview
Journal bioRxiv
Date 2025 Feb 20
PMID 39975128
Abstract

Short-read sequencing data can be affected by alignment artifacts in certain genomic regions. Removing reads that overlap these exclusion regions, previously known as Blacklists, can help improve biological signal. Tools such as the widely used Blacklist software facilitate this process, but their algorithmic details and parameter choices are not always clearly documented, which affects reproducibility and biological relevance. We examined the Blacklist software and found that pre-generated exclusion sets were difficult to reproduce because of variability in input data, aligner choice, and read length. We also identified and addressed a coding issue that led to over-annotation of high-signal regions. We further explored the use of "sponge" sequences (unassembled genomic regions such as satellite DNA, ribosomal DNA, and mitochondrial DNA) as an alternative approach. Aligning reads to a genome that includes sponge sequences reduced signal correlation in ChIP-seq data comparably to Blacklist-derived exclusion sets while preserving biological signal. Sponge-based alignment also had minimal impact on RNA-seq gene counts, suggesting broader applicability beyond chromatin profiling. These results highlight the limitations of fixed exclusion sets and suggest that sponge sequences offer a flexible, alignment-guided strategy for reducing artifacts and improving functional genomics analyses.
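The exclusion-set filtering described in the abstract can be illustrated with a minimal sketch. It is not the Blacklist software's algorithm. It only shows the downstream step of dropping aligned reads that overlap a given set of exclusion regions. The function names (`build_index`, `overlaps`, `filter_reads`), the coordinates, and the tuple representation of reads are hypothetical, and intervals are assumed to be 0-based and half-open, as in BED files.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group half-open (chrom, start, end) regions by chromosome and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:
                # Overlapping or touching: extend the previous region.
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        starts = [s for s, _ in merged]
        index[chrom] = (starts, merged)
    return index


def overlaps(index, chrom, start, end):
    """Return True if the half-open interval [start, end) overlaps any indexed region."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Find the last merged region that begins before the read ends.
    i = bisect_right(starts, end - 1) - 1
    # Merged regions are disjoint, so only this one can reach past the read start.
    return i >= 0 and merged[i][1] > start


def filter_reads(reads, exclusion_regions):
    """Keep only reads that do not overlap any exclusion region."""
    index = build_index(exclusion_regions)
    return [r for r in reads if not overlaps(index, *r)]


# Hypothetical exclusion set and aligned reads, for illustration only.
exclusion = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
reads = [
    ("chr1", 50, 100),   # ends exactly where the region starts: kept
    ("chr1", 90, 140),   # overlaps chr1:100-300: removed
    ("chr1", 299, 350),  # overlaps the last base of the region: removed
    ("chr1", 300, 350),  # starts exactly where the region ends: kept
    ("chr2", 49, 80),    # overlaps chr2:0-50: removed
    ("chr3", 10, 60),    # no exclusion regions on chr3: kept
]
kept = filter_reads(reads, exclusion)
print(kept)
```

In practice this step is usually performed on BAM and BED files with established interval tools. The sketch only makes the overlap rule explicit, including the half-open boundary cases.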
