SeqTrim: a High-throughput Pipeline for Pre-processing Any Type of Sequence Read

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2010 Jan 22

PMID 20089148

Citations 90

Authors

Juan Falgueras

Antonio J Lara

Noe Fernandez-Pozo

Francisco R Canton

Guillermo Perez-Trabado

M Gonzalo Claros


Abstract

Background: High-throughput automated sequencing has enabled an exponential growth rate of sequencing data. This requires increasing sequence quality and reliability in order to avoid database contamination with artefactual sequences. The arrival of pyrosequencing exacerbates this problem and necessitates customisable pre-processing algorithms.

Results: SeqTrim has been implemented both as a Web and as a standalone command line application. Already-published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality, vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of several input and output formats allows its inclusion in sequence processing workflows. Due to its specific algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing reads and does not lead to over-trimming.
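To make the kind of stages listed above more concrete, the following is a minimal Python sketch of two of them, adaptor clipping and quality trimming. It is not SeqTrim's implementation: the function names (`clip_adaptor`, `quality_trim`, `preprocess`), the exact-match adaptor search, the sliding-window mean, and the default thresholds (`min_q=20`, `window=5`, `min_overlap=5`) are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def clip_adaptor(seq: str, adaptor: str, min_overlap: int = 5) -> str:
    """Remove a 3' adaptor (illustrative exact matching, no mismatches allowed).

    A full occurrence of the adaptor truncates the read at that position.
    Otherwise, the longest adaptor prefix of at least min_overlap bases
    found at the very end of the read is removed.
    """
    pos = seq.find(adaptor)
    if pos != -1:
        return seq[:pos]
    for k in range(min(len(adaptor), len(seq)) - 1, min_overlap - 1, -1):
        if seq.endswith(adaptor[:k]):
            return seq[:-k]
    return seq


def quality_trim(quals: List[int], min_q: int = 20, window: int = 5) -> Tuple[int, int]:
    """Return (start, end) of the region kept after trimming low-quality ends.

    A window is "good" when its mean quality is >= min_q. The kept region
    runs from the start of the first good window to the end of the last one.
    Returns (0, 0) when no window qualifies.
    """
    n = len(quals)
    good = [i for i in range(n - window + 1)
            if sum(quals[i:i + window]) / window >= min_q]
    if not good:
        return (0, 0)
    return (good[0], good[-1] + window)


def preprocess(seq: str, quals: List[int], adaptor: str) -> str:
    """Chain the two hypothetical stages: adaptor clipping, then quality trimming."""
    clipped = clip_adaptor(seq, adaptor)
    start, end = quality_trim(quals[:len(clipped)])
    return clipped[start:end]


if __name__ == "__main__":
    read = "ACGTACGTACGT" + "AGATCGGAAG"          # insert followed by a partial adaptor
    quals = [2, 2] + [30] * 8 + [2, 2] + [30] * 10
    print(preprocess(read, quals, "AGATCGGAAGAGC"))
```

A real pre-processor would tolerate mismatches in the adaptor search (for example through alignment) and would add further stages for vector, contaminant and low-complexity sequences, which this sketch omits.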

Conclusions: SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including next-generation sequencing. It is easily configurable and provides a friendly interface that allows users to know what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual sequence if desired. The recommended pipeline reveals more information about each sequence than previously described pre-processors and can discard more sequencing or experimental artefacts.

Citing Articles

Cyclo(Pro-Tyr) elicits conserved cellular damage in fungi by targeting the [H+]ATPase Pma1 in plasma membrane domains.

Vela-Corcia D, Hierrezuelo J, Perez-Lorente A, Stincone P, Pakkir Shah A, Grelard A Commun Biol. 2024; 7(1):1253.

PMID: 39362977 PMC: 11449911. DOI: 10.1038/s42003-024-06947-3.

Step-by-Step Metagenomics for Food Microbiome Analysis: A Detailed Review.

Sadurski J, Polak-Berecka M, Staniszewski A, Wasko A Foods. 2024; 13(14).

PMID: 39063300 PMC: 11276190. DOI: 10.3390/foods13142216.

Transcriptomic Insight into the Pollen Tube Growth of L. subsp. Reveals Reprogramming and Pollen-Specific Genes Including New Transcription Factors.

Bullones A, Castro A, Lima-Cabello E, Fernandez-Pozo N, Bautista R, de Dios Alche J Plants (Basel). 2023; 12(16).

PMID: 37631106 PMC: 10459472. DOI: 10.3390/plants12162894.

Sporulation Activated via σ Protects from a Tse1 Peptidoglycan Hydrolase Type VI Secretion System Effector.

Perez-Lorente A, Molina-Santiago C, de Vicente A, Romero D Microbiol Spectr. 2023; :e0504522.

PMID: 36916921 PMC: 10100999. DOI: 10.1128/spectrum.05045-22.

Construction of miRNA-mRNA networks for the identification of lung cancer biomarkers in liquid biopsies.

Espinosa Garcia E, Arroyo Varela M, Jimenez R, Gomez-Maldonado J, Cobo Dols M, Claros M Clin Transl Oncol. 2022; 25(3):643-652.

PMID: 36229739 PMC: 9941226. DOI: 10.1007/s12094-022-02969-7.
