From Contigs Towards Chromosomes: Automatic Improvement of Long Read Assemblies (ILRA)

Overview

Journal Brief Bioinform

Publisher Oxford University Press

Specialty Biology

Date 2023 Jul 5

PMID 37406192

Authors

Jose Luis Ruiz

Susanne Reimering

Juan David Escobar-Prieto

Nicolas M B Brancucci

Diego F Echeverry

Abdirahman I Abdi

Matthias Marti

Elena Gomez-Diaz

Thomas D Otto

Affiliations


Abstract

Recent advances in long read technologies not only enable large consortia to aim to sequence all eukaryotes on Earth, but they also allow individual laboratories to sequence their species of interest with relatively low investment. Long read technologies embody the promise of overcoming scaffolding problems associated with repeats and low complexity sequences, but the number of contigs often far exceeds the number of chromosomes and they may contain many insertion and deletion errors around homopolymer tracts. To overcome these issues, we have implemented the ILRA pipeline to correct long read-based assemblies. Contigs are first reordered, renamed, merged, circularized, or filtered if erroneous or contaminated. Illumina short reads are used subsequently to correct homopolymer errors. We successfully tested our approach by improving the genome sequences of Homo sapiens, Trypanosoma brucei, and Leptosphaeria spp., and by generating four novel Plasmodium falciparum assemblies from field samples. We found that correcting homopolymer tracts reduced the number of genes incorrectly annotated as pseudogenes, but an iterative approach seems to be required to correct more sequencing errors. In summary, we describe and benchmark the performance of our new tool, which improved the quality of novel long read assemblies up to 1 Gbp. The pipeline is available at GitHub: https://github.com/ThomasDOtto/ILRA.
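Two of the ideas named in the abstract, filtering and renaming contigs and locating homopolymer tracts, can be illustrated with a minimal Python sketch. This is not the ILRA implementation: the function names, the size threshold, the ordering by length, and the naming prefix are all assumptions chosen for illustration only.

```python
import re
from typing import Dict, List, Tuple


def find_homopolymers(seq: str, min_len: int = 5) -> List[Tuple[int, int, str]]:
    """Return (start, end, base) for every run of a single base >= min_len.

    Coordinates are 0-based and end-exclusive. The min_len default is an
    illustrative assumption, not a value taken from the ILRA pipeline.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


def order_and_rename(
    contigs: Dict[str, str], min_size: int = 1000, prefix: str = "contig"
) -> Dict[str, str]:
    """Drop contigs shorter than min_size, sort by length, and rename.

    Sorting by decreasing length is a hypothetical stand-in for the
    reordering step; the abstract does not specify the ordering used.
    """
    kept = [s for s in contigs.values() if len(s) >= min_size]
    kept.sort(key=len, reverse=True)
    return {f"{prefix}_{i}": s for i, s in enumerate(kept, start=1)}


if __name__ == "__main__":
    print(find_homopolymers("ACGTAAAAAAGT", min_len=5))
    print(order_and_rename({"x": "ACGT", "y": "ACGTACGT", "z": "AC"}, min_size=3))
```

Tracts reported by a function like `find_homopolymers` are the regions where long read assemblies most often carry insertion and deletion errors, which is why the pipeline uses short reads to correct them.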

Citing Articles

Comparison of Nanopore with Illumina Whole Genome Assemblies of the Epstein-Barr Virus in Burkitt Lymphoma.

Kim Jr I, Kim I, Fola A, Puig E, Maina T, Hui S medRxiv. 2025.

PMID: 40061313 PMC: 11888525. DOI: 10.1101/2025.02.21.25322471.

Genome drafts of strains C2 and C3 isolated from honey bees in Spain.

Ruiz J, Marin A, Carreira de Paula J, Gomez-Moracho T, Garcia Olmedo P, Andres-Leon E Microbiol Resour Announc. 2024; 14(1):e0064224.

PMID: 39601516 PMC: 11737077. DOI: 10.1128/mra.00642-24.

Long-Read Sequencing and Genome Assembly Pipeline of Two Clones (3D7, W2) Using Only the PromethION Sequencer from Oxford Nanopore Technologies without Whole-Genome Amplification.

Delandre O, Lamer O, Loreau J, Papa Mze N, Fonta I, Mosnier J Biology (Basel). 2024; 13(2).

PMID: 38392307 PMC: 10886359. DOI: 10.3390/biology13020089.

Highly accurate genome assembly of an improved high-yielding silkworm strain, Nichi01.

Waizumi R, Tsubota T, Jouraku A, Kuwazaki S, Yokoi K, Iizuka T G3 (Bethesda). 2023; 13(4).

PMID: 36814357 PMC: 10085791. DOI: 10.1093/g3journal/jkad044.
