Interpreting Alignment-free Sequence Comparison: What Makes a Score a Good Score?

Overview

Journal NAR Genom Bioinform

Publisher Oxford University Press

Specialty Biology

Date 2022 Sep 8

PMID 36071721

Authors

Martin T Swain

Martin Vickers

Abstract

Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This approach has advantages: for example, visualising and comparing score distributions, including those from true positives, may be a relatively simple way to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise the similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; even when this correlation is weak, the relative difference in length between the two sequences of a pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
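To make the ideas in the abstract concrete, the following is a minimal Python sketch, not the KAST implementation. It assumes the classic D2 word-count statistic (the inner product of two k-mer count vectors) as a stand-in for "an alignment-free similarity score", and it uses a simple thresholded fraction as a stand-in for estimating the probability of true positives from labelled test data. The function names (`kmer_counts`, `d2_score`, `true_positive_fraction`) and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(a: str, b: str, k: int = 3) -> int:
    """D2 similarity: inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for words absent from cb, so unshared words add nothing.
    return sum(n * cb[word] for word, n in ca.items())


def true_positive_fraction(labelled_scores, threshold):
    """Fraction of pairs scoring >= threshold that are labelled true positives.

    labelled_scores: iterable of (score, is_true_positive) from a test set.
    Returns None when no pair reaches the threshold.
    """
    hits = [is_tp for score, is_tp in labelled_scores if score >= threshold]
    return sum(hits) / len(hits) if hits else None


if __name__ == "__main__":
    short = "ACGTACGT"
    longer = short * 4  # same composition, four times the length

    # The raw score grows with sequence length even though the content repeats.
    print(d2_score(short, short), d2_score(longer, longer))

    # Toy labelled test set (made-up values, for illustration only).
    toy = [(12, True), (9, True), (8, False), (3, False), (10, False)]
    print(true_positive_fraction(toy, threshold=9))
```

The first print illustrates why raw scores are hard to interpret on their own: an unnormalised count-based score rises with length regardless of biological relatedness. The second shows, in the simplest possible form, how a labelled test set can turn a score threshold into an empirical estimate of the chance that a hit is a true positive.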

Citing Articles

TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing.

Boumajdi N, Bendani H, Belyamani L, Ibrahimi A BMC Bioinformatics. 2024; 25(1):367.

PMID: 39604838 PMC: 11600722. DOI: 10.1186/s12859-024-05992-3.

Inference of the Life Cycle of Environmental Phages from Genomic Signature Distances to Their Hosts.

Arnau V, Diaz-Villanueva W, Mifsut Benet J, Villasante P, Beamud B, Mompo P Viruses. 2023; 15(5).

PMID: 37243281 PMC: 10222151. DOI: 10.3390/v15051196.
