findGSE: Estimating Genome Size Variation Within Human and Arabidopsis Using K-mer Frequencies
Motivation: Analyzing k-mer frequencies in whole-genome sequencing data is becoming a common method for estimating genome size (GS). However, the accuracy of this method has not been investigated, in particular whether it can capture intra-species GS variation.
Results: We present findGSE, which fits skew normal distributions to k-mer frequencies to estimate GS. findGSE outperformed existing tools in an extensive simulation study. When estimating the GSs of 89 Arabidopsis thaliana accessions, findGSE showed the highest capability in capturing GS variation. Applied to 71 female and 71 male human individuals, findGSE delivered an average haploid human GS of 3039 Mb, and female genomes were on average 41 Mb larger than male genomes, in astonishing agreement with the size difference between the X and Y chromosomes. Further analysis showed that human GS variation is linked to geographical patterns, with significant differences between populations, which can be explained by variable abundance of LINE-1 retrotransposons.
Availability and Implementation: The findGSE R package is freely available at https://github.com/schneebergerlab/findGSE and is supported on Linux and Mac systems.
Contact: schneeberger@mpipz.mpg.de.
Supplementary Information: Supplementary data are available at Bioinformatics online.
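Illustration: k-mer-based GS estimation, as described in the Motivation, rests on a simple relation: GS is roughly the total number of k-mer occurrences divided by the k-mer depth of the main peak. The Python sketch below shows only that basic peak-based relation on a toy histogram. It is not findGSE's method, which fits skew normal distributions, and the function name, the `min_depth` error cutoff, and the histogram values are assumptions made up for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Naive peak-based genome size estimate from a k-mer histogram.

    histogram maps k-mer depth (how often a k-mer occurs in the reads)
    to the number of distinct k-mers observed at that depth.
    min_depth is an assumed cutoff below which k-mers are treated as
    sequencing errors and ignored.
    """
    kept = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")

    # Depth at which the most distinct k-mers occur (the main peak).
    peak_depth = max(kept, key=kept.get)

    # Total k-mer occurrences contributed by the retained depths.
    total_kmers = sum(depth * n for depth, n in kept.items())

    return total_kmers / peak_depth


# Toy histogram (invented values): an error spike at depth 1, a main
# peak centred on depth 30, and a small two-copy repeat peak at depth 60.
toy_histogram = {
    1: 500_000,
    28: 100, 29: 200, 30: 400, 31: 200, 32: 100,
    60: 50,
}

genome_size = estimate_genome_size(toy_histogram)
print(genome_size)  # 1100.0 (1000 single-copy k-mers + 50 k-mers present twice)
```

A point estimate like this depends heavily on where the error cutoff and the peak are placed, which is one reason a tool might fit a distribution to the whole histogram instead.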