FastANOVA: an Efficient Algorithm for Genome-Wide Association Study

Overview

Journal KDD

Date 2010 Oct 16

PMID 20945829

Citations 17

Authors

Xiang Zhang

Fei Zou

Wei Wang

Abstract

Studying the association between quantitative phenotypes (such as height or weight) and single nucleotide polymorphisms (SNPs) is an important problem in biology. To understand the underlying mechanisms of complex phenotypes, it is often necessary to consider joint genetic effects across multiple SNPs. The ANOVA (analysis of variance) test is routinely used in association studies. Important findings from studying gene-gene (SNP-pair) interactions are appearing in the literature. However, the number of SNPs can be in the millions, so evaluating joint effects of SNPs is a challenging task even for SNP-pairs. Moreover, because large numbers of SNPs are correlated, a permutation procedure is preferred over simple Bonferroni correction for properly controlling the family-wise error rate and retaining mapping power, which dramatically increases the computational cost of an association study.

In this paper, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in a batch mode, which also supports large permutation tests. We derive an upper bound on the SNP-pair ANOVA test value, which can be expressed as the sum of two terms. The first term is based on the single-SNP ANOVA test. The second term is based on the SNPs alone and is independent of any phenotype permutation. Furthermore, SNP-pairs can be organized into groups, each of which shares a common upper bound. This allows for maximum reuse of intermediate computation, efficient upper bound estimation, and effective SNP-pair pruning. Consequently, FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs, without the risk of missing any significant ones. Extensive experiments demonstrate that FastANOVA is orders of magnitude faster than the brute-force implementation of ANOVA tests on all SNP-pairs.
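To make the cost that the abstract describes concrete, the following is a minimal Python sketch of the brute-force baseline, not of FastANOVA itself: it runs an ANOVA test on every SNP-pair and uses a max-statistic permutation procedure to estimate a family-wise significance threshold. The function names (`anova_f`, `pair_f_stats`, `permutation_threshold`), the 0/1/2 genotype coding, and the choice to treat each joint genotype of a pair as one ANOVA group are assumptions made for illustration; the abstract does not specify these details, and the upper-bound pruning is not reproduced here.

```python
import itertools
import numpy as np


def anova_f(groups, y):
    """One-way ANOVA F-statistic of phenotype y across integer group labels."""
    labels, inverse = np.unique(groups, return_inverse=True)
    k, n = len(labels), len(y)
    if k < 2 or n <= k:
        return 0.0  # test is undefined; treat as no evidence of association
    counts = np.bincount(inverse)
    sums = np.bincount(inverse, weights=y)
    grand_mean = y.mean()
    ss_between = np.sum(counts * (sums / counts - grand_mean) ** 2)
    ss_within = np.sum((y - grand_mean) ** 2) - ss_between
    if ss_within <= 0:
        return float("inf")
    return float((ss_between / (k - 1)) / (ss_within / (n - k)))


def pair_f_stats(snps, y):
    """F-statistic for every SNP-pair.

    snps: (n_samples, n_snps) integer array, genotypes coded 0/1/2 (assumed).
    Each joint genotype of a pair is treated as one ANOVA group (assumed).
    """
    n_snps = snps.shape[1]
    return {
        (i, j): anova_f(snps[:, i] * 3 + snps[:, j], y)
        for i, j in itertools.combinations(range(n_snps), 2)
    }


def permutation_threshold(snps, y, n_perm=100, alpha=0.05, seed=0):
    """Estimate a family-wise threshold from the maximum F over all pairs
    in each phenotype permutation (brute force: every pair, every permutation)."""
    rng = np.random.default_rng(seed)
    max_stats = [
        max(pair_f_stats(snps, rng.permutation(y)).values())
        for _ in range(n_perm)
    ]
    return float(np.quantile(max_stats, 1 - alpha))


def significant_pairs(snps, y, n_perm=100, alpha=0.05, seed=0):
    """Return the SNP-pairs whose observed F exceeds the permutation threshold."""
    threshold = permutation_threshold(snps, y, n_perm, alpha, seed)
    observed = pair_f_stats(snps, y)
    return {pair: f for pair, f in observed.items() if f > threshold}
```

With m SNPs and p permutations this sketch performs on the order of p·m² ANOVA tests, which is the cost the abstract says FastANOVA avoids by pruning SNP-pairs whose shared upper bound falls below the threshold. Comparing raw F values across pairs also presumes that the pairs have the same number of genotype groups, which is a simplification of this sketch.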

Citing Articles

EpiMOGA: An Epistasis Detection Method Based on a Multi-Objective Genetic Algorithm.

Chen Y, Xu F, Pian C, Xu M, Kong L, Fang J Genes (Basel). 2021; 12(2).

PMID: 33525573 PMC: 7911965. DOI: 10.3390/genes12020191.

Epi-GTBN: an approach of epistasis mining based on genetic Tabu algorithm and Bayesian network.

Guo Y, Zhong Z, Yang C, Hu J, Jiang Y, Liang Z BMC Bioinformatics. 2019; 20(1):444.

PMID: 31455207 PMC: 6712799. DOI: 10.1186/s12859-019-3022-z.

The early transcriptome response of cassava (Manihot esculenta Crantz) to mealybug (Phenacoccus manihoti) feeding.

Rauwane M, Odeny D, Millar I, Rey C, Rees J PLoS One. 2018; 13(8):e0202541.

PMID: 30133510 PMC: 6105004. DOI: 10.1371/journal.pone.0202541.

The search for gene-gene interactions in genome-wide association studies: challenges in abundance of methods, practical considerations, and biological interpretation.

Ritchie M, Van Steen K Ann Transl Med. 2018; 6(8):157.

PMID: 29862246 PMC: 5952010. DOI: 10.21037/atm.2018.04.05.

An Efficient Nonlinear Regression Approach for Genome-wide Detection of Marginal and Interacting Genetic Variations.

Lee S, Lozano A, Kambadur P, Xing E J Comput Biol. 2016; 23(5):372-89.

PMID: 27159633 PMC: 4876555. DOI: 10.1089/cmb.2015.0202.
