A Bayesian Approach to Efficient Differential Allocation for Resampling-based Significance Testing

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2009 Jun 30

PMID 19558706

Citations 3

Authors

Shane T Jensen

Sameer Soi

Li-San Wang

Abstract

Background: Large-scale statistical analyses have become hallmarks of biological research in the post-genomic era, owing to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption about the null distribution, such as normality, may be unreasonable, and resampling-based p-values are the preferred means of establishing statistical significance. However, resampling-based procedures for multiple testing are computationally intensive and typically require large numbers of resamples.
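To make the notion of a resampling-based p-value concrete, the following is a minimal sketch of a two-sided permutation test for a difference in group means. The function name `permutation_p_value`, its parameters, and the choice of test statistic are illustrative assumptions and do not come from the paper.

```python
import random


def permutation_p_value(group_a, group_b, num_resamples=1000, seed=0):
    """Estimate a two-sided permutation p-value for a difference in means.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def abs_mean_diff(values):
        # Split the pooled values back into two groups of the original sizes.
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    # The pooled list is still in its original order here.
    observed = abs_mean_diff(pooled)

    exceed = 0
    for _ in range(num_resamples):
        rng.shuffle(pooled)  # random relabelling of samples under the null
        if abs_mean_diff(pooled) >= observed:
            exceed += 1

    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (num_resamples + 1)
```

Each call spends `num_resamples` permutations on a single test, which is why the cost grows quickly when thousands of tests each receive the same number of resamples.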

Results: We present a new approach that assigns resamples (such as bootstrap samples or permutations) more efficiently within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running-time overhead. In two experimental studies, one on a breast cancer microarray dataset and one on a genome-wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate than the traditional uniform resample allocation.
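The general idea of differential allocation can be sketched as follows. This is not the authors' algorithm, whose details the abstract does not give. It is a simplified heuristic under stated assumptions: each test keeps a Beta(1 + b, 1 + n - b) posterior over its p-value (b exceedances in n resamples), and each new batch of resamples is distributed in proportion to a normal-approximation estimate of how plausibly that p-value straddles the threshold `alpha`. All names (`allocation_weights`, `differential_allocation`, `draw_exceeds`) and default values are hypothetical.

```python
import math
import random


def allocation_weights(exceed, drawn, alpha):
    """Heuristic weight per test: how plausibly its p-value straddles alpha."""
    weights = []
    for b, n in zip(exceed, drawn):
        mean = (b + 1) / (n + 2)  # mean of the Beta(1 + b, 1 + n - b) posterior
        sd = math.sqrt(mean * (1 - mean) / (n + 3))  # its standard deviation
        z = abs(mean - alpha) / sd
        # Normal approximation to the posterior (an assumption of this sketch);
        # the small floor keeps the total weight positive.
        weights.append(max(math.erfc(z / math.sqrt(2)), 1e-12))
    return weights


def differential_allocation(draw_exceeds, num_tests, total_budget,
                            alpha=0.05, initial=50, batch=200, seed=0):
    """Spend total_budget resamples unevenly across num_tests tests.

    draw_exceeds(i, rng) must perform one resample for test i and return
    True (or 1) if the resampled statistic is at least the observed one.
    """
    if total_budget < initial * num_tests:
        raise ValueError("budget too small for the initial uniform round")
    rng = random.Random(seed)
    exceed = [0] * num_tests
    drawn = [0] * num_tests

    def spend(i, count):
        for _ in range(count):
            exceed[i] += draw_exceeds(i, rng)
        drawn[i] += count

    # Initial uniform round so every test has a usable posterior.
    for i in range(num_tests):
        spend(i, initial)

    remaining = total_budget - initial * num_tests
    while remaining > 0:
        step = min(batch, remaining)
        weights = allocation_weights(exceed, drawn, alpha)
        # Assign this batch at random in proportion to the weights.
        for i in rng.choices(range(num_tests), weights=weights, k=step):
            spend(i, 1)
        remaining -= step

    p_hat = [(b + 1) / (n + 1) for b, n in zip(exceed, drawn)]
    return p_hat, drawn


# Toy usage: each resample is simulated as a Bernoulli draw with a known
# exceedance probability, standing in for a real permutation or bootstrap.
true_p = [0.5, 0.3, 0.06, 0.04, 0.001]
p_hat, drawn = differential_allocation(
    lambda i, rng: rng.random() < true_p[i], len(true_p), total_budget=5000)
```

In this toy setting, tests whose p-values lie close to `alpha` receive most of the budget, while clearly non-significant tests stop receiving resamples soon after the initial uniform round. The only state kept per test is two counters, in keeping with the abstract's description of negligible space overhead.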

Conclusion: Our experiments demonstrate that a more sophisticated allocation strategy can improve inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, the gain in efficiency is greater when the number of tests is large. R code for our algorithm and the shortcut method is available at http://people.pcbi.upenn.edu/~lswang/pub/bmc2009/.

Citing Articles

Assessing differential expression in two-color microarrays: a resampling-based empirical Bayes approach.

Li D, Le Pape M, Parikh N, Chen W, Dye T. PLoS One. 2013; 8(11):e80099.

PMID: 24312198 PMC: 3842292. DOI: 10.1371/journal.pone.0080099.

Analysis of Correlated Gene Expression Data on Ordered Categories.

Peddada S, Harris S, Davidov O. J Indian Soc Agric Stat. 2011; 64(1):45-60.

PMID: 21998487 PMC: 3190572.

FastPval: a fast and memory efficient program to calculate very low P-values from empirical distribution.

Li M, Sham P, Wang J. Bioinformatics. 2010; 26(22):2897-9.

PMID: 20861029 PMC: 2971576. DOI: 10.1093/bioinformatics/btq540.
