The Earth is Flat (p > 0.05): Significance Thresholds and the Crisis of Unreplicable Research

Overview

Journal: PeerJ

Specialties: Biology, Environmental Health, General Medicine

Date: 2017 Jul 13

PMID: 28698825

Citations: 81

Authors

Valentin Amrhein

Franzi Korner-Nievergelt

Tobias Roth

Abstract

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
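
Two of the abstract's quantitative claims can be checked with a few lines of Python. The sketch below is an illustration and not the authors' own analysis. It assumes two independent studies with identical power, a two-sided z-test, and normally distributed effect estimates. The function names and the simulation settings (true effect 1.0, standard error 1.0, 20,000 simulated studies) are hypothetical choices made for this example only.

```python
import random
from statistics import NormalDist, mean


def conflict_probability(power: float) -> float:
    """Chance that exactly one of two independent, equally powered studies
    is significant when a true effect exists: 2 * power * (1 - power)."""
    return 2 * power * (1 - power)


def simulate_significance_filter(true_effect, std_error, n_studies=20_000,
                                 alpha=0.05, seed=1):
    """Simulate normally distributed effect estimates and keep only those
    passing a two-sided z-test at level alpha.

    Returns (share of significant studies, mean of significant estimates).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    significant = [e for e in estimates if abs(e) / std_error >= z_crit]
    return len(significant) / n_studies, mean(significant)


# At 80% power, two studies "conflict" in about one third of cases.
print(round(conflict_probability(0.80), 2))  # 0.32

# Assumed low-power setting: the true effect equals one standard error.
observed_power, mean_significant = simulate_significance_filter(
    true_effect=1.0, std_error=1.0)
print(f"share significant: {observed_power:.2f}")
print(f"mean significant estimate: {mean_significant:.2f} (true effect: 1.0)")
```

Under these assumptions only a minority of the simulated studies reach significance, and the average of those that do lies well above the true effect. This is the upward bias that the abstract attributes to filtering results by a significance threshold.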

Citing Articles

Sex differences in romantic love: an evolutionary perspective.

Bode A, Luoto S, Kavanagh P. Biol Sex Differ. 2025; 16(1):16.

PMID: 39994818 PMC: 11849325. DOI: 10.1186/s13293-025-00698-4.

The null hypothesis significance test and the dichotomization of the p-value: Errare Humanum Est.

Mezones-Holguin E, Al-Kassab-Cordova A, Soto-Becerra P, Hernandez-Diaz S, Kaufman J. Rev Peru Med Exp Salud Publica. 2025; 41(4):422-430.

PMID: 39936767 PMC: 11797584. DOI: 10.17843/rpmesp.2024.414.14285.

Creative music therapy in preterm infants effects cerebrovascular oxygenation and perfusion.

Scholkmann F, Haslbeck F, Oba E, Restin T, Ostojic D, Kleiser S. Sci Rep. 2024; 14(1):28249.

PMID: 39548130 PMC: 11568197. DOI: 10.1038/s41598-024-75282-8.

Estimating the replicability of highly cited clinical research (2004-2018).

da Costa G, Neves K, Amaral O. PLoS One. 2024; 19(8):e0307145.

PMID: 39110675 PMC: 11305584. DOI: 10.1371/journal.pone.0307145.

For a proper use of frequentist inferential statistics in public health.

Rovetta A, Mansournia M, Vitale A. Glob Epidemiol. 2024; 8:100151.

PMID: 39021384 PMC: 11252774. DOI: 10.1016/j.gloepi.2024.100151.
