Researchers should stop describing results as “statistically significant” simply because they pass an “arbitrary” probability threshold, an influential journal has urged.
An editorial in a special issue of The American Statistician says there should be an end to the practice of using “p-values” to validate the significance of results.
P-values estimate the probability of obtaining a result at least as extreme as the one observed if there were no real effect – the “null hypothesis”. If this probability is less than 5 per cent – a p-value of 0.05 – the result is often deemed statistically significant and sometimes taken as strong evidence that the original hypothesis is true.
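The threshold logic described above can be sketched with a simple worked example (not from the article – a minimal illustration using a hypothetical coin-flip experiment): the p-value is the probability, assuming the null hypothesis, of a result at least as extreme as the one observed.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k successes in n trials under success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical experiment: 60 heads in 100 flips.
# Null hypothesis: the coin is fair (p = 0.5).
# Two-sided p-value: probability, under the null, of a count at least
# as far from the expected 50 heads as the observed 60.
n, observed = 100, 60
p_value = sum(
    binom_pmf(k, n)
    for k in range(n + 1)
    if abs(k - n / 2) >= abs(observed - n / 2)
)
print(round(p_value, 4))  # about 0.057 – just above the conventional 0.05 cutoff
```

Note the arbitrariness the critics highlight: this result narrowly misses the 0.05 threshold, while 61 heads would clear it, even though the underlying evidence is nearly identical.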
However, critics have increasingly been taking issue with such an approach, arguing that statistical significance is not the same as conclusive proof.
The issue goes to the heart of the debate on the reproducibility of research, with concerns that, as well as statistical significance being misinterpreted, some scholars are even using p-values to trawl for any results that pass the threshold – a practice known as “p-hacking”.
In the special issue of The American Statistician – “Statistical Inference in the 21st Century: A World Beyond P<0.05” – dozens of academics explore the issues surrounding the use of p-values and how researchers should properly interpret scientific results.
Writing in the issue’s editorial, statisticians including Ronald Wasserstein, executive director of the American Statistical Association, and Nicole Lazar, professor of statistics at the University of Georgia, say they have concluded that “it is time to stop using the term ‘statistically significant’ entirely” because it has become “meaningless”.
“No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical non-significance lead to the association or effect being improbable, absent, false, or unimportant,” they write.
“For the integrity of scientific publishing and research dissemination, therefore, whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight.”
Instead, the authors point to approaches suggested in many of the 43 papers published in the special issue, including properly setting out the context of any research, being honest about the limitations of a statistical analysis and using other methods that can be “complementary” to p-values.
The authors of the editorial accept that the scientific community may be unlikely to converge on one “simple paradigm” for testing statistics and indeed “may never do so”, but they add that “solid principles for the use of statistics do exist, and they are well explained in this special issue”.
As well as focusing on the use of p-values, articles in the issue also criticise the incentives embedded in current scientific culture – such as the assessment of academics’ performance using metrics – which many scholars believe are behind the incorrect use of statistics.
In one paper, David Colquhoun, emeritus professor of pharmacology at UCL, says that “in the end, the only way to solve the problem of reproducibility is to do more replication and to reduce the incentives that are imposed on scientists to produce unreliable work. The publish-or-perish culture has damaged science, as has the judgment of their work by silly metrics.”