Table of Contents
- The “Bright Line” Problem: Why One Number Became a Gatekeeper
- Quick Refresher: What a P-value Is (and What It Absolutely Is Not)
- Why 0.05 Became the Default (and Why It’s Not a Law of Physics)
- The Case for 0.005: “Raise the Bar” to Reduce False Positives
- The Case Against 0.005: The Cure Can Have Side Effects
- The Real Villain: “Statistically Significant” as a Final Verdict
- A Better Approach: Treat Evidence Like a Spectrum (Because It Is)
- 1) Report effect sizes and confidence intervals (not just p-values)
- 2) Pre-specify primary outcomes and analysis plans
- 3) Adjust for multiple comparisons when you test many things
- 4) Use “suggestive,” “moderate,” and “strong” evidence language carefully
- 5) Make replication part of the story, not a scandal
- Concrete Example: When “Significant” and “Useful” Are Not the Same
- So… 0.05 or 0.005?
- Conclusion: Stop Asking P-values to Do a Job They Were Never Hired For
- Field Notes: Experiences Researchers Share from the 0.05 vs 0.005 Trenches
There are few things in science as powerful as a tiny decimal point. Put it in the right place and you get funding, headlines,
tenure, and a victory lap on social media. Put it in the wrong place and you get, “Interesting… but not significant.”
Welcome to the long-running debate over whether the traditional p-value threshold should stay at 0.05
or be tightened to 0.005.
If this sounds like statisticians arguing about the correct number of angels that can dance on the head of a pin, think again.
The 0.05 vs 0.005 question affects what gets published, what gets replicated, what gets translated into policy,
and what quietly disappears into the “we tried” folder. And it matters because the real world has a nasty habit of not caring
whether your results were “statistically significant”; it cares whether they were true and useful.
The “Bright Line” Problem: Why One Number Became a Gatekeeper
In many fields, p < 0.05 became a kind of scientific bouncer: you’re either on the list (publishable)
or you’re not (good luck with Reviewer 2). That bright line is convenient, easy to teach, and easy to misuse.
It encourages a world where p = 0.049 is treated like a triumph and p = 0.051 is treated like a shrug,
even if the two studies are basically identical twins wearing different hats.
The pushback isn’t new. Professional statisticians have repeatedly emphasized that p-values are widely misunderstood and that
research shouldn’t be reduced to “significant / not significant.” But convenience is a powerful drug, and 0.05 has had a long
reign as the default “proof-ish” number.
Quick Refresher: What a P-value Is (and What It Absolutely Is Not)
A p-value is calculated under a statistical model (often a “null hypothesis” model) and answers a specific question:
If the null model were true, how surprising would my data (or something more extreme) be?
Small p-values suggest the data are less compatible with the null model. That’s it. That’s the whole job description.
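To make that job description concrete, here’s a minimal sketch in Python (using NumPy and SciPy; the group sizes and the 0.3-SD effect are illustrative assumptions, not from any study):

```python
# Minimal sketch: a p-value is a tail probability computed under a null model.
# Here the null model is "both groups share the same mean" in a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=0.0, scale=1.0, size=50)
treated = rng.normal(loc=0.3, scale=1.0, size=50)  # assumed true effect: 0.3 SD

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# The p-value answers: "If the group means were truly equal, how often would we
# see a t-statistic at least this extreme?" Nothing more, nothing less.
```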
Common Misinterpretations That Won’t Die
- “p = 0.03 means there’s a 3% chance the null is true.”
  Nope. A p-value is not the probability a hypothesis is true or false.
- “p < 0.05 means the result is important.”
  Nope. Statistical significance is not the same as practical or clinical significance.
- “p > 0.05 means there’s no effect.”
  Nope. “Not significant” can mean low power, noisy data, or an effect that’s smaller than your study could reliably detect.
These misunderstandings matter because they turn a tool for measuring model-data compatibility into a shortcut for truth.
And shortcuts are how you end up confidently wrong with a statistically significant smile.
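That third misinterpretation is cheap to demonstrate. The sketch below uses illustrative numbers (a real 0.3-SD effect, 30 participants per arm) to show how often an underpowered study comes back “not significant” even though the effect exists:

```python
# Hedged illustration: "p > 0.05" often just means low power, not "no effect".
# We simulate a real but modest effect and count how often an honest test
# fails to reach p < 0.05 anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, n_sims = 30, 0.3, 10_000
misses = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)  # the effect is real in every simulation
    _, p = stats.ttest_ind(b, a)
    if p >= 0.05:
        misses += 1

print(f"Real effect missed in {misses / n_sims:.0%} of studies")
# With n = 30 per arm, close to 80% of these studies are "not significant"
# even though the effect exists. Low power, not absence of effect.
```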
Why 0.05 Became the Default (and Why It’s Not a Law of Physics)
The 0.05 convention is partly historical. It’s been popularized for about a century as a workable default for controlling
Type I error (false positives) in many common testing setups. But “common” does not mean “correct for every scenario.”
A genome-wide association study, a clinical trial with one primary endpoint, and a psychology experiment with many outcomes
do not share the same risk profile even if they all share the same decimal point.
The uncomfortable truth: 0.05 is not magic. It’s a convention. And conventions can be updated when the costs
of following them become obvious.
The Case for 0.005: “Raise the Bar” to Reduce False Positives
The strongest argument for tightening the default threshold is simple: p < 0.05 is often weak evidence,
especially when you consider publication bias, flexibility in analysis, and the reality that many hypotheses tested in
exploratory research are wrong. In that world, letting “significance” start at 0.05 can lead to a lot of false alarms.
Advocates of 0.005 argue that a stricter cutoff for claiming a “new discovery” would reduce the rate of false
positives and improve reproducibility. In many proposals, results between 0.005 and 0.05 wouldn’t be tossed in the trash;
they’d be treated as suggestive evidence rather than a mic-drop conclusion.
A Simple Illustration: Why 0.005 Can Be Costly (But Also Clarifying)
Tightening the threshold doesn’t just change labels. It changes design. If you want to keep the same power (say, 80%)
while lowering alpha, you usually need more data.
Here’s a simplified back-of-the-envelope example used in many sample-size discussions. For a two-sided test, the critical z-value
is roughly 1.96 for alpha = 0.05 and 2.81 for alpha = 0.005. If you aim for 80% power,
z for beta is about 0.84. Sample size is proportional to:
(z_alpha + z_beta)²
So the ratio is approximately:
(2.81 + 0.84)² / (1.96 + 0.84)² = 3.65² / 2.80² ≈ 1.70
Translation: in many straightforward settings, moving from 0.05 to 0.005 can require roughly 70% more participants
to keep power similar. That’s not always possible, but it does force an honest conversation about what you’re willing to claim
given the evidence you can actually afford to collect.
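If you want to check that ratio yourself, here’s a small sketch using SciPy’s normal quantile function, under the same two-sided, 80%-power assumptions as above:

```python
# Back-of-the-envelope check of the ~70% figure, assuming a two-sided test
# and the standard normal approximation used above.
from scipy.stats import norm

def z_crit(alpha: float) -> float:
    """Two-sided critical z-value for a given alpha."""
    return norm.ppf(1 - alpha / 2)

z_beta = norm.ppf(0.80)  # 80% power -> z_beta ~= 0.84

ratio = (z_crit(0.005) + z_beta) ** 2 / (z_crit(0.05) + z_beta) ** 2
print(f"Sample size ratio: {ratio:.2f}")  # ~1.70, i.e. ~70% more participants
```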
The Case Against 0.005: The Cure Can Have Side Effects
Critics aren’t defending 0.05 because they’re emotionally attached to that particular number (although, sure, some people have
matching “p<0.05” mugs). They worry that a stricter threshold can create new problems:
1) More false negatives, fewer real findings
If you raise the bar without increasing sample sizes, you’ll miss effects that are real but modest, especially in fields with
inherently noisy measurements or hard-to-recruit populations. That can slow progress or bias research toward only “big, easy”
effects.
2) Bigger inequities between “rich data” and “poor data” science
Some disciplines can scale sample sizes cheaply (online experiments, certain administrative datasets). Others can’t
(rare diseases, specialized lab experiments, longitudinal fieldwork). A universal shift to 0.005 risks privileging the fields
with inexpensive data and punishing the ones where data collection is costly or ethically constrained.
3) A new magic number doesn’t fix the old habits
A stricter threshold can still be gamed. If researchers are incentivized to chase a cutoff, they may simply chase
0.005 instead of 0.05. You don’t eliminate questionable research practices by changing the goalpost;
you eliminate them by changing incentives and improving transparency.
The Real Villain: “Statistically Significant” as a Final Verdict
Many statisticians argue the bigger issue is not whether the threshold is 0.05 or 0.005; it’s the idea that
any single threshold should function like a courtroom verdict. Evidence is not binary. Uncertainty isn’t a light switch.
That’s why you’ll see strong arguments for moving beyond “statistical significance” language altogether and focusing instead on:
effect sizes, uncertainty intervals, study design quality, prior evidence, and how results behave across replications.
A Better Approach: Treat Evidence Like a Spectrum (Because It Is)
If you’re tired of the threshold wars, good. That means you’re ready for the practical middle path: one that improves decisions
whether you personally love 0.05, prefer 0.005, or want to yeet the whole concept into the sun.
1) Report effect sizes and confidence intervals (not just p-values)
A p-value can’t tell you how large or meaningful an effect is. Always pair p-values with effect sizes and interval estimates
so readers can judge practical significance. A tiny effect with p < 0.001 can still be trivial, especially in huge datasets.
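As a hedged illustration, here’s one way to report all three numbers together in Python; the simulated data and the normal-approximation interval are assumptions for the sketch, not a prescribed method:

```python
# Sketch of reporting beyond the p-value: effect size (Cohen's d) plus a
# 95% confidence interval for the mean difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(0.2, 1.0, 200)  # assumed true effect: 0.2 SD

diff = b.mean() - a.mean()
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

# Normal-approximation 95% CI for the difference in means
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
ci = (diff - 1.96 * se, diff + 1.96 * se)

_, p = stats.ttest_ind(b, a)
print(f"p = {p:.3f}, d = {cohens_d:.2f}, "
      f"95% CI for difference: ({ci[0]:.2f}, {ci[1]:.2f})")
# The p-value alone would hide whether d = 0.02 or d = 0.8.
```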
2) Pre-specify primary outcomes and analysis plans
One reason false positives thrive is analytic flexibility: trying multiple models, outcomes, or subgroups and reporting the
best-looking result. Pre-specifying the plan (and sticking to it) reduces “researcher degrees of freedom” and makes your
inferential claims more credible.
3) Adjust for multiple comparisons when you test many things
If you test 20 independent hypotheses at alpha = 0.05, you should expect about one false positive by chance alone.
That’s not a scandal; it’s math. Use appropriate corrections (or false discovery rate approaches) when the design requires it.
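A quick simulation makes the arithmetic tangible. The sketch below runs 20 tests where every null is true, then applies a Benjamini-Hochberg correction via statsmodels (one standard choice among several):

```python
# Sketch of the "20 tests, ~1 false positive" arithmetic, plus a standard fix.
# All 20 null hypotheses are true here by construction.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
pvals = [stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue
         for _ in range(20)]

print("Uncorrected 'hits':", sum(p < 0.05 for p in pvals))  # ~1 expected
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("After FDR correction:", reject.sum())  # usually 0
```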
4) Use “suggestive,” “moderate,” and “strong” evidence language carefully
Some proposals recommend classifying results rather than declaring victory. For example:
p < 0.005 as strong evidence for new claims, p between 0.005 and 0.05 as suggestive, and p > 0.05 as weak evidence given the model.
This can be more honest than pretending there’s a cliff at 0.05.
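If you wanted to operationalize that kind of tiered language, it could be as simple as the sketch below; the cutoffs and phrasings are illustrative, not a standard:

```python
# Minimal sketch of the tiered-language proposal described above.
def evidence_label(p: float) -> str:
    """Map a p-value to hedged evidence language instead of a binary verdict."""
    if p < 0.005:
        return "strong evidence against the null (for new claims)"
    if p < 0.05:
        return "suggestive evidence; worth follow-up, not a mic drop"
    return "weak evidence against the null, given this model and power"

for p in (0.001, 0.03, 0.2):
    print(f"p = {p}: {evidence_label(p)}")
```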
5) Make replication part of the story, not a scandal
A single study rarely settles anything. Replication and triangulation across methods are what turn a statistical finding into
a reliable scientific claim. Building systems that reward replication can do more for truth than changing a threshold ever will.
Concrete Example: When “Significant” and “Useful” Are Not the Same
Imagine you run an A/B test for a new app onboarding flow on 2 million users. You find a statistically significant improvement
in click-through rate: +0.08%, p < 0.001.
Is the new flow better? Maybe. Is it meaningfully better? That depends on costs, user experience, long-term retention,
and whether the change matters to real business or human outcomes. With huge samples, tiny differences become statistically
detectable. That’s why practical significance must be evaluated alongside statistical evidence.
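For concreteness, here’s a hedged reconstruction of that A/B test with statsmodels; the 2% baseline click-through rate is an assumption chosen so a +0.08 percentage-point lift on a million users per arm clears p < 0.001, as in the example:

```python
# Hedged reconstruction of the A/B example; the baseline rate is assumed.
from statsmodels.stats.proportion import proportions_ztest

n = 1_000_000       # users per arm (2M total, as in the example)
clicks_old = 20_000  # 2.00% baseline click-through (assumed rate)
clicks_new = 20_800  # 2.08%: a +0.08 percentage-point lift

z, p = proportions_ztest([clicks_new, clicks_old], [n, n])
print(f"z = {z:.2f}, p = {p:.5f}")  # z ~= 4.0, p < 0.001
# Statistically detectable, yes. Whether +0.08 points is worth shipping is a
# product question the p-value cannot answer.
```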
Flip the script: a small clinical trial finds a potentially life-improving change with p = 0.06 and a wide confidence interval.
Is that “nothing”? Not necessarily. It may be a strong signal hampered by limited power: something that deserves follow-up,
not dismissal.
So… 0.05 or 0.005?
If you want the most honest answer: it depends on context, prior evidence, consequences of being wrong,
feasibility of collecting more data, and how much analytic flexibility exists in your pipeline.
For high-stakes “new discovery” claims where false positives are costly and replication is rare, a stricter evidentiary bar
(whether 0.005 or something else) can make sense. For exploratory work, early-phase research, or rare-population studies,
forcing everything through a 0.005 filter can be counterproductive.
The goal isn’t to worship a number. The goal is to align statistical practice with reality: uncertainty, tradeoffs, and the
messy business of learning from data.
Conclusion: Stop Asking P-values to Do a Job They Were Never Hired For
The p-value is a tool: not a truth machine, not a “publish” button, not a morality test for researchers. The ongoing debate
over 0.05 vs 0.005 is really a debate about scientific culture: incentives, reproducibility, and how we communicate uncertainty.
If the “p-value wars” have a silver lining, it’s this: more people are finally asking better questions. Not just
“Is it significant?” but “How big is the effect?”, “How certain are we?”, “What would replication likely show?”, and
“What are the consequences if we’re wrong?”
That’s the kind of progress you can’t summarize with a single decimal point, and that’s a very good thing.
Field Notes: Experiences Researchers Share from the 0.05 vs 0.005 Trenches
If you’ve ever lived through a “p-value moment,” you already know the feeling: your analysis finishes running, you squint at
the output, and the number stares back like a movie villain. Maybe it’s 0.049 and you feel relief, not because the effect
suddenly became real, but because you can already hear the journal editor saying “acceptable.” Or maybe it’s 0.051 and your
stomach sinks, even though nothing meaningful changed between those two universes. That emotional whiplash is exactly why the
threshold debate refuses to die: it’s not just technical, it’s human.
One common experience in many labs is the “analysis garden path.” A team starts with a clean hypothesis, then reality intrudes:
an outlier looks suspicious, one subgroup behaves strangely, a covariate “should probably be included,” a transformation makes
residuals look nicer, and suddenly you’ve wandered from a straightforward plan into a maze of reasonable-sounding choices.
None of those choices is automatically wrong, but each one quietly increases the chance of stumbling onto a flattering p-value.
Later, someone tries to replicate the study and can’t reproduce the original magic because the original analysis contained a
dozen tiny forks in the road that were never fully documented. If you’ve seen this once, you’ve seen it a hundred times.
It’s not fraud; it’s flexibility plus incentives.
Another familiar story shows up in big-data environments: the “everything is significant” problem. Analysts run models on huge
datasets and discover that practically every coefficient is statistically significant. The temptation is to treat significance
like importance, but the day-to-day experience teaches the opposite. The big wins usually come from effect sizes and stability:
What changes decisions? What survives reasonable alternate specifications? What holds up over time? In these settings, a switch
from 0.05 to 0.005 might change some labels, but it doesn’t solve the deeper issue that giant samples can make tiny, irrelevant
effects look impressive. The only real cure is judgment, context, and better reporting.
Clinical and high-stakes research brings a different kind of lived experience: power constraints. Teams working with rare
conditions or difficult recruitment often face an uncomfortable choice: either run the study with limited sample sizes and accept
that p-values will be noisy, or wait years (or forever) to get the “perfect” dataset. In those circles, the 0.005 proposal can
feel like a luxury policy written for fields with cheap participants and endless budgets. Yet those same teams also know the pain
of false positives: a promising early result that doesn’t replicate can waste resources and, in the worst cases, mislead clinical
decisions. Their experience often leads to a pragmatic stance: tighten standards for decisive claims, but don’t treat “not below
the threshold” as the same thing as “no signal.”
Across disciplines, the most productive “war stories” tend to end with process upgrades rather than number worship.
Teams adopt preregistration for confirmatory analyses, label exploratory work honestly, and shift discussions toward effect sizes,
uncertainty, and replication plans. Reviewers and editors who focus on clarity (“What was planned?” “What changed?” “How robust is
this?”) often get better science than those who focus on the sacred cutoff. In practice, the best researchers learn to treat
p-values as one clue among many, like a smoke alarm that’s useful when it beeps, but not a substitute for checking whether your
kitchen is actually on fire.
And maybe that’s the most relatable takeaway from the trenches: the debate isn’t really about whether 0.05 is “wrong” or 0.005
is “right.” It’s about escaping the idea that a single number can shoulder the entire burden of scientific proof. Once you’ve
watched a borderline p-value reshape a narrative, rewrite an abstract, or decide the fate of a project, it becomes obvious that
what science needs most is not a new bright line but better habits, better incentives, and a healthier relationship with
uncertainty.
