On the "significance" of the p-value

A single number drives a large share of scientific publication decisions. This article looks at what that number actually measures, why it becomes unreliable at scale, and how to correct for it when you compute many p-values at once.

Javier Boix Campos

June 2026

Suppose you are reading a genomics paper. The authors screened 20,000 genes for differential expression between healthy and diseased tissue, and identified 1,300 genes with p < 0.05. The methods are careful. The conclusions are confident. But here is the calculation the abstract does not include: under the null hypothesis, with every single one of those 20,000 genes truly unaffected by disease, you would still expect 20,000 × 0.05 = 1,000 genes to appear significant by pure chance. The study found 1,300. Only 300 more than what chance alone predicts.

This is not a failure of the scientists involved. It is a failure of interpretation, and it plays out across disciplines every day. The p-value is one of the most widely reported numbers in empirical research, and one of the most persistently misread. Understanding what it actually measures, what drives it below the threshold, and how to correct for running many tests at once turns out to matter a great deal.

What the p-value is (and is not)

The p-value has a precise definition. Given a null hypothesis \(H_0\) and a test statistic \(T\), the p-value is the probability of observing a value of \(T\) at least as extreme as the one measured, assuming the null hypothesis is true:

\[ p = P\!\left(T \geq t_{\text{obs}} \mid H_0\right) \]

Two things follow immediately from this definition. First, p is computed by assuming \(H_0\) is true. It cannot, on its own, tell you the probability that \(H_0\) is true. These two quantities are related by Bayes’ theorem, and translating between them requires a prior probability on \(H_0\) that is almost never reported. Bayes’ theorem gives \(P(H_0 \mid \text{data}) \propto P(\text{data} \mid H_0) \cdot P(H_0)\). A p-value of 0.01 from a test of an a priori improbable hypothesis (say, that a homeopathic dilution affects cancer cell growth) may still leave \(H_0\) as the most probable explanation, because the prior probability of \(H_0\) was so high. Conversely, a p of 0.03 in a well-motivated pharmacological trial against a plausible mechanism is stronger evidence than the number alone suggests. A small p-value means your data would be unusual if \(H_0\) were true. It does not, by itself, mean \(H_0\) is unlikely.
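To make the role of the prior concrete, the sketch below computes the probability that \(H_0\) is true given a significant result. It assumes a simple two-hypothesis screening model that goes beyond the definition above: a prior probability `prior_h1` that the effect is real, a significance level `alpha`, and a power `power` against the alternative. The function name, the parameter names, and the example values are all illustrative assumptions.

```python
def prob_null_given_significant(prior_h1: float, alpha: float, power: float) -> float:
    """P(H0 | significant result) under an assumed two-hypothesis model.

    Assumptions (illustrative): a fraction `prior_h1` of tested hypotheses
    are real effects, real effects cross the threshold with probability
    `power`, and true nulls cross it with probability `alpha`.
    """
    false_pos = alpha * (1.0 - prior_h1)  # true nulls that cross the threshold
    true_pos = power * prior_h1           # real effects that cross the threshold
    return false_pos / (false_pos + true_pos)


# An implausible hypothesis (assumed prior probability 0.001 of being real) ...
implausible = prob_null_given_significant(prior_h1=0.001, alpha=0.01, power=0.8)
# ... versus a well-motivated one (assumed prior probability 0.5).
plausible = prob_null_given_significant(prior_h1=0.5, alpha=0.03, power=0.8)

print(f"P(H0 | significant), implausible hypothesis: {implausible:.2f}")
print(f"P(H0 | significant), plausible hypothesis:   {plausible:.2f}")
```

Under these assumed inputs the null remains by far the more probable explanation in the first case and becomes unlikely in the second. The calculation conditions on crossing the threshold rather than on the exact p-value observed, so it is a rough guide rather than a full Bayesian analysis.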

Second, p depends on both the size of the effect and the size of the sample. This is not obvious from the definition but is critical in practice, and the rest of this article is largely about its consequences.

The damage comes not from misreading the formula but from reducing the result to a binary. Below 0.05, the result is real; above it, there is nothing to discuss. That reflex strips away everything the p-value actually encodes: how far below the threshold, with what sample size, against what prior expectation. A p of 0.049 and a p of 0.0001 are treated as equivalent because both cleared the bar, while a p of 0.06 is dismissed despite being nearly indistinguishable from the first. And because p accumulates evidence with sample size regardless of how small the true effect is, the same underlying difference can produce any p-value you like, given enough data.

The 0.05 threshold that governs so much of scientific publishing is not a law of nature. Ronald Fisher, in his 1925 Statistical Methods for Research Workers, described 0.05 as a convenient rule of thumb for deciding when to investigate further, not as a universal decision criterion. Neyman and Pearson later gave it a more formal basis in terms of long-run error rates, but the threshold remained a pragmatic convention. In 2016, the American Statistical Association issued an explicit statement cautioning against binary “significant / not significant” thinking and calling for more nuanced reporting (Wasserstein & Lazar, 2016). The threshold originated as a convenient convention, was adopted as a publication filter, and became self-reinforcing once journals began treating it as a criterion for interest. The result is a literature that is systematically biased toward studies that cleared a threshold once, at a particular sample size, on a particular day.

The sample size problem

For any fixed, non-zero effect, the p-value will eventually fall below 0.05 as sample size grows. This is not a flaw that appears in exceptional circumstances. It is a mathematical certainty.

In a two-sample comparison, the test statistic z summarises how many standard errors separate the two group means. It scales with the square root of the sample size: for n observations per group and a true standardised difference d between groups (Cohen’s d: the difference in means divided by the pooled standard deviation), z is approximately:

\[ z \approx d \cdot \sqrt{\frac{n}{2}} \]

The p-value is then read off the tail of the standard normal distribution. As n grows, z grows, and p shrinks toward zero regardless of how small d is. The sample size per group at which z is expected to reach the threshold for \(p = 0.05\) follows directly:

\[ n^* \approx 2\left(\frac{z_{\alpha/2}}{d}\right)^2 \]

For \(\alpha = 0.05\), \(z_{\alpha/2} = 1.96\). A medium effect (\(d = 0.5\)) needs roughly 31 observations per group. A tiny effect (\(d = 0.05\)) needs roughly 3,000. The effect size determines how much data is required, not whether significance is eventually reached.
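The two formulas above can be checked numerically. The sketch below uses only Python's standard library; the function names are illustrative, and the p-value is evaluated at the expected z, so it ignores sampling noise in any single experiment.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def expected_z(d: float, n: int) -> float:
    """Approximate z for standardised difference d with n observations per group."""
    return d * sqrt(n / 2)


def two_sided_p(d: float, n: int) -> float:
    """Two-sided p-value read off the standard normal tail at the expected z."""
    return 2 * (1 - STD_NORMAL.cdf(abs(expected_z(d, n))))


def n_star(d: float, alpha: float = 0.05) -> float:
    """Per-group sample size at which the expected z reaches the threshold."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return 2 * (z_crit / d) ** 2


for d in (0.5, 0.2, 0.05):
    n = ceil(n_star(d))
    print(f"d = {d:<4}  n* = {n:>5}  p at n* = {two_sided_p(d, n):.4f}")
```

Because z is only expected to reach the threshold at \(n^*\), an actual experiment of that size would clear \(p = 0.05\) roughly half the time. Sample sizes planned for high power are correspondingly larger.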

Figure 1 plots the p-value as a function of sample size for a fixed effect. The curve’s crossing point with \(p = 0.05\) is the only thing the threshold actually tells you: how much data was needed to clear it. A drug that reduces a biomarker by 0.3% in a trial of 500,000 participants can report p < 0.0001, provided the measurement is not too noisy. The effect is real in the narrow statistical sense. Whether a 0.3% change is biologically or clinically meaningful is a completely separate question, one that p cannot answer.

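[Interactive figure: p-value versus sample size per group for a fixed effect size (shown: d = 0.20), with readouts for the n per group needed for p < 0.05 and the p-values at n = 50 and n = 500.]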

Figure 1. The p-value as a function of sample size for a fixed effect. The dashed line marks \(p = 0.05\). For any non-zero \(d\), the curve inevitably crosses the threshold; the question is only how much data is needed. At d = 0.05, a barely perceptible difference, the crossing point moves past n = 3,000. That is the scale of data collection required before the evidence even clears the conventional threshold, which puts into perspective why large studies so often report highly significant results for effects that resist practical interpretation.

p as a function of n

The dependence of p on sample size is usually treated as a nuisance to be aware of. Gómez-de-Mariscal et al. (2021) propose treating it as information instead. Rather than computing a single p-value from the full dataset, their framework models how p behaves across subsamples of increasing size. The relationship follows an exponential decay:

\[ p(n) = a \cdot e^{-cn} \]

The decay rate c carries the meaningful information: a large c means the effect is detectable in modest subsamples; a small c means only a very large accumulation of data forces the curve below the threshold.

The crossing point with \(p = 0.05\), which Gómez-de-Mariscal et al. call \(n_\alpha\), functions as a natural effect size indicator. A small \(n_\alpha\) means the effect was detectable in a modest experiment. A large \(n_\alpha\) means you needed a vast amount of data to find it, which is worth knowing about any claimed discovery.
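As a rough illustration of how \(n_\alpha\) follows from the decay model, the sketch below fits \(p(n) = a \cdot e^{-cn}\) by ordinary least squares on \(\log p\) and then solves \(a \cdot e^{-cn} = \alpha\) for n. This is not the authors' implementation: the log-linear fit, the function names, and the synthetic noise-free input are all assumptions made for the example.

```python
from math import exp, log


def fit_exponential_decay(ns, ps):
    """Least-squares fit of log p = log a - c * n; returns (a, c)."""
    ys = [log(p) for p in ps]
    n_mean = sum(ns) / len(ns)
    y_mean = sum(ys) / len(ys)
    slope = (sum((n - n_mean) * (y - y_mean) for n, y in zip(ns, ys))
             / sum((n - n_mean) ** 2 for n in ns))
    intercept = y_mean - slope * n_mean
    return exp(intercept), -slope


def n_alpha(a, c, alpha=0.05):
    """Sample size at which a * exp(-c * n) crosses alpha (needs a > alpha, c > 0)."""
    return log(a / alpha) / c


# Synthetic, noise-free p-values standing in for averages over subsamples.
sizes = [20, 50, 100, 200, 400]
p_vals = [0.8 * exp(-0.01 * n) for n in sizes]

a, c = fit_exponential_decay(sizes, p_vals)
print(f"a = {a:.3f}, c = {c:.4f}, n_alpha = {n_alpha(a, c):.0f}")
```

With real data each p(n) would come from repeated random subsamples of size n, and the fit would carry noise. The closed form \(n_\alpha = \ln(a/\alpha)/c\) is simply the decay equation rearranged.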

The figure below shows two studies, both of which eventually reach p < 0.05. Their p-value curves tell entirely different stories.

[Figure: p(n) curves for Study A and Study B, with effect sizes d = 0.65 and d = 0.10 and each study's crossing point nα marked.]

Figure 2. Two p(n) curves. Both will reach p < 0.05, but at very different sample sizes. The crossing point \(n_\alpha\) summarises the strength of the evidence independent of total dataset size.

The multiple testing problem

The genomics example from the opening illustrates a second, compounding problem. When you test many hypotheses simultaneously, the probability of obtaining at least one false positive grows rapidly with the number of tests. For \(m\) independent tests conducted at level \(\alpha\), the probability of at least one false positive across the whole family is:

\[ P(\text{at least one false positive}) = 1 - (1 - \alpha)^m \]

At \(\alpha = 0.05\) and \(m = 100\), this equals 0.994. You are almost certain to find at least one significant result even when no effect is real. At \(m = 1{,}000\), it rounds to 1. In genome-wide association studies, where a million or more genetic variants are tested simultaneously, the standard significance threshold is not 0.05 but \(5 \times 10^{-8}\), precisely because this quantity needs to stay controlled. The threshold \(5 \times 10^{-8}\) corresponds roughly to a Bonferroni correction for one million independent tests at \(\alpha = 0.05\). In practice, SNPs are correlated (linkage disequilibrium), so the effective number of independent tests is lower, but the convention has become standard in the field.

The probability computed above is the family-wise error rate (FWER): the probability of making at least one false positive across a family of tests. Controlling the single-test \(\alpha\) provides no protection for the FWER once you run many tests.
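The family-wise formula above is easy to evaluate directly. The short sketch below does so for a few values of m; the function name is illustrative, and independence between tests is assumed, as in the formula.

```python
def fwer(alpha: float, m: int) -> float:
    """Probability of at least one false positive in m independent null tests."""
    return 1 - (1 - alpha) ** m


for m in (1, 10, 100, 1000):
    print(f"m = {m:>4}: FWER = {fwer(0.05, m):.3f}")

# Per-test threshold that an even Bonferroni split gives for one million tests.
print(f"threshold for 1,000,000 tests: {0.05 / 1_000_000:.0e}")
```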

The figure below makes this concrete. All 100 hypotheses are null. There are no real effects. The highlighted squares represent tests that happened to return p < \(\alpha\) by chance.

[Interactive figure: grid of 100 tests at α = 0.05, with readouts for the number flagged significant and the number expected under the null (5.0).]

Figure 3. One hundred independent tests, all under the null hypothesis. Highlighted squares indicate p < \(\alpha\). Every highlighted square is a false positive. Each fresh draw of p-values gives a different count, averaging around \(100 \times \alpha\).

The false positives are not signs of a faulty analysis. Their expected count is exactly \(100 \times \alpha\), a direct consequence of applying the threshold repeatedly. Raising \(\alpha\) to 0.10 doubles that count just as predictably.
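The experiment in the figure can be reproduced in a few lines. The sketch below assumes that p-values under a true null are uniformly distributed on [0, 1), which holds for continuous test statistics. The seed and the number of repetitions are arbitrary choices.

```python
import random


def count_false_positives(m: int, alpha: float, rng: random.Random) -> int:
    """Draw m p-values under the null (uniform on [0, 1)) and count p < alpha."""
    return sum(rng.random() < alpha for _ in range(m))


rng = random.Random(2026)  # arbitrary seed, for reproducibility only
counts = [count_false_positives(100, 0.05, rng) for _ in range(2000)]

print("first five draws:", counts[:5])
print("mean count over 2000 draws:", sum(counts) / len(counts))
```

Individual draws vary noticeably, but the mean over many draws settles close to \(100 \times \alpha = 5\).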

Corrections: Bonferroni and Benjamini-Hochberg

Two corrections dominate practice. They differ in what they control and what they cost.

Bonferroni correction. The simplest approach: declare a test significant only if \(p < \alpha / m\), where \(m\) is the number of tests. With 20 tests and \(\alpha = 0.05\), the threshold becomes 0.0025. This controls the FWER directly: the probability of any false positive across the full family stays below \(\alpha\). The cost is power. Bonferroni is conservative, especially when tests are correlated, and will miss real effects that a less stringent procedure would catch.
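A minimal sketch of the Bonferroni rule follows. The function name and the twenty p-values are hypothetical, chosen only to show the threshold at work.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one flag per test: True where p < alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical p-values from 20 tests.
p_values = [0.0004, 0.0019, 0.0030, 0.012, 0.049] + [0.2] * 15
flags = bonferroni_reject(p_values)

print("per-test threshold:", 0.05 / len(p_values))
print("tests surviving:", sum(flags))
```

Only the two smallest p-values fall below 0.0025. The test with p = 0.0030 would count as significant on its own but does not survive the correction.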

Benjamini-Hochberg (BH) procedure. A less conservative alternative that controls the false discovery rate (FDR): the expected fraction of discoveries that are false, rather than the probability of at least one false positive. FDR control, introduced by Benjamini and Hochberg (1995), is a weaker guarantee than FWER control. If BH controls the FDR at level \(q\), then among all rejected hypotheses, on average no more than a fraction \(q\) are false positives. In a genomics screen returning 100 hits at \(q = 0.05\), you expect about 5 to be false. This is more tolerable than demanding zero false positives when searching for novel biology. The procedure itself is short, with \(\alpha\) playing the role of the FDR level \(q\). Sort p-values in ascending order as \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\). Find the largest rank \(k\) for which:

\[ p_{(k)} \leq \frac{k}{m} \cdot \alpha \]

Reject all hypotheses with ranks 1 through \(k\). The procedure allows the significance threshold to rise with rank, which recovers tests that Bonferroni would discard.
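The step-up rule described above can be sketched as follows. The function name and the p-values are hypothetical; the example is built so that one test sits above the BH line yet is still rejected, because a higher-ranked test falls on or below it.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one flag per test (in input order): True where BH rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank  # keep the largest rank that satisfies the inequality
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True  # reject every hypothesis ranked 1 through k
    return rejected


# Hypothetical p-values from 20 tests, deliberately unsorted.
p_values = [0.2] * 15 + [0.049, 0.0019, 0.0095, 0.0004, 0.0080]
rejected = benjamini_hochberg(p_values)

print("tests surviving BH:", sum(rejected))
```

Here the third-smallest p-value (0.0080) exceeds its own threshold of \(3 \cdot 0.05 / 20 = 0.0075\), but the fourth (0.0095) is below \(4 \cdot 0.05 / 20 = 0.01\), so \(k = 4\) and all four are rejected. A Bonferroni threshold of 0.0025 would keep only two of them.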

The figure below shows a simulated experiment with 20 tests: 5 with real effects and 15 under the null. P-values are plotted against their rank. The BH line rises from the origin at slope \(\alpha/m\). The highest-ranked point on or below the line, together with every point ranked before it, survives BH. The Bonferroni threshold is a horizontal line at \(\alpha/m\).

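[Interactive figure: ranked p-values under a selectable correction method at α = 0.05, with readouts for tests surviving, true signal recovered, and false positives. Legend: signal, survives; noise, survives (false positive); signal, removed (missed); noise, removed (correct).]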

Figure 4. Twenty ranked p-values: 5 from real effects (closed circles) and 15 from null tests (open circles). The diagonal BH line rises at slope \(\alpha/m\); the Bonferroni threshold is a fixed horizontal line. Which tests survive depends on the correction method. At \(\alpha = 0.05\), Bonferroni recovers 2 of 5 signals; BH recovers 4 of 5, with no false positives.

Which correction to use. Bonferroni is appropriate when any false positive carries a high cost: safety endpoints in confirmatory trials, or regulatory submissions where a single spurious finding has consequences. Benjamini-Hochberg is the standard in discovery settings, particularly in genomics and proteomics, where a controlled false discovery rate is acceptable and excessive conservatism means missing biology. The right choice is not statistical but scientific: it depends on the relative cost of a false positive versus a missed true positive in your specific context.

What to report alongside

The p-value answers one question: how inconsistent are these data with the null? Three additional numbers are needed to make that answer interpretable.

Effect size. Report a measure of how large the difference actually is. In a comparison of means, Cohen’s \(d\) standardises the difference by the pooled spread. In gene expression, fold change is more natural. In clinical outcomes, an odds ratio or number needed to treat connects the finding to practical decisions. A p-value alone, without an effect size, tells you almost nothing about whether a finding matters.
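For a comparison of means, Cohen's \(d\) takes only a few lines. The sketch below uses the pooled sample standard deviation; the function name and the two small samples are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical measurements for two groups.
treated = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
print(f"Cohen's d = {cohens_d(treated, control):.2f}")
```

Both groups have a sample variance of 4, so the pooled standard deviation is 2 and a one-unit difference in means gives \(d = 0.5\).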

Confidence interval. A 95% confidence interval gives a range of effect sizes consistent with your data at the chosen level. The confidence interval is widely misread as “there is a 95% probability the true value lies in this range.” The correct frequentist interpretation is: if the experiment were repeated many times under identical conditions, 95% of the constructed intervals would contain the true parameter. In practice, reporting the interval is far more informative than reporting p alone, because it makes both the direction and magnitude of the effect visible. An interval that barely excludes zero while spanning nearly an order of magnitude is a much weaker finding than one that lies far from zero and is tightly bounded.
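As a minimal sketch, the interval for a difference in means can be computed with a normal approximation. This is an assumption made to stay within the standard library; for samples as small as the hypothetical ones below, a t-based interval would be somewhat wider and is usually preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(group_a, group_b, level=0.95):
    """Normal-approximation confidence interval for mean(a) - mean(b)."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level = 0.95
    return diff - z * se, diff + z * se


# Hypothetical measurements for two groups.
treated = [5.1, 6.3, 5.8, 6.9, 6.1, 5.5, 6.4, 6.0]
control = [4.9, 5.2, 5.6, 5.0, 5.4, 4.7, 5.3, 5.1]
low, high = mean_difference_ci(treated, control)

diff = mean(treated) - mean(control)
print(f"difference = {diff:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the pair (`low`, `high`) alongside the point estimate shows at a glance how far the interval sits from zero and how wide it is, which a bare p-value hides.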

The sample size context. For large datasets, consider reporting \(n_\alpha\): the minimum sample size at which the effect first reached statistical significance in the subsampling analysis of Gómez-de-Mariscal et al. (2021). An \(n_\alpha\) of 25 communicates that the effect is robust and replicable in small experiments. An \(n_\alpha\) of 50,000 communicates that the effect required an enormous accumulation of data to appear at all, and should be interpreted cautiously regardless of the final p-value. The code for this analysis is openly available at github.com/BIIG-UC3M/pMoSS.

A number in context

The p-value is a useful instrument with a narrow range. It answers, well and honestly, a single question: how often would data this extreme arise if the null hypothesis were true? It cannot tell you how large the effect is, whether it matters in practice, or what happens when you run the same analysis under slightly different conditions. Used together with effect sizes, confidence intervals, and corrections for multiplicity, it remains one of the most practical tools in empirical research.

The word “significance” carries weight that the number does not earn on its own. Statistical significance means inconsistency with a specific null at a specific threshold. Scientific significance means the finding is large enough, robust enough, and relevant enough to matter for understanding or action. The first is a gate. The second is the point.

Acknowledgements

The author gratefully acknowledges Estibaliz Gómez de Mariscal for her careful review and verification of the technical content of this article.

References

  1. Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.
  2. Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231, 289–337. doi:10.1098/rsta.1933.0009
  3. Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133. doi:10.1080/00031305.2016.1154108
  4. Gómez-de-Mariscal, E., Guerrero, V., Sneider, A., Jayatilaka, H., Phillip, J. M., Wirtz, D., & Muñoz-Barrutia, A. (2021). Use of the p-values as a size-dependent function to address practical differences when analyzing large datasets. Scientific Reports, 11, 20942. doi:10.1038/s41598-021-00199-5
  5. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300. doi:10.1111/j.2517-6161.1995.tb02031.x
  6. Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8, 3–62.
  7. Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567, 305–307. doi:10.1038/d41586-019-00857-9
  8. Goodman, S. (2008). A dirty dozen: twelve p-value misconceptions. Seminars in Hematology, 45(3), 135–140. doi:10.1053/j.seminhematol.2008.04.003
  9. Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179–185. doi:10.1038/nmeth.3288

Cite this article

Boix Campos, J. (2026). On the “significance” of the p-value. Vivum. https://vivum-pub.org/editorial/significance-p-value

@article{boixcampos2026pvalue,
  author  = {Boix Campos, Javier},
  title   = {On the ``significance'' of the p-value},
  journal = {Vivum},
  year    = {2026},
  url     = {https://vivum-pub.org/editorial/p-value-significance},
}