Visualizing the Kolmogorov-Smirnov Test

Suppose you have developed a diagnostic test to screen for a disease from blood results. Before deploying it, you want to know whether it actually separates patients who carry the condition from those who don't. Or suppose the test has been running in a clinic for six months and you want to check whether the patients arriving now look statistically similar to those it was trained on. Or perhaps you are simply checking whether your measurements are roughly normally distributed before applying a method that assumes normality.

These are all, at their core, the same kind of question: are these two distributions the same? The Kolmogorov-Smirnov test [1] is one of the most widely used tools for answering it. It is simple, non-parametric (it makes no assumptions about the shape of the distributions), and it has an intuitive geometric interpretation, which means we can understand it visually.

[1] The one-sample statistic was introduced by Kolmogorov (1933); Smirnov (1948) extended it to the two-sample case used here. Both are collected under Reference 1.

The Empirical Cumulative Distribution Function

Before we can define the KS statistic, we need to understand the empirical cumulative distribution function, or ECDF. Given a sample of \(n\) observations, the ECDF is a step function that rises by \(1/n\) at each data point. For any value \(x\), it tells you what fraction of your sample falls at or below \(x\).

The ECDF is the simplest possible estimate of a distribution's shape. No binning, no bandwidth choices, no smoothing parameters. Just the data, ordered and counted. This is what makes the KS test distribution-free: it works on the ECDF directly, so it doesn't care whether your data is Gaussian, exponential, or something you've never seen before. [2]

[2] Formally, the Glivenko–Cantelli theorem guarantees that the ECDF converges uniformly to the true CDF as \(n \to \infty\) with probability 1.
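The definition above translates almost directly into code. The following is a minimal sketch in plain Python; the names `ecdf` and `F` are illustrative and not taken from any library.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F(x): the fraction of `sample` at or below x."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must not be empty")

    def F(x):
        # bisect_right counts how many ordered values are <= x
        return bisect_right(ordered, x) / n

    return F


F = ecdf([2.0, 0.5, 1.5, 0.5])
print(F(0.0), F(0.5), F(1.0), F(2.0))  # 0.0 0.5 0.5 1.0
```

Note that the tied value 0.5 makes the function jump by 2/4 at that point: each observation contributes \(1/n\), and ties stack.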

Figure 1. Two empirical CDFs, Sample A and Sample B. Each step function shows the fraction of observations at or below a given value. The shaded region between them is the gap that the KS statistic measures.

The KS Statistic

The KS statistic \(D\) is the maximum vertical distance between two ECDFs:

\[ D = \max_x \left| \hat{F}_1(x) - \hat{F}_2(x) \right| \]

That's it. You lay two step functions on top of each other and find the point where they're farthest apart. If the two samples come from the same distribution, you'd expect their ECDFs to track each other closely, and \(D\) to be small. If they come from different distributions, the ECDFs will diverge somewhere, and \(D\) will be large.
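As a sketch of how this maximum can be found, the function below walks through both sorted samples at once and records the largest gap seen after each step. The name `ks_statistic` is illustrative; the only assumption is that both samples are non-empty lists of numbers.

```python
def ks_statistic(a, b):
    """Maximum vertical distance between the ECDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        x = min(a[i], b[j])
        # advance both ECDFs past every observation equal to x (handles ties)
        while i < n1 and a[i] == x:
            i += 1
        while j < n2 and b[j] == x:
            j += 1
        # i / n1 and j / n2 are now the two ECDF heights at x
        d = max(d, abs(i / n1 - j / n2))
    # once one sample is exhausted its ECDF is 1, so the gap can only shrink
    return d


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
print(ks_statistic([1, 2], [3, 4]))              # 1.0
```

The gap only needs checking at observed data points, because both step functions are flat everywhere else.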

There's something satisfying about this. The KS statistic doesn't average over the whole distribution, or summarize the difference in means or variances. It finds the single point of maximum disagreement: the worst case. This makes it sensitive to differences in shape, not just location. Two distributions with the same mean but different variances, or the same variance but different skew, can still be caught by the KS test, given enough data.

In the figure below, try changing the distributions and watch how the gap responds. Pay attention to where the maximum gap occurs: it's often not where you'd intuitively guess.

Figure 2. The KS statistic in action. Interactive controls: Distribution A, Distribution B, and sample size (default n = 150); readout: the D statistic. The shaded region shows the gap between the two ECDFs (the KS gap, \(D\)); the dashed line marks where it's widest. Change the distributions to see how \(D\) responds.

The p-value

By itself, \(D\) is just a distance. To turn it into a statistical test, we need to ask: if the two samples really did come from the same distribution, how likely is it that we'd observe a gap this large or larger just by chance? That probability is the p-value.

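Keeping only the leading term of the Kolmogorov series, the approximate p-value of the two-sample KS test is:

\[ p \approx 2 \exp\!\left(-2 \left[\left(\sqrt{\frac{n_1 n_2}{n_1+n_2}} + 0.12 + \frac{0.11}{\sqrt{\frac{n_1 n_2}{n_1+n_2}}}\right) D\right]^2\right) \]

where \(n_1\) and \(n_2\) are the two sample sizes. The approximation is most accurate when \(p\) is small; for very small \(D\) the expression exceeds 1 and should be capped there.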

The key insight is that the critical value of \(D\) shrinks as sample size grows. With tiny samples, a large gap might just be noise; with thousands of observations, even a small gap can be statistically significant. This is important, and it's why you should never interpret \(D\) in isolation from your sample size. [3]

[3] The correction factors 0.12 and 0.11 in the p-value formula are from Press et al., Numerical Recipes, 3rd ed. (2007), §14.3.
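The formula is short enough to evaluate by hand, but a small function makes the sample-size effect easy to explore. This is a sketch of the approximation above only, not an exact p-value; the name `ks_pvalue` and the cap at 1 are choices made here for illustration.

```python
import math


def ks_pvalue(d, n1, n2):
    """Approximate two-sample KS p-value (leading term of the series)."""
    root_ne = math.sqrt(n1 * n2 / (n1 + n2))
    lam = (root_ne + 0.12 + 0.11 / root_ne) * d
    # the single-term expression can exceed 1 for small lam, so cap it
    return min(1.0, 2.0 * math.exp(-2.0 * lam * lam))


# the same gap, D = 0.10, at two different sample sizes
p_small = ks_pvalue(0.10, 100, 100)
p_large = ks_pvalue(0.10, 1000, 1000)
print(p_small, p_large)
```

With 100 observations per sample, a gap of 0.10 gives a p-value of roughly 0.71; with 1,000 per sample the same gap gives a p-value below 0.001.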

A note on significance. A small p-value (typically < 0.05) tells you that the observed gap is unlikely under the null hypothesis that the distributions are identical. But statistical significance is not practical significance. With enough data, even trivially small differences become “significant.” The p-value answers “is there a difference?”, not “does the difference matter?”
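To put rough numbers on this, the approximate p-value formula can be inverted: setting \(p = \alpha\) and solving for \(D\) gives the smallest gap that would be declared significant at level \(\alpha\). The symbols \(D_\alpha\) and \(n_e\) are shorthand introduced here, and the result inherits the approximation's limits.

\[
D_\alpha \approx \frac{\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}}{\sqrt{n_e} + 0.12 + 0.11/\sqrt{n_e}},
\qquad n_e = \frac{n_1 n_2}{n_1 + n_2}
\]

For \(\alpha = 0.05\) the numerator is about 1.358. With \(n_1 = n_2 = 100\) this gives \(D_{0.05} \approx 0.188\); with \(n_1 = n_2 = 10{,}000\) it gives \(D_{0.05} \approx 0.019\). A gap of two percentage points between the ECDFs is enough to reject the null at that sample size, whether or not it matters in practice.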

Interactive Playground

The best way to build intuition is to experiment. Below, you can choose two distributions, tune their parameters, adjust sample sizes, and see the ECDFs, \(D\), and the p-value update in real time. Hit “Resample” to draw fresh random samples with the same settings; notice how \(D\) fluctuates from sample to sample, especially with small \(n\).

Try this: set both distributions to Normal with the same parameters and click Resample a few times. You'll see \(D\) bounce around even though the true distributions are identical. Now change one parameter slightly and watch what happens.
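The same experiment can be reproduced offline. The sketch below draws pairs of Normal samples with Python's `random` module and prints \(D\) for five resamples, first with identical parameters and then with the mean of B shifted by 0.5. The sample size, the shift, and the seed are arbitrary choices made here, and `ks_statistic` and `draw` are illustrative names.

```python
import random


def ks_statistic(a, b):
    """Maximum vertical distance between the ECDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    n1, n2 = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        x = min(a[i], b[j])
        while i < n1 and a[i] == x:
            i += 1
        while j < n2 and b[j] == x:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    return d


rng = random.Random(0)  # fixed seed so the run is repeatable


def draw(n, mu=0.0, sigma=1.0):
    """Draw n observations from Normal(mu, sigma)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


# "Resample" five times with identical distributions, then with B shifted
same = [ks_statistic(draw(200), draw(200)) for _ in range(5)]
shifted = [ks_statistic(draw(200), draw(200, mu=0.5)) for _ in range(5)]
print("identical:", [round(d, 3) for d in same])
print("shifted:  ", [round(d, 3) for d in shifted])
```

Even in the identical case \(D\) is never exactly zero. What the shift changes is the typical size of \(D\), not whether a gap exists.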

Figure 3. Full interactive playground. Controls for each of Distribution A and Distribution B: distribution type, two parameters (defaults μ = 0.0, σ = 1.0), and sample size (default n = 200). Readouts: the D statistic and the approximate p-value. Adjust distributions, parameters, and sample sizes. The shaded region and dashed line show the KS gap. The p-value tells you how surprising that gap would be if the distributions were truly identical.

What Makes a Good KS Value?

This is the part that trips people up. There is no universal answer. A KS value of 0.35 can be excellent in one context and alarming in another. The number is the same; what changes is what you're using it for.

The reason is simple: in some settings, a large gap between distributions is exactly what you want (your model successfully separates two groups). In others, a large gap is exactly what you don't want (your production data has drifted from your training data). And in yet others, you're simply checking whether a distributional assumption holds, and any gap at all is bad news.

Classification & Separation

You have a diagnostic test and want to know whether it reliably separates two groups, say patients who carry a condition from those who don't. Higher is better.

Scale: no separation (low) to strong separation (high). KS = 0.35: reasonable.

As a rough rule of thumb, a KS between 0.20 and 0.40 means the test has useful discriminatory power, and above 0.50 it reliably distinguishes the groups. Near zero, the scores of the two populations overlap almost entirely and the test provides no signal.
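In this setting it is often useful to know not just how large the gap is but where it occurs, since that score is the cut-off at which the two groups are best separated. The sketch below returns both. The function name `ks_separation` and the score lists are hypothetical, made up purely for illustration.

```python
def ks_separation(neg_scores, pos_scores):
    """Return (D, score): the KS gap between two score samples and the
    score at which that gap is first reached."""
    a, b = sorted(neg_scores), sorted(pos_scores)
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    best_d, best_x = 0.0, None
    while i < n1 and j < n2:
        x = min(a[i], b[j])
        while i < n1 and a[i] == x:
            i += 1
        while j < n2 and b[j] == x:
            j += 1
        gap = abs(i / n1 - j / n2)
        if gap > best_d:
            best_d, best_x = gap, x
    return best_d, best_x


# hypothetical test scores for the two patient groups
healthy = [0.1, 0.2, 0.3, 0.4, 0.6]
carriers = [0.5, 0.55, 0.7, 0.8, 0.9]
d, cut = ks_separation(healthy, carriers)
print(d, cut)  # 0.8 0.4
```

Here four of five healthy scores and none of the carrier scores fall at or below 0.4, which gives the gap of 0.8. Read this way, \(D\) is the largest difference between the true positive rate and the false positive rate over all possible thresholds.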

Distribution Shift

A model trained last year is running in production. Each week you compare incoming data against the training set to check that inputs haven't fundamentally changed. Lower is better.

Scale: stable (low) to shifted (high). KS = 0.35: shift detected.

As a rule of thumb, below 0.10 the inputs can be treated as stable, and above 0.10 a shift is worth investigating. At 0.35 the two distributions look substantially different; the model may be operating on data unlike anything it has seen before.

Goodness-of-Fit

Before applying a method that assumes a particular distribution, you check how closely your data matches it. Closer to 0 is better.

Scale: close fit (low) to poor fit (high). KS = 0.35: poor fit.

A KS of 0.35 means the largest gap between your data's ECDF and the theoretical CDF reaches 35 percentage points. That is a substantial disagreement; the distributional assumption does not hold well for this data.
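For the one-sample case, the gap is measured between the sample's ECDF and a fully specified theoretical CDF. Because the ECDF jumps at each data point, both the top and the bottom of each step have to be checked. A minimal sketch follows, using a Normal CDF built from `math.erf`; the function names are illustrative, and the parameters are assumed known rather than estimated from the data.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of Normal(mu, sigma) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_one_sample(sample, cdf):
    """Largest gap between the sample's ECDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # the ECDF jumps from (i - 1) / n to i / n at x: check both sides
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


d = ks_one_sample([-1.0, 0.0, 1.0], normal_cdf)
print(round(d, 4))  # 0.1747
```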

The lesson: the KS statistic is a pure measure of distance. Whether that distance signals success or raises a warning depends entirely on what the comparison is for.

When to Use the KS Test (and When Not To)

The KS test is most useful when you want a general test for distributional difference and don't have a specific alternative hypothesis in mind. Because it's sensitive to any kind of difference (location, scale, shape), it's a good first-pass check.

But this generality comes at a cost. The KS test is often less powerful than tests tailored to a specific alternative. If you specifically want to detect a shift in means, a t-test will generally be more sensitive, particularly when the data is approximately normal or the sample is large. If you want to detect a difference in variance, an F-test is better, provided the data is close to normal. The KS test trades sensitivity for breadth: it can detect any difference, but it's not the best at detecting particular differences.

The KS test also assumes continuous distributions. If your data has many tied values (as is common with discrete or ordinal data), the test becomes conservative and will fail to detect real differences more often than it should. And in its one-sample form, where you test your data against a specific theoretical CDF, the p-values are only valid when that distribution's parameters are fully known in advance. If you estimate μ and σ from the same sample and then run the KS test, the fitted curve is pulled toward the data, so \(D\) comes out smaller than it would against the true distribution. The standard p-values are then too large: the test becomes conservative, rejecting the null less often than the nominal level and losing power (the Lilliefors correction addresses this for the normal case). [4]

[4] Lilliefors (1967) derived simulation-based critical values for the case where μ and σ are estimated from the same sample; they are smaller than the values in the standard KS table.
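The effect of estimating parameters from the same sample can be checked by simulation. The sketch below draws Normal samples for which the null hypothesis is true and counts how often a nominal 5% test rejects, once against the known N(0, 1) and once against a Normal fitted to each sample. It rests on an assumption: the one-sample p-value uses the same leading-term approximation as the two-sample formula, with \(n\) in place of \(n_1 n_2/(n_1+n_2)\). The sample size, repetition count, and seed are arbitrary.

```python
import math
import random
import statistics


def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_one_sample(sample, cdf):
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_pvalue_one_sample(d, n):
    # assumed one-sample analogue of the approximate formula (sqrt(n) for n_e)
    root_n = math.sqrt(n)
    lam = (root_n + 0.12 + 0.11 / root_n) * d
    return min(1.0, 2.0 * math.exp(-2.0 * lam * lam))


rng = random.Random(1)
n, reps = 50, 1000
reject_known = reject_fitted = 0
for _ in range(reps):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the null is true here
    # (a) parameters known in advance
    if ks_pvalue_one_sample(ks_one_sample(xs, normal_cdf), n) < 0.05:
        reject_known += 1
    # (b) parameters estimated from the same sample
    mu_hat, sigma_hat = statistics.mean(xs), statistics.stdev(xs)
    d_fit = ks_one_sample(xs, lambda x: normal_cdf(x, mu_hat, sigma_hat))
    if ks_pvalue_one_sample(d_fit, n) < 0.05:
        reject_fitted += 1

rate_known = reject_known / reps
rate_fitted = reject_fitted / reps
print("known parameters: ", rate_known)
print("fitted parameters:", rate_fitted)
```

With known parameters the rejection rate should land near the nominal 5%. With fitted parameters it should fall far below that, which is the gap the Lilliefors critical values are designed to close.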

Despite these caveats, the KS test remains one of the most intuitive and widely applicable tools in statistics. Its geometric interpretation (the maximum gap between two step functions) is something you can see and feel. And now that you've played with it above, you do.

References

1. Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, 83–91; Smirnov, N. V. (1948). Table for estimating the goodness of fit of empirical distributions. Annals of Mathematical Statistics, 19(2), 279–281.
2. Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4, 92–99.
3. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2007). Numerical Recipes: The Art of Scientific Computing, 3rd ed. Cambridge University Press. §14.3.
4. Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318), 399–402.

Cite this article

Boix Campos, J. (2026). The Kolmogorov-Smirnov Test. Vivum. https://vivum-pub.org/editorial/kolmogorov-smirnov-test

@article{boixcampos2026kstest,
  author  = {Boix Campos, Javier},
  title   = {The Kolmogorov-Smirnov Test},
  journal = {Vivum},
  year    = {2026},
  url     = {https://vivum-pub.org/editorial/kolmogorov-smirnov-test},
}