When you can’t write down the distribution, resample it

foundations
statistics
Putting a confidence interval on a mean is a formula. Putting one on a median, a ratio, or some tangled business metric usually isn’t. The bootstrap sidesteps the whole problem by treating your sample as a stand-in for the population and resampling from it.
Author

Umut Altun

Published

January 15, 2024

Someone asks how confident you are in a number. If the number is a mean, you’re fine — there’s a clean formula for the standard error, you learned it years ago, you quote the interval and move on. But the numbers people actually ask about are rarely means. They’re medians, 90th percentiles, ratios of one sum to another, the output of some multi-step calculation that mangles the data in ways no textbook anticipated. And for most of those, the tidy standard-error formula either doesn’t exist or rests on assumptions you know are false for your data.

I spent longer than I’d like to admit, early on, hunting for the right closed-form expression for the uncertainty of some awkward metric — a ratio of two correlated quantities, I think — before I remembered that I didn’t need the formula at all. There’s a method that gives you the uncertainty of almost any statistic without you having to derive anything, and the idea behind it is so simple it sounds like cheating.

Here’s the problem stated cleanly. You want to know how much your statistic would wobble if you’d drawn a different sample from the population — that wobble is the uncertainty. The honest way to measure it would be to go collect hundreds of fresh samples from the population, compute your statistic on each, and look at how much the answers vary. You obviously can’t do that. You have one sample. That’s the whole constraint.

The bootstrap’s move is audacious: treat the sample you have as if it were the population, and draw your hundreds of fresh samples from it. You resample your own data, with replacement, to the same size, so each synthetic sample includes some of your data points more than once and leaves others out entirely (on average, a resample of a reasonably large dataset contains roughly 63% of the distinct original points). Then you compute the statistic on each resample. The spread of those values approximates the spread you’d have seen from real fresh samples. Your one dataset, by being resampled, impersonates the population it came from.

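import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(size=500)       # placeholder sample; substitute your own data
weird_statistic = np.median          # placeholder; any function of a sample works

# want: how much would this statistic vary across fresh samples?
# can't get fresh samples. so resample the one you have, with replacement.
estimates = []
for _ in range(10_000):
    resample = rng.choice(data, size=len(data), replace=True)
    estimates.append(weird_statistic(resample))   # median, ratio, anything

lo, hi = np.percentile(estimates, [2.5, 97.5])     # a 95% percentile interval, no formula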
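
To see what "with replacement, to the same size" does to a single resample, here is a small sketch, assuming NumPy is available. The row count and the seed are arbitrary choices of mine, and the rows are just index numbers standing in for real data.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen arbitrarily, for repeatability
n = 1_000
indices = np.arange(n)            # stand-ins for the rows of a dataset

# One bootstrap resample: n draws with replacement from the n rows.
resample = rng.choice(indices, size=n, replace=True)

counts = np.bincount(resample, minlength=n)   # how many times each row was drawn
frac_included = np.mean(counts > 0)           # rows that appear at least once
frac_repeated = np.mean(counts > 1)           # rows that appear more than once

print(f"rows included at least once: {frac_included:.1%}")
print(f"rows drawn more than once:   {frac_repeated:.1%}")
print(f"theoretical inclusion rate:  {1 - (1 - 1/n) ** n:.1%}")
```

The inclusion rate should hover near 1 - 1/e, a bit under two thirds, which is why every resample is a noticeably different dataset even though it is built entirely from the original rows.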

The reason this isn’t cheating is worth sitting with, because it sounds like it should be. Your sample, if it’s a fair draw, already carries the shape of the population inside it — its spread, its skew, its lumps. Resampling from it reproduces that shape over and over, and the variation you get across resamples genuinely reflects how an estimate built from data of this shape and size behaves. It’s not conjuring information from nothing; it’s extracting the uncertainty that was already implied by the sample you have. You’re reading out something your data always knew but you couldn’t write a formula for.
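
One way to convince yourself of this is to try it somewhere you can cheat: a simulated population you can draw fresh samples from at will. The sketch below assumes NumPy and uses a lognormal distribution as a hypothetical population; the sample size, repetition counts, and the `draw_sample` helper are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(n):
    # Hypothetical "population": a right-skewed lognormal we can sample at will.
    return rng.lognormal(mean=0.0, sigma=1.0, size=n)

n = 1_000

# The "honest" route: many fresh samples, the median of each.
fresh_medians = [np.median(draw_sample(n)) for _ in range(2_000)]
true_se = np.std(fresh_medians, ddof=1)

# The bootstrap route: one sample, resampled many times.
one_sample = draw_sample(n)
boot_medians = [
    np.median(rng.choice(one_sample, size=n, replace=True))
    for _ in range(2_000)
]
boot_se = np.std(boot_medians, ddof=1)

print(f"spread across fresh samples: {true_se:.4f}")
print(f"spread across resamples:     {boot_se:.4f}")
```

The two numbers should land in the same ballpark rather than match exactly: the bootstrap estimate comes from a single sample, so it inherits that sample's quirks.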

What makes it a tool I actually reach for, rather than a curiosity, is the weird_statistic line being a black box. The bootstrap does not care what’s inside. Mean, median, 90th percentile, a ratio of sums, a Gini coefficient, the result of fitting a small model and reading off one coefficient: anything you can compute on a sample, you can bootstrap, with the identical handful of lines. All the standard-error formulas I never have to derive, never have to check the assumptions of, never have to get subtly wrong: that saved effort is most of why I like it. When the metric is gnarly, resampling is often less work and more trustworthy than finding and validating the analytic interval.
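
The black-box point can be sketched as a single function that accepts any statistic. This is one possible shape, assuming NumPy; `bootstrap_ci`, `gini`, `ratio_of_sums`, and the revenue and visits data are all hypothetical names and made-up numbers for illustration. It resamples rows rather than individual values, so paired quantities stay paired.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_resamples=5_000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(data); rows are resampled."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        rows = rng.integers(0, n, size=n)      # row indices, with replacement
        estimates[i] = statistic(data[rows])
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(estimates, [tail, 100 - tail])
    return lo, hi

def gini(x):
    # Gini coefficient via the sorted-values formula.
    x = np.sort(x)
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

def ratio_of_sums(rows):
    # rows[:, 0] is the numerator column, rows[:, 1] the denominator column.
    return rows[:, 0].sum() / rows[:, 1].sum()

rng = np.random.default_rng(1)
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=800)   # made-up data
visits = rng.poisson(lam=20, size=800) + 1               # made-up data
pairs = np.column_stack([revenue, visits])

print("median:       ", bootstrap_ci(revenue, np.median))
print("90th pct:     ", bootstrap_ci(revenue, lambda x: np.percentile(x, 90)))
print("gini:         ", bootstrap_ci(revenue, gini))
print("ratio of sums:", bootstrap_ci(pairs, ratio_of_sums))
```

Only the `statistic` argument changes between the four calls. For the ratio, the function receives whole rows, which is what preserves any correlation between numerator and denominator in each resample.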

I’ll be square about the limits, because the bootstrap’s simplicity makes it easy to over-trust. It leans on your sample being a fair picture of the population — feed it a biased sample and it faithfully reports the uncertainty of a biased estimate, with no warning. It struggles at the very edges: bootstrapping the maximum of a distribution, or any statistic dominated by the rarest tail values, behaves badly, because your sample’s extremes are a poor stand-in for the population’s. And with very little data the whole act of impersonation gets shaky — a sample of eight doesn’t carry enough shape to resample convincingly. It’s a workhorse for the broad middle of problems, not a universal solvent.
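
The trouble with the maximum is easy to see in a quick experiment. The following is a sketch assuming NumPy, with a made-up exponential sample; the sample size and resample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=200)   # made-up sample
sample_max = data.max()

boot_maxes = np.array([
    rng.choice(data, size=len(data), replace=True).max()
    for _ in range(5_000)
])

# A resample can never exceed the sample maximum, and it reproduces that
# maximum exactly whenever the top point is drawn at least once.
share_at_max = np.mean(boot_maxes == sample_max)
print(f"resamples whose max equals the sample max: {share_at_max:.1%}")
print(f"distinct values among bootstrap maxima:    {len(np.unique(boot_maxes))}")
```

Close to two thirds of the resamples simply return the sample's own maximum, and the rest fall on a handful of values just below it. None of them can say anything about values larger than the ones you happened to observe, which is exactly the part of the population the maximum depends on.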

But for that broad middle (a messy metric, a decent pile of data, a question of “how much should I trust this number”), the bootstrap is the first thing I reach for now, ahead of any search for the right formula. Uncertainty turned out not to be something I could only get by deriving it. Often I can just simulate it, by making the data pretend to be the world it was sampled from.[1]


From the statistical-methods foundations of my coursework — the technique I end up using more than any single formula. Code is illustrative.

Footnotes

  1. The same resampling instinct underlies permutation tests, which answer “could this difference be chance?” by shuffling labels instead of resampling rows, building the null distribution by brute force rather than assuming it. Bootstrap for confidence intervals, permutation for hypothesis tests; both replace “write down the distribution” with “generate it from the data you have.” Once that move is in your hands, a surprising number of statistics problems stop needing a formula at all.
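
The label-shuffling idea in this footnote can be sketched in the same few lines. This assumes NumPy and uses two made-up groups; the group sizes, means, and number of permutations are illustrative choices, and the test shown is a plain two-sided difference in means.

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=60)   # made-up data
group_b = rng.normal(loc=11.0, scale=2.0, size=60)   # made-up data

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10_000
null_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)               # shuffle the labels
    null_diffs[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# Two-sided p-value; the +1 counts the observed arrangement itself.
p_value = (np.sum(np.abs(null_diffs) >= abs(observed)) + 1) / (n_perm + 1)
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```

The shuffled differences form the null distribution directly: they show how large a gap arises when group membership is assigned at random, with no distributional assumption beyond exchangeability of the labels.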