Ridge regression is a prior wearing a frequentist disguise

foundations

statistics

Bayesian

The penalty you add to a regression to stop it overfitting isn’t an ad-hoc trick. It’s a prior belief about the coefficients, written in different notation — and seeing it that way tells you what you’re actually assuming.

Author

Umut Altun

Published

December 10, 2024

For a long time I held two ideas in separate boxes. In one box: regularization — the thing you do to a regression to stop it overfitting, where you add a penalty on the size of the coefficients and tune its strength with cross-validation. A practical trick, frequentist, mechanical. In the other box: Bayesian priors — beliefs about parameters you state before seeing data, philosophical, a different culture entirely. It took someone pointing it out before I saw that the two boxes contain the same object.

Ridge regression — least squares with a penalty on the sum of squared coefficients — is exactly the same computation as finding the most probable coefficients under a Gaussian prior centred at zero. Not similar, not analogous: the same optimisation problem, the same answer, written twice. The penalty strength and the prior’s tightness are the same knob in two notations. Lasso, with its penalty on absolute coefficient values, is the same story with a different prior — a Laplace distribution, which is spikier at zero and explains, in one line, why lasso drives coefficients exactly to zero while ridge only shrinks them toward it. The frequentist “penalty term” and the Bayesian “prior” are the same act of telling the model what to believe before it looks at the data.
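The "shrinks versus zeroes" contrast is easiest to see with a single coefficient. The sketch below assumes a one-dimensional problem where the unpenalised estimate is just a number z; the function names and the value of lam are illustrative, not taken from any library. Under those assumptions the ridge solution is a rescaling of z, while the lasso solution is soft-thresholding, which snaps small values to exactly zero.

```python
import numpy as np

def ridge_1d(z, lam):
    # argmin over b of 0.5*(z - b)**2 + 0.5*lam*b**2  (Gaussian prior, MAP)
    return z / (1.0 + lam)

def lasso_1d(z, lam):
    # argmin over b of 0.5*(z - b)**2 + lam*|b|  (Laplace prior, MAP)
    # This is soft-thresholding: anything with |z| <= lam becomes exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.4, 0.1, 0.6, 3.0])  # hypothetical unpenalised estimates
lam = 0.5                                  # hypothetical penalty strength

print("ridge:", ridge_1d(z, lam))  # every entry shrunk, none exactly zero
print("lasso:", lasso_1d(z, lam))  # small entries set to zero, the rest shifted
```

Ridge multiplies every estimate by the same factor, so a nonzero input can never come out as zero. Lasso subtracts a fixed amount and stops at zero, which is the spike of the Laplace prior showing up in the arithmetic.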

Once you see it, you can’t unsee it, and it reorganises a lot. Adding an L2 penalty is saying “I believe these coefficients are probably small, clustered around zero.” That’s not a neutral, assumption-free safeguard. It’s a belief, as opinionated as any Bayesian prior; you were just expressing it through the back door of an optimisation penalty instead of the front door of a probability distribution. I made this same argument about marketing mix models (that ordinary least squares and ridge both encode beliefs, they just don’t admit it), and this is the general version. There is no such thing as an unopinionated model. There are only models whose opinions are stated openly and models whose opinions are smuggled in as defaults.

minimise:  ‖y − Xβ‖²  +  λ‖β‖²          # ridge: a penalty you tune
        =  (least squares)  +  (penalty)

equivalently, maximise the posterior under:
    y | β  ~  Normal(Xβ, σ²)             # the likelihood (the fit)
    β      ~  Normal(0, τ²)              # a PRIOR: "coefficients are ~small"
    with   λ  ∝  σ²/τ²                   # penalty strength = prior tightness

That λ ∝ σ²/τ² is the whole bridge in one line. With the objective scaled exactly as written above, the proportionality is an equality, λ = σ²/τ²; rescaling the loss (say, averaging the squared error over the n observations) only changes the constant. A bigger penalty is a tighter prior: a stronger prior belief that coefficients are near zero. A penalty of zero is an infinitely wide prior: no belief at all, pure least squares, free to overfit. When you cross-validate to “tune the regularization strength,” you are, in Bayesian terms, letting the data tell you how strong your prior should have been. The two cultures are solving one problem and arguing about the vocabulary.
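The equivalence can be checked numerically. The following is a minimal sketch on simulated data, assuming no intercept, a known noise scale sigma, and a shared prior scale tau; all the specific numbers are made up for illustration. It computes the ridge solution from its closed form and the posterior mode from the Gaussian likelihood and prior, then compares them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (all values here are illustrative assumptions).
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -0.5, 0.0])
sigma, tau = 1.0, 0.5                      # noise sd, prior sd
y = X @ beta_true + sigma * rng.normal(size=n)

# Frequentist notation: minimise ||y - X b||^2 + lam * ||b||^2.
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Bayesian notation: y | b ~ Normal(X b, sigma^2), b ~ Normal(0, tau^2).
# The posterior is Gaussian, so its mode equals its mean:
#   precision = X'X / sigma^2 + I / tau^2
#   mode      = precision^{-1} X'y / sigma^2
post_precision = X.T @ X / sigma**2 + np.eye(p) / tau**2
beta_map = np.linalg.solve(post_precision, X.T @ y / sigma**2)

print("ridge:", beta_ridge)
print("MAP  :", beta_map)
print("same answer:", np.allclose(beta_ridge, beta_map))
```

Multiplying the posterior-mode equation through by sigma² turns it into the ridge equation with lam = sigma²/tau², which is why the two vectors agree up to floating-point error.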

Why does this matter beyond being a neat fact? Because the prior framing tells you what you’re assuming, and the penalty framing hides it. “Coefficients are small around zero” is a fine default for many problems and a terrible one for others — if you have a predictor you know is strong, an L2 penalty is actively fighting you, insisting it should be small when you know it isn’t. Seeing the penalty as a prior makes that conflict legible: you’d never knowingly put a tight zero-centred prior on a coefficient you believe is large, but you’ll do exactly that, by reflex, when you slap a uniform ridge penalty on every coefficient including that one. The Bayesian view also points at the fix — different priors for different coefficients, which in penalty-land is the slightly awkward “feature-specific regularization,” and in prior-land is just… using what you know.
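To make the "different priors for different coefficients" idea concrete, here is a hedged sketch on simulated data. The helper `map_estimate`, the prior scales in `taus`, and the true coefficients are all hypothetical choices for illustration. Each coefficient gets its own zero-centred Gaussian prior, which in penalty terms is a per-coefficient λ_j = σ²/τ_j² on the diagonal.

```python
import numpy as np

def map_estimate(X, y, sigma, taus):
    """MAP coefficients under independent priors beta_j ~ Normal(0, taus[j]**2)."""
    # Per-coefficient penalty: lambda_j = sigma^2 / tau_j^2.
    penalties = sigma**2 / np.asarray(taus, dtype=float) ** 2
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)

rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 3))
beta_true = np.array([4.0, 0.3, -0.2])   # first predictor is genuinely strong
sigma = 1.0
y = X @ beta_true + sigma * rng.normal(size=n)

# Reflexive ridge: the same tight prior on everything, strong predictor included.
uniform = map_estimate(X, y, sigma, taus=[0.2, 0.2, 0.2])
# Using what you know: a wide prior on the first coefficient, tight on the rest.
informed = map_estimate(X, y, sigma, taus=[10.0, 0.2, 0.2])

print("true    :", beta_true)
print("uniform :", uniform.round(2))
print("informed:", informed.round(2))
```

In this setup the uniform prior pulls the first coefficient well below its true value, while the informed prior leaves it close to the data and still shrinks the other two.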

I don’t think one framing is correct and the other wrong. They’re genuinely the same thing, and I move between them by convenience: when I want a fast fit with a library, I think “ridge” and tune λ; when I want to reason about what the model assumes, or put a belief on a specific parameter, or get honest uncertainty, I think “prior” and reach for the Bayesian version. What changed for me wasn’t a technique — it was realising the wall between the two boxes was never there. Regularization was Bayesian all along; nobody told me, because the frequentist notation is careful never to mention it.

So now, when I regularize, I ask what prior I’m implicitly asserting, and whether I’d actually endorse it if it were written as a belief instead of a penalty. Usually I would. Sometimes, written plainly, the assumption is obviously wrong for the problem — and I’d never have noticed while it was disguised as a harmless-looking λ.¹

From the statistical-learning foundations of my coursework — the connection that quietly unified two things I’d kept apart. Code is schematic.

Footnotes

¹ One honest caveat on “the same thing”: ridge equals the maximum a posteriori estimate under a Gaussian prior, that is, the single most probable coefficient set, the peak of the posterior. The full Bayesian treatment gives you the whole posterior distribution, not just its peak, and that distribution is where calibrated uncertainty lives. So regularized point estimates are the mode of the Bayesian answer with the rest thrown away. For a point prediction that’s often all you need; when you need to know how sure to be, the discarded part is exactly the part that mattered.