When to stop trusting least squares
I was fitting a model for nightly rental prices in Rome: log price regressed on the usual predictors (number of guests, bedrooms, room type, review scores). The fit looked fine until I plotted the residuals, and there they were: a scatter of listings priced far above anything the model expected. A penthouse near the Spanish Steps asking ten times the going rate for its size. The model hated that listing, and it was taking its revenge on every other prediction.
That’s the thing about ordinary least squares that’s easy to forget: it minimises squared error, so a residual of 10 counts a hundred times more than a residual of 1. A few extreme listings don’t just get predicted badly — they bend the entire fit toward themselves, and they inflate your estimate of the noise everywhere, so suddenly the model is uncertain about ordinary apartments it should have nailed. The outliers don’t stay in their corner. They tax the whole model.
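A toy illustration of that tax, using made-up numbers rather than the Rome listings. The helper `ols_line` and the ten data points are hypothetical; the point is only to watch one high point drag the slope and inflate the residual standard deviation for everything else.

```python
import math

def ols_line(xs, ys):
    """Closed-form simple linear regression.

    Returns (intercept, slope, residual_sd), with residual_sd computed
    on n - 2 degrees of freedom.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(rss / (n - 2))

# Hypothetical toy data: y = 1 + 0.5 * x plus a small alternating wobble.
xs = list(range(1, 11))
ys = [1 + 0.5 * x + (0.1 if x % 2 else -0.1) for x in xs]
clean = ols_line(xs, ys)

# Same data, but the last point sits 10 units above the line (the "penthouse").
ys_outlier = ys[:-1] + [ys[-1] + 10.0]
bent = ols_line(xs, ys_outlier)

print("clean: intercept=%.2f slope=%.2f residual_sd=%.2f" % clean)
print("bent:  intercept=%.2f slope=%.2f residual_sd=%.2f" % bent)
```

One point out of ten is enough to roughly double the slope here, and the residual standard deviation grows by more than an order of magnitude, which is the "uncertain about ordinary apartments" effect in miniature.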
The reflex fix is to delete them. Find the listings above some price cutoff, drop them, refit. I’ve done it. It works, sort of, and it always felt like cheating, because I was making a modelling decision — “these points don’t count” — by hand, with a threshold I picked because the plot looked better afterward. Where exactly is the line between a real luxury listing and an outlier? I didn’t have a principled answer, just a delete key.
The better fix is to stop assuming the errors are Gaussian in the first place.
A Gaussian likelihood says errors are small and symmetric and almost never large — its tails fall off so fast that a big residual is, as far as the model is concerned, nearly impossible. So when a big residual does show up, the model can’t shrug it off; it has to contort itself to explain it, because its own assumptions told it that point shouldn’t exist. The whole problem is that the likelihood refuses to believe in outliers, so it’s forced to take every one of them seriously.
The Student-t distribution has the same bell shape in the middle but genuinely heavy tails — it assigns real, non-trivial probability to the occasional extreme value. Use it as your likelihood and a wild listing is no longer a crisis. The model can say “that’s one of the rare big ones my tails already account for” and leave its estimates for everything else alone. You get robustness not by removing data but by being honest that extremes happen.
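To put a number on "nearly impossible" versus "rare but expected", here is a small sketch comparing log densities of a standard Gaussian and a unit-scale Student-t at increasingly large standardised residuals. The choice of ν = 4 is an arbitrary assumption for illustration, and the function names are mine.

```python
import math

def normal_logpdf(z):
    """Log density of a standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def student_t_logpdf(z, nu):
    """Log density of a Student-t with nu degrees of freedom, location 0, scale 1."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

NU = 4  # assumed value, chosen only to make the tails visibly heavy

for z in (0, 1, 3, 6, 10):
    gap = student_t_logpdf(z, NU) - normal_logpdf(z)
    print(f"z={z:>2}  normal={normal_logpdf(z):8.2f}  "
          f"t={student_t_logpdf(z, NU):8.2f}  gap={gap:7.2f}")
```

Near the centre the two curves are almost the same. By a residual of 10 the Gaussian log density is around 40 units lower than the Student-t's, so a Gaussian likelihood pays an enormous penalty for leaving that point unexplained while the Student-t barely notices.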
The part I like most is that you don’t have to specify how heavy the tails are. The Student-t has a degrees-of-freedom parameter, ν, that controls exactly that — low ν means heavy tails, high ν is indistinguishable from a Gaussian — and in a Bayesian model you just put a prior on it and let the data decide. Here’s the whole thing in JAGS:
# Gaussian: every residual must be small. one big one and sigma balloons.
y[i] ~ dnorm(mu[i], tau)
# Student-t: same centre, heavy tails. nu is LEARNED, not fixed.
y[i] ~ dt(mu[i], tau, nu)
nu_minus_two ~ dexp(1/30) # prior on (nu - 2), keeps variance finite
nu <- nu_minus_two + 2
That nu being inferred rather than set is the quiet luxury of the Bayesian version. You’re not committing to “this data is heavy-tailed” up front; you’re asking the data how heavy, and getting a posterior over the answer. On the Rome listings ν landed well below the point where it would look Gaussian: the data was telling me, in a number, that yes, the tails are fat, here’s how fat. I didn’t have to pick a price cutoff. I didn’t delete a single listing. The penthouse stayed in the dataset and simply stopped dominating it.
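It helps to see what the prior in that snippet actually believes before any data arrives. JAGS parameterises `dexp` by its rate, so `dexp(1/30)` on `nu_minus_two` is an exponential with mean 30, shifted up by 2. The sketch below reads the prior mean and a few cumulative probabilities off in closed form; the function name `prior_cdf_nu` is mine.

```python
import math

RATE = 1 / 30  # dexp(1/30): the argument is a rate, so the mean is 1/RATE
SHIFT = 2      # nu = nu_minus_two + 2

def prior_cdf_nu(k):
    """Prior probability that nu < k when nu - 2 ~ Exponential(rate=1/30)."""
    if k <= SHIFT:
        return 0.0
    return 1.0 - math.exp(-RATE * (k - SHIFT))

prior_mean_nu = SHIFT + 1 / RATE

print(f"prior mean of nu: {prior_mean_nu:.1f}")
for k in (5, 10, 30, 100):
    print(f"P(nu < {k:>3}) = {prior_cdf_nu(k):.3f}")
```

Roughly a quarter of the prior mass sits below ν = 10, where the tails are clearly heavier than a Gaussian's, so the prior leans toward near-Gaussian behaviour without ruling out fat tails.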
What changed in the results was subtle and exactly what you’d want. The coefficients moved a little — toward the bulk of the data, away from the pull of the extremes — and the noise estimate shrank, because the model was no longer inflating its global uncertainty to accommodate a dozen oddballs. The predictions for ordinary apartments got tighter and better. The outliers were still there, still predicted poorly, but they were now contained instead of contagious.
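The same pattern shows up in a toy version that needs no sampler. This is not the Bayesian model from the post: it is a maximum-likelihood Student-t line fit with ν fixed at 4 (an assumption), using the standard EM reweighting in which each point gets weight (ν + 1) / (ν + r²/σ²). The data and function names are hypothetical.

```python
import math

def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares line; returns (intercept, slope)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def student_t_line_fit(xs, ys, nu=4.0, n_iter=200):
    """EM for a line with Student-t errors and fixed nu.

    Returns (intercept, slope, scale). The scale is the t scale parameter,
    not a standard deviation, so it is only loosely comparable to the
    Gaussian residual sd.
    """
    n = len(xs)
    intercept, slope = weighted_line_fit(xs, ys, [1.0] * n)  # start from OLS
    scale2 = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / n
    for _ in range(n_iter):
        # E-step: large standardised residuals get small weights.
        resid = [y - intercept - slope * x for x, y in zip(xs, ys)]
        ws = [(nu + 1) / (nu + r * r / scale2) for r in resid]
        # M-step: refit the line and the scale with those weights.
        intercept, slope = weighted_line_fit(xs, ys, ws)
        resid = [y - intercept - slope * x for x, y in zip(xs, ys)]
        scale2 = sum(w * r * r for w, r in zip(ws, resid)) / n
    return intercept, slope, math.sqrt(scale2)

# Hypothetical toy data: y = 1 + 0.5 * x with a small wobble, last point 10 too high.
xs = list(range(1, 11))
ys = [1 + 0.5 * x + (0.1 if x % 2 else -0.1) for x in xs]
ys[-1] += 10.0

ols_intercept, ols_slope = weighted_line_fit(xs, ys, [1.0] * len(xs))
ols_sd = math.sqrt(sum((y - ols_intercept - ols_slope * x) ** 2
                       for x, y in zip(xs, ys)) / (len(xs) - 2))
t_intercept, t_slope, t_scale = student_t_line_fit(xs, ys)

print(f"OLS:       slope={ols_slope:.2f}  residual_sd={ols_sd:.2f}")
print(f"Student-t: slope={t_slope:.2f}  scale={t_scale:.2f}")
```

The Student-t fit recovers a slope close to the 0.5 used to generate the bulk of the data and a much smaller noise scale, while the high point keeps its large residual. The mechanism is the reweighting: nothing is deleted, the extreme point just stops counting for much.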
I’ll be straight about the cost. The Gaussian model is the one everyone reaches for partly because it’s so well-behaved — conjugate, fast, closed-form in places. Going Student-t means MCMC, and ν can be a slightly sticky parameter to sample (it changes the tail behaviour, which the sampler feels), so you watch your diagnostics more carefully. For a quick regression on clean data, none of this is worth it and the Gaussian is right. But “clean data” is the assumption I keep being wrong about. Real datasets have a luxury penthouse in them somewhere.
These days, when residuals show me a few points the model can’t stomach, my first move isn’t the delete key. It’s to ask whether I ever had grounds to assume light tails, and usually I didn’t. I just inherited the assumption from every regression tutorial I’d ever read.[1]
From a Bayesian regression on public rental-listing data for Rome. The models are real; the code is trimmed to the idea.
Footnotes
[1] There’s a subtlety in putting the prior on nu_minus_two rather than nu directly: the Student-t’s variance is only finite when ν > 2, so anchoring the parameter above 2 keeps the noise variance well-defined. The dexp(1/30) prior has a mean of 30 for nu_minus_two (so about 32 for ν), gently favouring near-Gaussian behaviour unless the data argues otherwise: a soft nudge toward the simpler model, overruled only when the tails really are heavy.