Checking a model by asking it to imagine data

statistics

Bayesian

model checking

Goodness-of-fit numbers tell you a model is wrong. Posterior predictive checks tell you where and how — by having the model generate fake datasets and seeing which features of reality it can’t reproduce.

Author

Umut Altun

Published

November 7, 2023

Here’s a question that’s easy to ask and weirdly hard to answer well: is my model any good? You can quote an R², a DIC, an error metric — single numbers that rank models against each other. But they don’t tell you how a model is wrong, which is the thing you actually need to fix it. For that, the most useful tool I know is almost embarrassingly intuitive: make the model imagine data, and check whether its imagination matches the world.

That’s the whole idea behind a posterior predictive check. You’ve fit a Bayesian model, so you have a posterior — a distribution over all the parameters. Now run it forward: draw a set of parameters from the posterior, and use them to generate a complete fake dataset the same size as your real one. Do that a few hundred times and you have a few hundred datasets the model considers plausible. If the model is any good, your real dataset should look like it belongs in that crowd. If reality is an obvious oddball next to the model’s imagined data, the model is missing something — and the way it stands out points straight at what.

The trick is choosing what to compare. You don’t compare the datasets point by point; you pick a summary statistic — some feature of the data you care about — and check whether the real value sits comfortably inside the spread of imagined values. Pick the maximum, and you’re asking “can my model produce extremes as large as the ones I actually see?” Pick the standard deviation, the skew, the number of values above some threshold — each one probes a different way the model might be lying to you.

When I did this on the Rome rental-price model, the Gaussian version failed exactly where I’d have guessed if I’d thought about it. I used the largest price as the statistic. Across hundreds of datasets the Gaussian model imagined, the biggest listing it ever dreamed up fell well short of the real top-end prices. Reality’s maximum sat out in the far tail of what the model thought possible — a clear signal that the model couldn’t generate the extremes the city actually contains. The Student-t version, with its heavy tails, imagined datasets whose maxima comfortably bracketed the real one. Same check, opposite verdict, and the check told me precisely which feature — the tails — separated the two.

You can compress that into a single number, a Bayes p-value: the fraction of imagined datasets whose statistic exceeds the real one.

# T() is your chosen statistic — here, the maximum price
p_value <- mean( T(y_rep) >= T(y_obs) )   # y_rep: imagined; y_obs: real
# ~0.5  -> real data sits in the middle of the imagined crowd (good)
# ~0 or ~1 -> real data is an outlier the model can't reproduce (bad)

A value near 0.5 means your real statistic lands right in the middle of the imagined ones — the model reproduces that feature fine. Near 0 or 1 means the real data is out at the edge of the model’s imagination, and that feature is a misfit.

Now the part that’s worth being careful about, because it gets misread constantly: this is not a hypothesis test. The Bayes p-value looks like a frequentist p-value, and it is not one. There’s no null hypothesis, no Type I error rate, no 0.05 line to cross. A value of 0.02 doesn’t “reject” your model at 2% significance. It’s a discrepancy measure — it says “the model reproduces this particular feature poorly,” and that’s all. Treating it as a pass/fail test is how people turn a tool for understanding their model into a worse version of a tool for grading it.

What I actually do with PPCs is run several at once, each probing a different feature, and read them as a map of where the model is strong and weak. Maybe it captures the centre of the distribution beautifully and the tails terribly — useful, that tells you to change the likelihood, not the predictors. Maybe it nails the spread but misses a skew — different fix. The single goodness-of-fit number would have told me “Student-t is better.” The posterior predictive check told me why: the Gaussian literally could not imagine a city with prices this extreme, and once I saw that, the fix chose itself.¹

From a Bayesian regression on public rental-listing data for Rome. Code trimmed to the idea.

Footnotes

Why the maximum was the right statistic to check here and the mean would have been useless: both models reproduce the mean fine — that’s what the linear predictor is for. The disagreement was entirely in the tails, so a tail-sensitive statistic (the max, or a high quantile, or a count of extreme values) was the only thing that would surface it. Choosing a statistic that the competing models agree on tells you nothing; the art is picking ones where they’d differ.↩︎