Two information criteria that disagreed about my model

statistics

Bayesian

model comparison

DIC is the model-comparison number JAGS hands you for free. WAIC is the one the field quietly moved to. When they rank your models differently, the disagreement is telling you something about your posterior.

Author

Umut Altun

Published

May 8, 2024

I had two models — a Gaussian regression and a Student-t one — and a simple question: which generalises better? Bayesian model comparison is supposed to answer exactly this, with an information criterion: a single number trading off how well the model fits against how complex it is, so you can rank models without overfitting your way to a wrong winner. JAGS hands you DIC for that. I also computed WAIC, because I’d read it was the better choice. They picked the same winner, but the margin differed enough to make me actually understand the difference between them, which I’d been hand-waving for years.

Both criteria are doing the same broad thing: estimating how well the model would predict new data, and penalising effective complexity so a model can’t win just by having more parameters to bend. The difference is in how they estimate the fit, and it comes down to one choice that sounds technical and turns out to matter.

DIC — the deviance information criterion — measures fit using the deviance evaluated at the posterior mean of the parameters. It takes your whole posterior, collapses it to its average point, and asks how well that single best-guess parameter set fits. It’s cheap, it falls straight out of the MCMC samples you already have, and for well-behaved models it’s perfectly reasonable. The hidden assumption is that the posterior mean is a good summary of the posterior — that the distribution is roughly symmetric and unimodal, so its average actually represents it.

WAIC, the Watanabe-Akaike information criterion (also expanded as the “widely applicable” information criterion), doesn’t collapse anything. It works pointwise: for each individual observation, it averages the likelihood across the entire posterior, then sums the log of those averages. It uses the full posterior distribution rather than a single point from it, which makes it more principled when the posterior is skewed, heavy-tailed, or multimodal. Those are exactly the situations where “the posterior mean” stops being a faithful stand-in for the posterior.

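# DIC: fit measured at the posterior MEAN, one collapsed point
# p_DIC = mean_over_posterior( deviance(theta) ) - deviance(mean(theta))
DIC = deviance(mean(theta)) + 2 * p_DIC

# WAIC: fit measured pointwise, averaged over the FULL posterior
# p_WAIC = sum( var_over_posterior( log( likelihood(y_i | theta) ) ) )
WAIC = -2 * sum( log( mean_over_posterior( likelihood(y_i | theta) ) ) ) + 2 * p_WAIC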
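To make the schematic concrete, here is a minimal Python sketch that computes both criteria from posterior draws. It is not the code behind the post: the model is a toy (Gaussian data with known unit variance and an unknown mean), the data are simulated, and the "posterior draws" are sampled directly from the known posterior instead of coming from JAGS. The helper names (`normal_logpdf`, `log_mean_exp`, `dic_and_waic`) are made up for illustration.

```python
import math
import random
import statistics


def normal_logpdf(y, mu, sigma=1.0):
    """Log density of Normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def dic_and_waic(y, mu_draws):
    # log_lik[s][i]: log likelihood of observation i under posterior draw s
    log_lik = [[normal_logpdf(yi, mu) for yi in y] for mu in mu_draws]

    # DIC: deviance at the posterior mean, plus twice the effective parameters
    d_bar = statistics.fmean(-2 * sum(row) for row in log_lik)
    mu_hat = statistics.fmean(mu_draws)
    d_hat = -2 * sum(normal_logpdf(yi, mu_hat) for yi in y)
    p_dic = d_bar - d_hat
    dic = d_hat + 2 * p_dic

    # WAIC: pointwise log of the posterior-averaged likelihood,
    # penalised by the pointwise posterior variance of the log likelihood
    columns = list(zip(*log_lik))  # one column of draws per observation
    lppd = sum(log_mean_exp(col) for col in columns)
    p_waic = sum(sample_variance(col) for col in columns)
    waic = -2 * (lppd - p_waic)

    return {"dic": dic, "p_dic": p_dic, "waic": waic, "p_waic": p_waic}


# Simulated data and draws (assumptions for the sketch, not the rental data)
rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
y_bar = statistics.fmean(y)
# With a flat prior and known sigma = 1, the posterior of mu is Normal(y_bar, 1/sqrt(n))
mu_draws = [rng.gauss(y_bar, 1 / math.sqrt(len(y))) for _ in range(2000)]

result = dic_and_waic(y, mu_draws)
print(result)
```

With a symmetric, unimodal posterior like this one, the two criteria should land close together and `p_dic` should come out near 1, the single free parameter. That is the well-behaved case where DIC’s shortcut costs little.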

Once you see that, the cases where they disagree stop being mysterious. If your two models have similar fit but very different posterior shapes — say a Student-t model whose ν parameter has a long, skewed posterior — DIC’s collapse-to-the-mean throws away exactly the information that distinguishes them, while WAIC keeps it. So DIC can rank two models nearly tied while WAIC sees a clear gap, or, in nastier cases, they can flip outright. When they disagree, it’s usually DIC’s symmetry assumption quietly failing, and that failure is itself worth noticing — a posterior that breaks DIC is a posterior you should be looking at more closely anyway.
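The collapse-to-the-mean effect can be seen in isolation with a small Python sketch. It assumes a Student-t likelihood with fixed location 0 and scale 1, and it uses lognormal draws of ν as a hypothetical stand-in for a long, right-skewed posterior; none of these numbers come from the rental-data model. For a single observation it compares the log of the posterior-averaged likelihood (the WAIC-style quantity) with the log likelihood at the posterior mean of ν (the DIC-style quantity).

```python
import math
import random
import statistics


def student_t_logpdf(y, nu, mu=0.0, sigma=1.0):
    """Log density of a Student-t with nu degrees of freedom at y."""
    z = (y - mu) / sigma
    return (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1) / 2 * math.log1p(z * z / nu)
    )


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def collapse_gap(y, nu_draws):
    """log(mean likelihood over the draws) minus log likelihood at mean(nu)."""
    full_posterior = log_mean_exp([student_t_logpdf(y, nu) for nu in nu_draws])
    at_posterior_mean = student_t_logpdf(y, statistics.fmean(nu_draws))
    return full_posterior - at_posterior_mean


# Hypothetical skewed "posterior" for nu: lognormal with median 4 (an assumption)
rng = random.Random(1)
nu_draws = [math.exp(rng.gauss(math.log(4.0), 1.0)) for _ in range(5000)]

gap_center = collapse_gap(0.2, nu_draws)  # observation near the bulk
gap_tail = collapse_gap(6.0, nu_draws)    # observation far out in the tail

print(f"gap for central observation: {gap_center:.3f}")
print(f"gap for tail observation:    {gap_tail:.3f}")
```

In this toy setup the gap should be close to zero for the central observation and clearly positive for the tail one. The draws with small ν give the tail point far more likelihood than the single averaged ν does, and that contribution is what a point-estimate fit discards.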

In my case they agreed on the winner — the Student-t model — which was reassuring rather than interesting. What was interesting was watching DIC report a smaller margin than WAIC. The Student-t’s advantage lived partly in how it handled the tails, which is partly a statement about the shape of its predictive distribution per observation, and that’s precisely the thing DIC averages away and WAIC retains. The two numbers weren’t contradicting each other. They were measuring the same thing through different amounts of blur, and the disagreement in margin was a hint about where the models actually differed.

So my rule now is boring and practical. Compute DIC because it’s free and it’s a fine first look. Reach for WAIC — or honestly, for proper cross-validation, which WAIC is an efficient approximation of — when the decision is close, when the posteriors aren’t tidy bell shapes, or when the two criteria disagree and you need to know which to believe. And when they do disagree, don’t just take the more sophisticated one’s answer and move on. The disagreement is a flag that something about your posterior isn’t as well-behaved as DIC assumes, and that’s usually worth a look before you trust either number.¹

From a Bayesian regression on public rental-listing data for Rome. Code is schematic — the real computations come from the MCMC output.

Footnotes

¹ WAIC’s real selling point is that it’s asymptotically equivalent to leave-one-out cross-validation but doesn’t make you refit the model N times. If you have the compute and you really care, do the cross-validation directly (PSIS-LOO is the modern, efficient version with its own diagnostic for when it’s untrustworthy). Information criteria are conveniences that approximate “how well does this predict held-out data”, the question cross-validation answers by brute force.