R-hat said 1.00. The effective sample size said 380.

statistics

Bayesian

MCMC

Every convergence number on my MCMC run looked perfect, so I almost shipped it. Then I checked the one diagnostic that isn’t about convergence at all — and found my 12,000 draws were worth a few hundred.

Author

Umut Altun

Published

July 30, 2023

I had an MCMC run I was ready to call done. Three chains, R-hat sitting at 1.00 across every parameter, trace plots that looked like fuzzy caterpillars — the picture of convergence. I almost wrote up the results. Then I looked at the effective sample size, and one parameter reported 380. Out of twelve thousand draws.

That gap is the thing nobody warns you about clearly enough, so here it is plainly: R-hat and effective sample size answer two different questions, and a perfect score on one tells you almost nothing about the other.

R-hat — the Gelman-Rubin statistic — asks whether your chains have converged. It runs several chains from different starting points and checks whether they’ve forgotten where they began and settled into the same distribution. It compares the variance between chains to the variance within them; when those match, the chains are exploring the same territory and R-hat approaches 1. It’s a necessary check, and when it’s far from 1 you definitely have a problem. The trap is reading “R-hat is 1” as “everything is fine.”
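To make the between-versus-within comparison concrete, here is a minimal sketch of the classic (non-split) R-hat for a single parameter, in base R. The function name `rhat_basic` and the simulated `agree` and `disagree` matrices are illustrative assumptions, not code from the original analysis, and `gelman.diag` applies further corrections, so its output will differ slightly from this hand-rolled version.

```r
# Classic (non-split) Gelman-Rubin R-hat for one parameter.
# `draws` is a hypothetical matrix: one column per chain, one row per iteration.
rhat_basic <- function(draws) {
  n <- nrow(draws)                       # iterations per chain
  W <- mean(apply(draws, 2, var))        # average within-chain variance
  B <- n * var(colMeans(draws))          # between-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled estimate of the posterior variance
  sqrt(var_plus / W)
}

set.seed(1)
# Three chains drawing from the same distribution: R-hat should sit near 1.
agree <- matrix(rnorm(3000), ncol = 3)
# The same draws with each chain shifted to a different location: R-hat well above 1.
disagree <- sweep(agree, 2, c(-2, 0, 2), "+")

rhat_agree <- rhat_basic(agree)
rhat_disagree <- rhat_basic(disagree)
print(c(agree = rhat_agree, disagree = rhat_disagree))
```

Nothing in this calculation looks at how correlated consecutive draws are within a chain, which is why it can report 1.00 on a chain that is barely moving.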

Because converging to the right distribution and sampling it efficiently are not the same achievement. MCMC draws are correlated by construction — each step is a small move from the last, so consecutive samples look alike. Your chain can be parked exactly where it should be, faithfully wandering the posterior, while taking such tiny correlated steps that a thousand draws contain only a handful of genuinely independent pieces of information. The chain has converged. It’s just barely moving.

That’s what effective sample size measures: not “are you in the right place” but “how many independent draws is this correlated mess actually worth.” High autocorrelation means a low ratio of effective to raw draws: twelve thousand draws collapsing to a few hundred effective ones. And the precision of your posterior summaries depends on the effective count, not the raw one, because Monte Carlo error shrinks with the square root of the effective sample size. A posterior interval computed from 380 effective samples is far shakier than the twelve-thousand-row dataframe makes it look.
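A small simulation shows how autocorrelation eats into the draw count. This is a sketch under stated assumptions: the chain is a simulated AR(1) series standing in for a slow-mixing sampler, and the lag-1 autocorrelation `phi = 0.94` is an illustrative value, not an estimate from the model in this post. For an AR(1) process the effective sample size is approximately n(1 - phi)/(1 + phi), which for these numbers is about 371.

```r
# Simulate a stationary-looking AR(1) series as a stand-in for a slow-mixing chain.
# phi is the lag-1 autocorrelation; 0.94 is an illustrative assumption.
set.seed(42)
n   <- 12000
phi <- 0.94
innovations <- rnorm(n, sd = sqrt(1 - phi^2))   # scaled so the series has roughly unit variance
x <- as.numeric(stats::filter(innovations, phi, method = "recursive"))

# Closed-form approximation for AR(1): ESS = n * (1 - phi) / (1 + phi)
ess_theory <- n * (1 - phi) / (1 + phi)

# Empirical version: n / (1 + 2 * sum of autocorrelations),
# truncating the sum at the first non-positive lag (a simple, noisy rule).
rho <- acf(x, lag.max = 500, plot = FALSE)$acf[-1]   # drop lag 0
cutoff <- which(rho <= 0)[1]
if (is.na(cutoff)) cutoff <- length(rho) + 1
ess_empirical <- n / (1 + 2 * sum(rho[seq_len(cutoff - 1)]))

print(c(raw = n, theory = ess_theory, empirical = ess_empirical))
```

The truncation rule here is deliberately crude; `effectiveSize` in coda uses a spectral estimate instead, so the two will not agree exactly.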

Both numbers are one line each:

library(coda)           # assumes chains is a coda mcmc.list
gelman.diag(chains)     # R-hat: have the chains converged to the same place?
effectiveSize(chains)   # ESS: how many independent draws is that worth?
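For anyone who wants to see the two diagnostics disagree without fitting a model, the following sketch builds three simulated chains and runs both calls on them. Everything here is an assumption for illustration: the helper `sim_chain`, the AR(1) process standing in for a real sampler, and `phi = 0.94`. It needs the coda package installed.

```r
library(coda)

set.seed(7)
phi <- 0.94   # illustrative lag-1 autocorrelation, not taken from the post's model

# Hypothetical helper: one AR(1) chain started from its stationary distribution.
sim_chain <- function(n) {
  eps <- rnorm(n, sd = sqrt(1 - phi^2))
  eps[1] <- rnorm(1)   # first value drawn from the stationary N(0, 1)
  mcmc(as.numeric(stats::filter(eps, phi, method = "recursive")))
}

# Three chains of 4,000 draws each: 12,000 draws in total.
chains <- mcmc.list(sim_chain(4000), sim_chain(4000), sim_chain(4000))

# R-hat point estimate; burn-in discarding is switched off because the chains start stationary.
rhat <- gelman.diag(chains, autoburnin = FALSE, multivariate = FALSE)$psrf[1, 1]

# ESS for the single parameter, summed across chains by coda.
ess <- sum(effectiveSize(chains))

print(c(rhat = rhat, ess = ess, raw = 3 * 4000))
```

All three chains target the same distribution, so R-hat should land close to 1, while the effective sample size should come out at a small fraction of the 12,000 raw draws.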

On my model the culprit was a parameter the others all leaned on — the kind of variable that’s tangled up with several coefficients at once, so the sampler can only nudge it in lockstep with them and inches along. R-hat didn’t care; given enough iterations the chains still converged. ESS cared a lot: that parameter’s draws were so autocorrelated that my effective information about it was a fraction of what the run size implied.¹

The fix, once you’ve actually diagnosed it, is mundane — run longer, thin less (thinning throws away draws and rarely helps ESS the way people hope), or reparameterise so the sticky parameter mixes better. The point isn’t the fix. The point is that I would never have gone looking for it if I’d stopped at R-hat, because R-hat was telling me, truthfully, that nothing was wrong with convergence. The problem was somewhere R-hat doesn’t look.
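The claim about thinning can be checked with arithmetic. The sketch below assumes an idealised AR(1) chain, where keeping every k-th draw leaves n/k draws with lag-1 autocorrelation phi^k; the helper `ess_ar1` and the value `phi = 0.94` are illustrative assumptions rather than anything measured on the model in this post.

```r
# Approximate ESS of an AR(1) chain with n draws and lag-1 autocorrelation phi.
ess_ar1 <- function(n, phi) n * (1 - phi) / (1 + phi)

n    <- 12000
phi  <- 0.94                  # illustrative value
thin <- c(1, 2, 5, 10, 20)    # keep every k-th draw

# Thinning by k leaves n / k draws whose lag-1 autocorrelation is phi^k.
thinning <- data.frame(
  thin = thin,
  kept = n / thin,
  ess  = ess_ar1(n / thin, phi^thin)
)
print(thinning)
```

Under these assumptions the effective sample size only goes down as the thinning interval grows: thinning saves memory and disk, but it does not create information.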

So I report both now, always, and I treat them as a pair that can disagree. R-hat near 1 with healthy ESS: trust it. R-hat near 1 with tiny ESS: converged but underexplored, your intervals are softer than they appear, keep sampling. R-hat far from 1: don’t even look at ESS yet, you haven’t converged. The dangerous quadrant is the second one, because it’s the only failure that comes disguised as success — every surface-level number looks great, and the model quietly knows less than you think.
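The three cases can be written down as a small triage helper. This is a sketch: the function name `triage` and the default cutoffs `rhat_max = 1.01` and `ess_min = 400` are placeholder assumptions to adjust for your own work, not thresholds taken from this post.

```r
# Classify each parameter from its R-hat and ESS.
# The cutoffs are hypothetical defaults; set them to suit your own tolerance.
triage <- function(rhat, ess, rhat_max = 1.01, ess_min = 400) {
  ifelse(
    rhat > rhat_max,
    "not converged: ignore ESS for now",
    ifelse(
      ess < ess_min,
      "converged but underexplored: keep sampling",
      "converged and well sampled"
    )
  )
}

# Three made-up parameters, one per case.
verdicts <- triage(rhat = c(1.00, 1.00, 1.30), ess = c(5200, 380, 900))
print(verdicts)
```

Checking R-hat first mirrors the order in the text: an ESS computed on chains that have not converged is not worth interpreting.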

From a Bayesian regression on public rental-listing data for Rome — the same model I wrote about in trusting heavy tails. Code trimmed to the idea.

Footnotes

¹ The “tangled-up parameter” problem is usually correlation in the posterior: when two parameters trade off, the sampler has to move them together, and a sampler forced to move along a narrow diagonal ridge takes tiny steps. Reparameterising to break that correlation (centring, or the non-centred trick for hierarchical models) often does more for ESS than any amount of extra iterations. JAGS gives you less control here than Stan’s HMC, which is one honest reason to reach for Stan when mixing gets bad.