Old cohorts are lying to you

LTV

production ML

mobile gaming

The more historical data I gave the model, the worse it predicted. That isn’t supposed to happen — unless the thing you’re modeling quietly changes underneath you, which in mobile gaming it does, constantly.

Author

Umut Altun

Published

December 5, 2023

For a while I had a bug I couldn’t explain: the more historical data I fed the model, the worse its predictions got. I’d widen the training window to include more past cohorts — more data, more signal, more better, surely — and the held-out error would creep up. I spent a day assuming I’d broken the aggregation somewhere before I accepted the data was telling me something true and uncomfortable.

To predict LTV for a young cohort, you fit its curves against a pooled template: the retention shapes and revenue trajectories of past cohorts that have already had time to mature. The instinct, drilled into all of us, is that more history makes that template more stable. More samples, tighter estimates. It’s true when the thing you’re measuring sits still.

A mobile game does not sit still. The developers are shipping into it every week — new monetization events, a balance patch, a LiveOps calendar, a reworked onboarding. A cohort from three months ago doesn’t describe the game — it describes the game as it was three months ago, which has since been patched out of existence. Pooling that cohort in with equal weight doesn’t add signal. It drags the model toward a reality that no longer exists, and the further back you reach, the more confidently wrong the template becomes. The data wasn’t lying about its own moment. It was lying about now, because I was treating a snapshot of the past as evidence about the present.
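A toy illustration of that drag, under loudly stated assumptions: the numbers, the single patch 30 days ago, and the `pooled_error` helper are all hypothetical, and the data is noise-free on purpose so only the staleness effect shows up.

```python
import numpy as np

# Hypothetical setup: one number per daily cohort (say, a day-7 retention rate).
# Assume a patch 30 days ago shifted the true value from 0.20 to 0.30.
ages = np.arange(1, 121)                     # cohort age in days, 1 = yesterday
metric = np.where(ages <= 30, 0.30, 0.20)    # post-patch vs pre-patch cohorts
current_truth = 0.30                         # what the game looks like today

def pooled_error(window_days: int) -> float:
    """Absolute error of an equal-weight pooled estimate over the last `window_days` cohorts."""
    pooled = metric[ages <= window_days].mean()
    return abs(pooled - current_truth)

# Widen the training window and watch the equal-weight estimate drift.
for window in (14, 30, 60, 120):
    print(f"window={window:>3}d  error={pooled_error(window):.3f}")
```

In this sketch the error is zero while the window stays inside the post-patch period and grows with every step past the patch, which is the same shape as the bug: more history, worse prediction.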

The assumption I’d never examined was stationarity — that the process generating the data is the same today as last quarter. For a live game it’s just false, and once you name it the fix is almost annoyingly simple: weight cohorts by how recent they are, so the present dominates and the past fades without being thrown away.

import numpy as np

lag = (cohort_date - latest_date).days       # ≤ 0: how many days into the past
weight = np.exp(lag / CUTOFF_DAYS)           # recent ≈ 1, older decays toward 0
# each extra CUTOFF_DAYS of age cuts the weight by e ≈ 2.7x
# (a 7-day-old cohort vs a 14-day-old one, if CUTOFF_DAYS were 7)
fit_weighted(curves, weights=weight)         # illustrative placeholder, not a library call

Exponential decay on age. Recent cohorts count for nearly their full weight; older ones fade smoothly toward zero. The model now learns mostly from the game as it is this week, with older cohorts contributing a faint, fast-dimming echo. Widening the window stopped hurting, because reaching further back now adds vanishing weight instead of equal weight, and the predictions snapped back in line with reality.
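A minimal runnable sketch of the same idea across many cohorts. Everything specific here is an assumption: `CUTOFF_DAYS = 14` is an arbitrary illustrative value, the toy curves are invented, and a recency-weighted mean curve stands in for whatever the real weighted fit does. It uses age (≥ 0) with `exp(-age / cutoff)`, which is equivalent to the lag form above.

```python
import numpy as np

CUTOFF_DAYS = 14.0   # assumed decay scale, illustrative only

def recency_weights(ages_days, cutoff_days: float = CUTOFF_DAYS) -> np.ndarray:
    """exp(-age / cutoff): 1.0 for a brand-new cohort, decaying toward 0 with age."""
    return np.exp(-np.asarray(ages_days, dtype=float) / cutoff_days)

def weighted_template(curves: np.ndarray, ages_days: np.ndarray) -> np.ndarray:
    """Recency-weighted mean curve; `curves` has shape (n_cohorts, n_days)."""
    w = recency_weights(ages_days)
    return np.average(curves, axis=0, weights=w)

# Toy data: assume a patch 30 days ago lifted every point of the retention curve.
ages = np.arange(1, 121)                                   # one cohort per day
base = np.array([0.40, 0.25, 0.18, 0.14])                  # hypothetical pre-patch curve
curves = np.where(ages[:, None] <= 30, base * 1.5, base)   # shape (120, 4)

# Widen the window: the weighted template barely moves once old cohorts fade.
for window in (30, 60, 120):
    keep = ages <= window
    print(window, weighted_template(curves[keep], ages[keep]).round(3))
```

In this toy, going from a 60-day to a 120-day window moves the weighted template by well under one percent, whereas an equal-weight mean over the same 120 cohorts lands much closer to the pre-patch curve than to the current one.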

The decay rate is the one real knob, and it’s a genuine tradeoff I tuned by hand rather than derived. Decay too aggressively and you’re effectively fitting on the last few days only: you overreact to every weekend blip, and a single noisy cohort can yank the whole curve. Decay too gently and you’re back to letting stale cohorts vote. I picked a rate against the portfolio’s actual patch cadence, fast enough to track real changes and slow enough to ignore noise, and adjusted it until the curves behaved. A more principled version would detect when a game actually changed and weight around those breakpoints, instead of assuming one smooth global forgetting rate for every title. I didn’t build that; smooth decay was robust, cheap, and good enough, and “good enough and predictable” kept winning over “clever and fragile” on this project.
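One way to make that knob easier to reason about while hand-tuning is to translate a candidate decay scale into a half-life and an effective number of cohorts. This is a sketch, not part of the system described: `decay_summary` is a hypothetical helper, the candidate values are arbitrary, and the Kish effective sample size is a standard approximation swapped in for illustration.

```python
import numpy as np

def decay_summary(cutoff_days: float, n_cohorts: int = 180) -> tuple[float, float]:
    """Return (half_life_days, effective_cohorts) for daily cohorts aged 1..n_cohorts."""
    ages = np.arange(1, n_cohorts + 1)
    w = np.exp(-ages / cutoff_days)
    half_life = cutoff_days * np.log(2)          # age at which a cohort's weight has halved
    effective = w.sum() ** 2 / (w ** 2).sum()    # Kish effective sample size of the weights
    return float(half_life), float(effective)

# Compare an aggressive, a moderate, and a gentle decay (all values hypothetical).
for cutoff in (3, 14, 60):
    hl, n_eff = decay_summary(cutoff)
    print(f"cutoff={cutoff:>2}d  half-life={hl:5.1f}d  effective cohorts={n_eff:5.1f}")
```

When the history is long relative to the decay scale, the effective cohort count comes out at roughly twice the scale in days. A very small value therefore means a handful of cohorts are doing all the voting, which is the overreaction regime; a very large one means cohorts from several patches ago still carry real weight.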

Now, whenever a model gets worse as I feed it more history, the first thing I check is whether I’ve assumed stationarity in a world that isn’t. We’re trained that more data is always better. In a non-stationary process it isn’t — old data isn’t just less informative, it’s actively misleading, faithful evidence about a regime that’s gone. Sometimes the most useful thing you can do for a model is teach it to forget on purpose.¹

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

¹ Why decay old cohorts toward zero instead of hard-cutting them at some age: the fade is graceful where a cutoff is a cliff. A hard window means a cohort is fully trusted one day and fully discarded the next, which makes the model lurch every time an influential cohort ages out. Exponential weighting has no edge to lurch at: yesterday’s most important cohort is today’s slightly-less-important one, and nothing ever falls off a shelf.