Two LTV models that disagree, and the rule for which to believe

LTV

production ML

mobile gaming

One model is precise when a cohort has lots of data and falls apart when it doesn’t. The other is robust on sparse data and crude when there’s plenty. The valuable code isn’t either model — it’s the dozen lines that pick between them.

Author

Umut Altun

Published

March 12, 2024

I built two LTV models that routinely disagree with each other, and the most important code in the system is the dozen lines that decide which one to trust for a given cohort.

The reason there are two comes down to a fact about the input I couldn’t engineer away: cohorts arrive at wildly different levels of maturity, and a model that’s excellent for one is bad for the other. A cohort that installed this morning has almost no data — a day or two of retention, a trickle of revenue. A cohort from two months ago has a rich, fully-shaped curve. You’re asked to predict twelve-month LTV for both, from the same pipeline, and the honest truth is that no single model is good across that whole range.

The first model — call it the AR model — works the way the retention-curve approach suggests: fit retention, fit ARPDAU, integrate their product out to the horizon. When a cohort has enough data to actually fit those curves, this is the one you want — it’s mechanically faithful to how revenue accrues, and it’s precise. Starve it of data, though, and it’s worse than useless: you cannot fit a power law to three noisy points, and if you try, it’ll hand you a confident curve fit to nothing.
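A minimal sketch of that AR idea, under assumptions the post does not spell out: retention is taken to follow a power law fit by least squares in log-log space, ARPDAU is flattened to a single average, and the integral becomes a daily sum starting at day 1. The names `fit_power_law`, `ar_ltv`, `MIN_POINTS`, and this `FitError` class are illustrative, not the production code.

```python
import math


class FitError(Exception):
    """Raised when there is too little usable data to fit a curve."""


MIN_POINTS = 5  # hypothetical guard; the real threshold is not given in the post


def fit_power_law(days, retention):
    """Least-squares fit of retention ~ a * day**(-b) in log-log space."""
    points = [
        (math.log(d), math.log(r))
        for d, r in zip(days, retention)
        if d > 0 and r > 0
    ]
    if len(points) < MIN_POINTS:
        raise FitError(f"need at least {MIN_POINTS} usable points, got {len(points)}")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise FitError("all observations fall on the same day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope  # retention(day) ~ a * day**(-b), with b = -slope


def ar_ltv(days, retention, arpdau, horizon_days=365):
    """Per-install LTV: sum of retention(day) * ARPDAU over days 1..horizon."""
    a, b = fit_power_law(days, retention)
    # Cap at 1.0 because retention cannot exceed 100%.
    return sum(min(1.0, a * day ** (-b)) * arpdau for day in range(1, horizon_days + 1))
```

The `MIN_POINTS` guard is the sketch’s stand-in for the failure mode described above: rather than returning a curve fit to three noisy points, it refuses and raises.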

The second — the coefficient model — never fits a curve. It learns historical coefficients that map “revenue accumulated through day k” onto “revenue at the horizon,” normalized across cohorts to strip out scale. It’s cruder; it can’t capture a specific cohort’s curvature. But it degrades gracefully, because it needs almost nothing to produce a sane answer. On a day-old cohort it’s the robust choice precisely because it doesn’t try to be clever.
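One way to read the coefficient idea, sketched under assumptions: “normalized” is interpreted here as revenue per install, and the coefficient for day k is a pooled ratio of horizon revenue to day-k revenue across historical cohorts. The `HistoricalCohort` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HistoricalCohort:
    installs: int
    revenue_through_day: dict  # day k -> cumulative revenue through day k
    revenue_at_horizon: float  # cumulative revenue at the prediction horizon


def learn_coefficient(history, k):
    """Pooled multiplier from per-install revenue through day k to the horizon."""
    early = 0.0
    final = 0.0
    for cohort in history:
        if cohort.installs <= 0 or k not in cohort.revenue_through_day:
            continue  # skip cohorts that cannot contribute a day-k observation
        # Per-install normalization so large cohorts do not dominate the ratio.
        early += cohort.revenue_through_day[k] / cohort.installs
        final += cohort.revenue_at_horizon / cohort.installs
    if early <= 0:
        raise ValueError(f"no usable history for day {k}")
    return final / early


def coef_ltv(revenue_through_k, installs, coefficient):
    """Per-install horizon LTV for a new cohort observed through day k."""
    return coefficient * revenue_through_k / installs
```

Nothing here is fit per cohort: a new cohort needs only its installs and its revenue so far, which is why this path still returns a number on a day-old cohort.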

So they trade off exactly against each other along the maturity axis: AR is precise-but-fragile, coefficient is robust-but-blunt. The whole problem reduces to one question asked per cohort — which regime is this cohort in? — and the answer is just how much data it’s actually given me:

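class FitError(Exception):
    """Raised when a model cannot fit the cohort's data."""

# VOLUME_CUTOFF, AR_MODEL and COEF_MODEL are defined elsewhere in the pipeline.

def choose_model(cohort):
    if cohort.avg_size_last_7d > VOLUME_CUTOFF:
        return AR_MODEL      # rich data: fit the curve, take the precision
    return COEF_MODEL        # sparse data: normalized coefficients, take the robustness

def fallback_of(model):
    return COEF_MODEL if model is AR_MODEL else AR_MODEL   # the other specialist

def predict(cohort):
    primary = choose_model(cohort)
    try:
        return primary.fit_predict(cohort)
    except FitError:
        return fallback_of(primary).fit_predict(cohort)   # never crash; degrade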
High-volume cohorts get routed to precision; sparse ones get routed to robustness. And there’s a second safety net under the first: if the chosen model fails anyway — the AR fit doesn’t converge on a cohort that looked rich but was pathological — it falls back to the other one rather than throwing. The system’s contract is that it always returns a sane prediction, so “the preferred model didn’t work” can never become “the user gets an error.” Degrade, don’t die.

Which is why I think “what’s the best model?” is usually the wrong question — asking it keeps you tuning one model forever, trying to make it cover a range it structurally can’t. The better question is “what does this input let me get away with?” Here the input variable is data volume, and the answer is two specialists plus a cheap, legible router — and the router, the least glamorous part, is what makes the whole thing work in production.

One real weakness, and a sharp reader will spot it straight away: a hard threshold means two nearly-identical cohorts that land on opposite sides of the cutoff get handled by different models and can get visibly different predictions — a discontinuity right where the two models are least sure of each other. The better design is a blend: weight the two predictions by maturity and slide smoothly from coefficient to AR as a cohort ripens, instead of flipping. I know that, and I shipped the hard switch first because it was simple, debuggable, and good enough to be useful — the discontinuity sits in a region where both models are uncertain anyway, so it’s papered over by the confidence intervals. The smooth blend is the obvious next iteration. It’s just never quite outranked the things that were more broken.¹
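For illustration, the blend described above could look something like this. It is a sketch, not what shipped: the linear ramp, the `lo` and `hi` bounds, and the generic `signal` argument (standing in for whatever maturity or volume measure drives the router) are all assumptions.

```python
def ar_weight(signal, lo, hi):
    """Linear ramp: 0 at or below lo, 1 at or above hi."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (signal - lo) / (hi - lo)))


def blended_ltv(coef_prediction, ar_prediction, signal, lo, hi):
    """Slide from the coefficient prediction to the AR one as signal grows."""
    if ar_prediction is None:  # AR fit failed: degrade to the coefficient model
        return coef_prediction
    w = ar_weight(signal, lo, hi)
    return (1.0 - w) * coef_prediction + w * ar_prediction
```

Outside the `[lo, hi]` band this is still pure selection, so the two models are only mixed in the transition zone where neither is clearly the right one. Shrinking the band toward a single point recovers the hard switch.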

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

1. This isn’t a classic ensemble, and the difference matters. An ensemble averages models to reduce variance, assuming they’re all roughly valid. Here the models aren’t both valid at once: one is appropriate and the other is actively wrong for the cohort’s data regime. Averaging a good prediction with a known-bad one just contaminates it. Selection, not averaging, is the right tool when your models have disjoint domains of competence.