I trained a neural net to predict retention, then threw it away

LTV
production ML
mobile gaming
It predicted retention curves measurably better than my two-parameter power law. I deleted it and kept the power law — because on a backtest the net won, and on the only axis that mattered it lost badly.
Author: Umut Altun

Published: March 14, 2023

At one point I had a neural network that predicted retention curves measurably better than my two-parameter power law. I deleted it, kept the power law, and I’d make the same call again.

Here’s the setup. A mobile-gaming UA team has to decide, within days of acquiring an install cohort, whether that cohort will be worth what they paid for it by month twelve. They cannot wait ninety days for the cohort to ripen: the budget decision is today, and the channel either gets more money tomorrow or it doesn’t. So you take the few days of retention you’ve actually observed, fit a curve, and extrapolate it out to the horizon. The model’s entire job is to extrapolate a tail it cannot see yet, from very little data.

I threw a neural net at this, because of course I did: more capacity, more features. On held-out cohorts it was, genuinely, a bit more accurate. That’s seductive. Then I tried to put it into production, and three things killed it.

The first was data. Per game you don’t have millions of cohorts, you have a modest number, and a hungry model overfits them happily and confidently. The second was that it was off-distribution on exactly the cases I cared about most — a brand-new game, a country with no history — and it would produce something self-assured and wrong precisely when there was no ground truth to catch it. The third killed it for good: I couldn’t explain it. The UA lead would point at a cohort and ask “why is this one predicted low?” and the honest answer was “the network said so,” which is not an answer anyone can act on.

The power law has none of that capacity, and that turns out to be the feature.

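import numpy as np
from scipy.optimize import curve_fit

def power_law(day, a, b):
    return a * day ** b          # b < 0 → decay; b is the whole story

# Illustrative inputs (made-up numbers): days since install, starting at 1
# because day 0 is undefined for b < 0, and the fraction still active each day.
days = np.array([1, 2, 3, 5, 7], dtype=float)
observed_retention = np.array([0.42, 0.30, 0.25, 0.19, 0.16])

# p0 starts the search at a plausible decaying curve instead of the default (1, 1)
(a, b), _ = curve_fit(power_law, days, observed_retention, p0=(0.5, -0.5), maxfev=1000)
# b ≈ -0.5  → a slow-burn game that holds its tail
# b ≈ -1.2  → a fast-drop game
# an analyst can read the shape of the game straight off the fit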

Two parameters, and b is the entire personality of the curve. A gentle exponent is a slow-burn title that holds its tail; a steep one is a game that bleeds users fast. a is the level, b is the shape, and both mean something a human can name. When the UA lead asks why a cohort is predicted low, the answer is “its retention exponent is steep — look, it’s dropping faster than your portfolio average,” and you can point at it on a chart. The model is arguable. You can disagree with it, which means you can trust it.
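To make the "fit a few days, extrapolate to the horizon" step concrete, here is a minimal sketch. Everything specific in it is an assumption for illustration: the cohort is synthetic (a known power law plus a little noise), the 365-day horizon is arbitrary, and the names `observed_days`, `horizon_days`, and `expected_active_days` are hypothetical. Summing the projected curve is one simple way to turn a shape into a quantity a budget can use; it is not necessarily how the production system did it.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(day, a, b):
    return a * day ** b

# Synthetic cohort, made up for illustration: a known curve plus small noise.
rng = np.random.default_rng(0)
observed_days = np.arange(1, 8, dtype=float)      # days 1..7 since install
true_a, true_b = 0.45, -0.6                       # assumed "ground truth"
observed_retention = (
    power_law(observed_days, true_a, true_b)
    + rng.normal(0.0, 0.002, observed_days.size)
)

# Fit on the observed week only.
(a, b), _ = curve_fit(
    power_law, observed_days, observed_retention, p0=(0.5, -0.5), maxfev=1000
)

# Extrapolate the fitted curve out to a hypothetical 365-day horizon.
horizon_days = np.arange(1, 366, dtype=float)
projected = power_law(horizon_days, a, b)

# Summing daily retention gives the expected number of active days per
# install over days 1..365 (day 0 excluded).
expected_active_days = projected.sum()

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"projected day-365 retention: {projected[-1]:.4f}")
print(f"expected active days per install: {expected_active_days:.1f}")
```

Multiplying `expected_active_days` by a per-active-day revenue figure would give a rough value estimate, but that revenue figure is outside what this sketch models.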

(Why a power law and not an exponential, since both decay: retention curves are power-law shaped — a brutal early drop, then a long fat tail of the most engaged users. An exponential decays too fast and throws that tail away, and the tail is exactly where the lifetime value lives. Fitting the wrong functional form is a more expensive mistake than any amount of hyperparameter tuning.)
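The gap between the two functional forms is easy to see with a toy comparison. The sketch below assumes the premise of the paragraph above: the "true" curve is generated from a power law with made-up parameters, so it says nothing about whether real retention is power-law shaped. It only shows what an exponential fitted to the same first week does to the tail. The `exponential` function and all variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(day, a, b):
    return a * day ** b

def exponential(day, a, k):
    return a * np.exp(-k * day)

# Synthetic "truth": a noise-free power-law curve with made-up parameters.
days_all = np.arange(1, 181, dtype=float)
truth = power_law(days_all, 0.45, -0.6)

# Both models only get to see the first seven days.
early = days_all <= 7
(pa, pb), _ = curve_fit(power_law, days_all[early], truth[early], p0=(0.5, -0.5))
(ea, ek), _ = curve_fit(exponential, days_all[early], truth[early], p0=(0.5, 0.1))

# Compare what each fit predicts far out in the tail.
power_d180 = power_law(180.0, pa, pb)
exp_d180 = exponential(180.0, ea, ek)

print(f"true day-180 retention:        {truth[-1]:.4f}")
print(f"power-law fit at day 180:      {power_d180:.4f}")
print(f"exponential fit at day 180:    {exp_d180:.2e}")
```

Both fits can look acceptable over the first week. The difference only shows up at the horizon, where the exponential's prediction has collapsed to effectively zero.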

In production, feeding a human decision, an interpretable model that’s accurate enough beats an opaque one that’s a bit more accurate. Almost every time. Because the prediction isn’t the product — the decision is. A UA lead who can’t interrogate a number won’t bet a budget on it, and a prediction nobody bets on doesn’t change anything, no matter how good its backtest looked. Accuracy on held-out data is one axis. “Will a smart, skeptical operator actually act on this” is a different axis, and it’s usually the one that decides whether a model earns its keep or quietly gets ignored.

The neural net wasn’t wrong. It was wrong for this. Hand me orders of magnitude more data per game and a context where nobody needs the model to explain itself, and I’d reach for it without hesitation. But “won on the backtest” answered a question nobody on the UA team was actually asking. The power law is still in production years later. The net is in a git history somewhere, a little more accurate and completely unused.[1]


From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

  1. curve_fit does nonlinear least squares under the hood, and it’s reliable here precisely because there are only two parameters and the function is well-behaved, so there’s almost nothing for the optimizer to get lost in. That stability is itself an argument for simple functional forms: a fit that converges the same way every time is one you can run unattended across thousands of cohorts without babysitting it.