Every retention curve has a kink, and one power law can’t fit it

LTV

production ML

mobile gaming

My single power-law fit was good everywhere except the first week and the long tail — the two places that actually decide a twelve-month LTV. The fix was to stop fitting through the kink and start fitting around it.

Author

Umut Altun

Published

June 20, 2023

I’d settled on a two-parameter power law for retention, and for a while I was happy with it. Then I started overlaying the fitted curve on the actual data, cohort by cohort, and noticed it was good everywhere except two places: the first week, and the long tail. Which is an awkward way of saying it was good nowhere that mattered, because those two regions are exactly what a twelve-month LTV extrapolation hangs on.

The fit was splitting the difference. It ran a little high through the brutal early drop and a little low through the slow tail, landing on a compromise curve that was wrong in both directions at once and right mainly in the middle, where I needed it least.

Once I stopped staring at the residuals and started thinking about the players, the reason was obvious. A retention curve isn’t one process, it’s two stuck together:

Days 1–8: the funnel flushing out. A lot of the install cohort was never going to stick — mis-targeted ads, curiosity installs, people who bounced off the tutorial. They churn fast and steep. This early region is governed by acquisition quality, not by the game.
Day 8 onward: the real curve. What’s left is the engaged core, and they decay slowly along a long, fat tail. This region is governed by the game, and it’s where almost all the lifetime value accumulates.

Those are two different decay regimes with two different exponents, and forcing one power law across both makes it average them — fitting through the kink between them instead of respecting it. The early steepness drags the tail estimate down; the tail’s flatness drags the early estimate up. You can’t win with one curve because there isn’t one curve.
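To make the averaging concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the exponents 0.9 and 0.3, the 60 observed days, and the use of ordinary least squares in log-log space as the fitting method. None of it comes from the production system.

```python
import math

KNEE = 8  # day where the early regime hands over to the tail (from the text)

def two_regime_retention(day, early_exp=0.9, late_exp=0.3):
    """Synthetic retention: a steep power law up to the knee, a flat one after it."""
    if day <= KNEE:
        return day ** -early_exp
    return (KNEE ** -early_exp) * (day / KNEE) ** -late_exp

def fit_power_law(days, retention):
    """Least-squares fit of log(r) = log(a) - b*log(t); returns (a, b)."""
    xs = [math.log(d) for d in days]
    ys = [math.log(r) for r in retention]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope

days = list(range(1, 61))
observed = [two_regime_retention(d) for d in days]
a, b = fit_power_law(days, observed)

# Extrapolate the single fit to twelve months and compare with the synthetic truth.
single_fit_d365 = a * 365 ** -b
true_d365 = two_regime_retention(365)
print(f"fitted exponent: {b:.3f} (synthetic regimes: 0.9 early, 0.3 late)")
print(f"day-365 retention: single fit {single_fit_d365:.4f}, synthetic truth {true_d365:.4f}")
```

On this synthetic curve the single fitted exponent lands between the two regime exponents, and the day-365 extrapolation comes out below the synthetic truth: the early steepness dragging the tail estimate down.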

So I fit piecewise — a separate power law in each regime — and made the number of pieces adapt to how much data the cohort actually has, because the failure mode at the other end is just as real: split a sparse cohort into three segments and you’re fitting noise three times instead of once.

def choose_segments(observed_days, horizon):
    """Pick (start_day, end_day) fitting segments from how much data the cohort has."""
    n = len(observed_days)
    if n < 21:
        return [(1, horizon)]                     # sparse: one fit, don't overfit
    if n < 45:
        return [(1, 8), (8, horizon)]             # split at the early/late knee
    return [(1, 8), (8, 31), (31, horizon)]       # early drop, mid, long tail

# fit a power law within each segment; if a segment's fit fails to
# converge, fall back to a single full-span fit rather than emit garbage
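The fitting step described in that last comment might look roughly like the sketch below. It is not the production code. The names `fit_power_law`, `fit_piecewise`, and `predict` are hypothetical, and the fitter is a swapped-in technique: a closed-form least-squares line in log-log space, which cannot "fail to converge" the way an iterative fitter can. Here a failure simply means a segment has too few usable points; with an iterative fitter the same fallback would be triggered by its convergence error instead.

```python
import math

def fit_power_law(days, retention):
    """Log-log least squares for r = a * t**-b; returns (a, b), or None on failure."""
    pts = [(math.log(d), math.log(r)) for d, r in zip(days, retention) if d > 0 and r > 0]
    if len(pts) < 2:
        return None  # not enough usable points to define a line
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        return None  # every point sits on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pts) / sxx
    return math.exp(mean_y - slope * mean_x), -slope

def fit_piecewise(days, retention, segments):
    """Fit one power law per (start, end) segment; any failure collapses to one full-span fit."""
    fits = []
    for start, end in segments:
        in_seg = [(d, r) for d, r in zip(days, retention) if start <= d <= end]
        params = fit_power_law([d for d, _ in in_seg], [r for _, r in in_seg])
        if params is None:
            # Degrade to the simpler model rather than emit a bad segment.
            full = fit_power_law(days, retention)
            if full is None:
                raise ValueError("not enough data for even a single fit")
            return [((segments[0][0], segments[-1][1]), full)]
        fits.append(((start, end), params))
    return fits

def predict(fits, day):
    """Evaluate the fitted curve at `day` using the first segment that contains it."""
    for (start, end), (a, b) in fits:
        if start <= day <= end:
            return a * day ** -b
    raise ValueError(f"day {day} is outside every fitted segment")

# Usage on a synthetic 30-day cohort (the exponents 0.9 and 0.3 are made up).
days = list(range(1, 31))
retention = [d ** -0.9 if d <= 8 else 8 ** -0.9 * (d / 8) ** -0.3 for d in days]
fits = fit_piecewise(days, retention, [(1, 8), (8, 365)])

# Only five observed days: the tail segment is empty, so this collapses to one fit.
sparse_fits = fit_piecewise(days[:5], retention[:5], [(1, 8), (8, 365)])
```

Adjacent segments share their boundary day, so day 8 contributes to both the early and the late fit. That keeps the two curves from drifting far apart at the knee, though nothing in this sketch forces them to meet exactly.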

A young cohort gets one honest fit. A mature one gets up to three, carving the early flush, the mid settling, and the long tail into their own curves. The kink at day 8 stops being an error the model fights and becomes a boundary the model respects.

Two things I’d flag as judgment calls rather than truths. The knee at day 8 I found by eye — overlaying fits across many games and noticing the early regime consistently gave out around there. It’s not sacred; a more principled system would detect the knee per cohort instead of hard-coding it, and a game with an unusual onboarding would want a different boundary. I used a fixed split because it was robust across the portfolio and I could reason about it, and a wrong-but-predictable boundary beats a clever knee-detector that occasionally finds a knee in noise. The other call is the fallback: a segment fit that doesn’t converge doesn’t get to emit nonsense, it collapses back to the single full-span fit. Degrade to the simpler model, never to a confident wrong one.
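For contrast, here is a rough sketch of what per-cohort knee detection could look like. This is explicitly not what was built: the function name `detect_knee`, the `min_points` guard, and the criterion (pick the breakpoint that minimizes the summed squared log-log residuals of a two-segment fit) are all assumptions chosen for illustration.

```python
import math

def _sse_loglog(points):
    """Sum of squared residuals of the least-squares line through (log t, log r) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return math.inf  # no spread in days, so no slope can be estimated
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

def detect_knee(days, retention, min_points=3):
    """Return the observed day whose two-segment split has the lowest total error."""
    pts = sorted((d, math.log(d), math.log(r)) for d, r in zip(days, retention))
    best_day, best_sse = None, math.inf
    for knee, _, _ in pts:
        # The knee day belongs to both sides, matching the shared-boundary segments.
        left = [(x, y) for d, x, y in pts if d <= knee]
        right = [(x, y) for d, x, y in pts if d >= knee]
        if len(left) < min_points or len(right) < min_points:
            continue  # too few points on one side to trust a slope
        sse = _sse_loglog(left) + _sse_loglog(right)
        if sse < best_sse:
            best_day, best_sse = knee, sse
    return best_day

# Noise-free synthetic cohort with a knee at day 8 (exponents are made up).
days = list(range(1, 41))
retention = [d ** -0.9 if d <= 8 else 8 ** -0.9 * (d / 8) ** -0.3 for d in days]
knee = detect_knee(days, retention)
```

On clean synthetic data like this the minimum is unambiguous. On a noisy or sparse cohort the error surface flattens and the chosen day can jump around, which is the "knee in noise" risk that makes a fixed boundary attractive.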

I’ve since seen the same thing well outside retention curves. When a model is wrong in a structured, repeatable way — high here, low there, every time — that’s not noise to tune away. It’s the data telling you your functional form is too simple for the process. The fix usually isn’t a fancier model. It’s looking at what’s actually generating the data, noticing it’s two things wearing a trenchcoat, and giving each one its own simple model.¹

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

1. The adaptive piece count matters more than the exact thresholds. The principle (more data earns more flexibility, less data forces more humility) is what generalizes; the specific cutoffs are just where it landed for this portfolio after staring at a lot of fits. If you lifted this wholesale onto a different retention shape without re-checking the knee, you’d deserve what you got.