Why my LTV confidence intervals sample from two different distributions

LTV

production ML

mobile gaming

A point estimate is the wrong output for a budget decision. Building honest confidence intervals meant sampling retention from a Beta and revenue from a Gamma — because the distribution you draw from is a statement about what kind of quantity you’re modeling.

Author

Umut Altun

Published

September 19, 2023

A point estimate is the wrong output. When a UA team asks what a cohort’s twelve-month LTV will be, “1.2× ROAS” is not an answer they can use — because it matters enormously whether that means “probably 1.2×, I’d bet on it” or “somewhere between 0.7× and 1.8×, genuinely no idea.” The first is a green light. The second is a coin flip wearing a green light’s clothes. The single number hides the one thing the decision actually turns on.

So the real output is a distribution, and the honest way to get one is Monte Carlo: sample the model’s parameters many times, compute the LTV under each draw, and read the spread off the resulting pile of outcomes. Simple enough in outline. The interesting part is the question that simulation forces you to answer and a point estimate lets you dodge — what do you sample from?

Because there are two quantities feeding an LTV estimate, and they are different kinds of thing, and pretending they’re the same is how you get confidence intervals that are technically computed and quietly meaningless.

The first is retention: what fraction of the cohort is still around on day d. That’s a rate: out of N users, some stuck and some didn’t. The natural distribution for “a rate, given counts” is the Beta, which is the conjugate prior for exactly that binomial process. And it has a property I got for free and came to love: it encodes sample size automatically.

import numpy as np

# retention is a rate. Beta is the conjugate prior for "k stuck out of N".
# small cohort -> few counts -> wide Beta -> honestly uncertain
# big cohort   -> many counts -> tight Beta -> confidently precise
# EPS keeps the first shape parameter positive when retained == 0
retention = np.random.beta(retained + EPS, churned + 1, size=(N_SIMS, horizon))

A cohort of fifty users and a cohort of fifty thousand might show the same observed retention, but the Beta drawn from fifty is wide and the one drawn from fifty thousand is narrow — so the uncertainty in the final LTV widens for small cohorts on its own, without a single rule telling it to. The distribution does the calibration. That’s the whole appeal: I’m not bolting uncertainty on afterward, I’m sampling from a thing whose shape already knows how sure it should be.
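A quick way to see that effect is to draw from the two posteriors side by side. The sketch below is illustrative, not the production code: the cohort sizes come from the example above, while the 40% observed retention, the flat Beta(1, 1) prior, and the helper name `retention_interval` are assumptions made for the demo.

```python
import numpy as np

def retention_interval(retained, cohort_size, n_sims=20_000, seed=0):
    """Return the (P10, P90) band of a Beta posterior over a retention rate.

    Assumes a flat Beta(1, 1) prior, so the posterior is
    Beta(retained + 1, churned + 1). The prior is an assumption for this demo.
    """
    rng = np.random.default_rng(seed)
    churned = cohort_size - retained
    draws = rng.beta(retained + 1, churned + 1, size=n_sims)
    p10, p90 = np.percentile(draws, [10, 90])
    return p10, p90

# Same observed retention (40%), very different amounts of evidence.
small_lo, small_hi = retention_interval(retained=20, cohort_size=50)
large_lo, large_hi = retention_interval(retained=20_000, cohort_size=50_000)

small_width = small_hi - small_lo
large_width = large_hi - large_lo
print(f"50 users:     P10-P90 width = {small_width:.3f}")
print(f"50,000 users: P10-P90 width = {large_width:.4f}")
```

The posterior standard deviation shrinks roughly with the square root of the cohort size, so a thousandfold larger cohort gives a band on the order of thirty times narrower, with no explicit rule about sample size anywhere in the code.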

The second quantity is ARPDAU, average revenue per daily active user, and it is emphatically not a rate. It’s money: strictly positive, and savagely fat-tailed, because a handful of whales generate most of it while everyone else spends nothing. Sampling that from a Normal, the reflexive choice, is wrong twice over: a Normal will cheerfully hand you negative revenue, which is nonsense, and its thin tails will systematically understate the whales, who are the entire game. So ARPDAU comes from a Gamma: positive support, a right-skewed tail heavier than a Normal’s, parameterized from the revenue and the user count.

# ARPDAU is positive and fat-tailed (whales). Gamma fits; a Normal would
# emit negative revenue and amputate the tail that matters most.
arpdau = np.random.gamma(shape=rev_sum + EPS, scale=1/(users + 1), size=(N_SIMS, horizon))

revenue = retention * arpdau                         # per-user revenue, N_SIMS draws
p10, p50, p90 = np.percentile(revenue.sum(axis=1), [10, 50, 90])

Multiply the two sampled arrays, sum over the horizon, take percentiles, and you have a P10/P50/P90 that means something — the dashboards show that band, not a lone number, and the UA team reads “this cohort is probably fine but the floor is low” straight off the width.
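Put together, the whole path fits in a few lines. The version below is a self-contained sketch rather than the production code: the daily counts and revenue figures are invented for illustration, `EPS`, `N_SIMS`, and the input array names mirror the snippets above, and treating each day as an independent draw is a simplifying assumption.

```python
import numpy as np

N_SIMS = 10_000
EPS = 1e-6
rng = np.random.default_rng(42)

# Hypothetical inputs for a 5-day horizon (made-up numbers, one entry per day).
cohort_size = 1_000
retained = np.array([400, 250, 180, 140, 120])      # users still active on day d
churned = cohort_size - retained                     # users gone by day d
rev_sum = np.array([60.0, 45.0, 36.0, 30.0, 27.0])  # revenue earned on day d
users = retained                                     # active users on day d
horizon = len(retained)

# Retention: one Beta per day, broadcast to shape (N_SIMS, horizon).
retention = rng.beta(retained + EPS, churned + 1, size=(N_SIMS, horizon))

# ARPDAU: one Gamma per day, with mean close to rev_sum / users.
arpdau = rng.gamma(shape=rev_sum + EPS, scale=1 / (users + 1), size=(N_SIMS, horizon))

# Revenue per installed user on each day, summed over the horizon.
ltv = (retention * arpdau).sum(axis=1)
p10, p50, p90 = np.percentile(ltv, [10, 50, 90])
print(f"P10={p10:.3f}  P50={p50:.3f}  P90={p90:.3f}")
```

With these made-up inputs the median lands near the naive estimate of total revenue divided by cohort size, and the P10 to P90 gap is the part a point estimate would have thrown away.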

So: the distribution you sample from is a modeling statement, not a default. Reaching for a Normal because it’s the one everybody remembers is a decision about the quantity — you’re asserting it’s symmetric, unbounded, thin-tailed — and for a retention rate or a revenue figure that assertion is just false. Choosing Beta for the rate and Gamma for the money isn’t statistical fussiness; it’s matching the shape of the randomness to the shape of the thing, and the confidence interval is only honest if you do. A wrong distribution won’t throw an error — it just hands you a number that looks like an uncertainty and isn’t.
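To make that concrete, here is a small comparison under assumed numbers. A thin day with about 3 units of revenue across 40 active users (both figures invented) gets a Gamma with the same parameterization as the ARPDAU snippet and a Normal matched to the Gamma's mean and variance. This is a sketch of the failure mode, not a measurement from the real system.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims = 100_000

# Hypothetical thin day: little revenue, few users (made-up figures).
rev_sum, users = 3.0, 40
shape, scale = rev_sum, 1 / (users + 1)

gamma_draws = rng.gamma(shape=shape, scale=scale, size=n_sims)

# Normal with the same mean and variance as the Gamma above.
mean = shape * scale
std = np.sqrt(shape) * scale
normal_draws = rng.normal(loc=mean, scale=std, size=n_sims)

negative_share = (normal_draws < 0).mean()
gamma_p99 = np.percentile(gamma_draws, 99)
normal_p99 = np.percentile(normal_draws, 99)

print(f"Normal draws below zero: {negative_share:.1%}")
print(f"Gamma minimum:           {gamma_draws.min():.5f}")
print(f"P99  Gamma={gamma_p99:.3f}  Normal={normal_p99:.3f}")
```

With a shape of 3, about 4% of the moment-matched Normal's mass sits below zero, and its 99th percentile falls short of the Gamma's. The gap shrinks as the shape parameter grows, so the wrong choice bites hardest on small, thin cohorts, which are the ones where the interval matters most.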

I’ll be honest about the limit: “fat-tailed, so Gamma” is a defensible call, not a proven one. Whale spending might be better described by a lognormal or something heavier still, and I didn’t run that down as rigorously as I’d like — Gamma was principled enough, cheap to sample in bulk, and close enough that the intervals were useful. If I were hardening this I’d actually fit the tail and check. But the bones are right, and they were right because I started from “what kind of quantity is this?” instead of from whatever np.random function came to mind first.¹
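One way to run that check, sketched under assumptions: given an array of per-payer spend, fit both candidates and compare the upper quantiles they imply with the empirical ones. Everything here is hypothetical, including the function name `tail_check`, the synthetic data standing in for real spend, and the use of quick moment matching instead of a full maximum-likelihood fit.

```python
import numpy as np

def tail_check(spend, quantiles=(0.90, 0.99, 0.999), n_draws=500_000, seed=0):
    """Compare empirical upper quantiles with those of two fitted candidates.

    Gamma is fitted by matching mean and variance; lognormal by matching the
    mean and standard deviation of log(spend). Input must be strictly positive.
    """
    spend = np.asarray(spend, dtype=float)
    rng = np.random.default_rng(seed)
    q = np.array(quantiles) * 100

    # Gamma: shape = mean^2 / var, scale = var / mean.
    mean, var = spend.mean(), spend.var()
    gamma_sim = rng.gamma(shape=mean**2 / var, scale=var / mean, size=n_draws)

    # Lognormal: parameters are the mean and std of the logged data.
    logs = np.log(spend)
    lognorm_sim = rng.lognormal(mean=logs.mean(), sigma=logs.std(), size=n_draws)

    return {
        "empirical": np.percentile(spend, q),
        "gamma": np.percentile(gamma_sim, q),
        "lognormal": np.percentile(lognorm_sim, q),
    }

# Synthetic stand-in for per-payer spend (strictly positive, made-up).
data_rng = np.random.default_rng(1)
fake_spend = data_rng.lognormal(mean=1.0, sigma=1.2, size=50_000)

result = tail_check(fake_spend)
for name, values in result.items():
    print(f"{name:>9}: " + "  ".join(f"{v:9.2f}" for v in values))
```

Whichever candidate tracks the empirical row more closely at the highest quantiles is the better description of the whales. On real data the answer could go either way, which is the point of checking.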

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

1. The Beta-binomial conjugacy isn’t just aesthetic. It means the posterior over the retention rate is a Beta in closed form, so I can sample it directly without MCMC or any fitting step, which is what keeps the whole Monte Carlo path cheap enough to run over thousands of cohorts on a schedule. The right distributional choice paid for itself in compute, not just in correctness.
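For reference, the closed form in question is the standard textbook update, written here with generic prior parameters α and β. If `retained` and `churned` in the retention snippet are the raw counts k and N − k, that snippet would correspond to α = EPS and β = 1, though that reading is an assumption.

```latex
\begin{aligned}
p &\sim \mathrm{Beta}(\alpha, \beta), \qquad k \mid p \sim \mathrm{Binomial}(N, p) \\
p \mid k &\sim \mathrm{Beta}(\alpha + k,\; \beta + N - k) \\
\operatorname{Var}(p \mid k) &= \frac{ab}{(a+b)^2\,(a+b+1)}, \qquad a = \alpha + k,\; b = \beta + N - k
\end{aligned}
```

The variance falls as a + b grows with N, which is the "encodes sample size automatically" property in one line.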