A thousand simulations per cohort, and never a loop in sight

LTV
production ML
mobile gaming
Monte Carlo confidence intervals have a reputation for being too slow for production. They’re not — naive Monte Carlo is slow. The fix is to stop thinking one simulation at a time and let the array library do all thousand at once.
Author

Umut Altun

Published

June 18, 2024

The requirement: a thousand Monte Carlo draws per cohort, each draw a full horizon of retention and revenue, across thousands of cohorts, refreshed on a schedule, running on serverless functions that have a timeout and bill by the millisecond. Written the obvious way, that’s a non-starter. The reason it works anyway is a single discipline — never write the loop.

Monte Carlo carries a reputation for being the honest-but-too-slow option, the thing you’d love to use for proper confidence intervals if only you could afford it. That reputation comes entirely from the naive implementation, the one that reads like the textbook description: for each cohort, for each of a thousand simulations, for each day of the horizon, draw and accumulate. Three nested Python loops. It’s correct, it’s readable, and it is unusably slow — millions of Python-level iterations per cohort, interpreter overhead on every one, and you’ve blown the Lambda timeout on a single game.

The shift is to stop picturing a thousand simulations happening one after another, and start treating the whole batch as one object. The samplers already hand you the entire block — np.random.beta with a size of (simulations × days) returns all of it in one call — and every operation after that is array-at-a-time, executed in NumPy’s C internals instead of the Python interpreter:

# the sampler already returns the whole (N_SIMS, horizon) block at once
retention = np.random.beta(a, b, size=(N_SIMS, horizon))
arpdau    = np.random.gamma(shape, scale, size=(N_SIMS, horizon))

ltv = (retention * arpdau).sum(axis=1)          # N_SIMS lifetime values, zero loops
p10, p50, p90 = np.percentile(ltv, [10, 50, 90])

There is no for over simulations and no for over days. The element-wise multiply happens across the entire (simulations × horizon) array in one vectorized operation; the .sum(axis=1) collapses each simulation’s horizon into a single LTV, leaving a thousand of them; the percentiles read the confidence band straight off that. The thousand simulations don’t run in sequence — they run as one block of arithmetic the CPU is built to chew through. Per cohort, it drops from “blows the timeout” to comfortably sub-second, which is what lets thousands of cohorts fan out across serverless workers and finish on schedule.

The performance isn’t a vanity metric, which is the part that justifies caring. Sub-second per cohort is what makes the what-if tool feel alive — a UA lead asks “what happens to portfolio ROAS if I move budget from this channel to that one?” and the answer comes back while they’re still looking at the screen, because re-running the simulation across the affected cohorts is fast enough to be interactive. Had I left the loops in, that feature couldn’t exist; you don’t build an interactive simulator on top of a computation that takes a minute. Speed changed what the system was for, not just how fast it did the same thing.

There’s a real cost to vectorizing, and I’d be lying to pretend otherwise: the code gets harder to read. A loop with named variables tells you what it’s doing at each step; a stack of array operations asks you to carry the shapes in your head — is this (sims, days) or (days, sims), what does axis=1 collapse here — and a transposed axis is a silent bug that produces plausible, wrong numbers rather than an error. I leaned on shape comments and a fast test that diffs the vectorized output against the dead-simple looped version on a tiny input, so I’d know immediately if a refactor quietly broke the math. The loop version earns its keep as the oracle even though it never runs in production.

The rule of thumb I’ve leaned on ever since: if you’re writing a Python for loop over your data points, you’re usually leaving a one-to-two-orders-of-magnitude speedup on the table. The fix isn’t a faster language or a bigger machine — it’s expressing the computation as operations over whole arrays, so the work drops into compiled code. “Monte Carlo is too slow for production” almost always means “my Monte Carlo has a Python loop in it.” Take the loop out and the technique you wanted to use all along is suddenly affordable.1


From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

  1. Vectorization trades time for memory — a (sims × horizon) array per cohort is materialized all at once, and if you scaled simulations or horizon by 100× you’d hit memory limits before time ones. At this size it’s a non-issue and I provisioned the workers for the larger cohorts. But “vectorize everything” stops being free once the arrays get big enough to page, and then you’re back to batching — looping, but over big blocks instead of single elements.↩︎