Umut Altun — Writing

Designing an LLM system that thinks in Turkish

Umut Altun — Mon, 25 May 2026 00:00:00 GMT

The bug report was a screenshot of a spreadsheet where every Turkish character had been shredded — çalışan showing up as Ã§alÄ±ÅŸan. The data was correct. The numbers were right. It just looked like garbage, and to the operator who’d downloaded it, “looks broken” and “is broken” are the same thing.

That screenshot is my favourite example of something I badly underestimated: building an LLM system that works in Turkish is about five percent the model and ninety-five percent everything around it.

The model part is the easy part, which genuinely surprised me. The users ask in Turkish, the schema is in Turkish — column names, and categorical values like şube (branch) or zayi (waste) — and the answers need to come back in Turkish. I’d braced for this to be the hard problem and it mostly wasn’t: modern models handle Turkish comfortably, and I write the system prompts in Turkish too, so the whole pipeline stays in-language instead of translating in and out at the edges. That’s the one model-level decision I’d insist on — stay in Turkish end to end, never round-trip through English — because every translation hop is a place for a nuance or a proper noun to get quietly mangled.

The hard part is that the entire ecosystem around the model assumes English, and every one of those assumptions is a small landmine.

The CSV export is the perfect specimen, because it’s so dumb and it bit real users. An operator asks a question, likes the answer, clicks download, opens the file in Excel — and Excel, especially on Windows, does not assume a CSV is UTF-8. With no explicit marker it falls back to the system locale’s encoding, reads the multi-byte Turkish characters as if each byte were its own character, and produces the shredded mess from the screenshot. The file was always valid UTF-8. Excel just refused to believe it.

The fix is three bytes:

# Excel on Windows won't assume UTF-8 without a byte-order mark, so it
# misreads Turkish characters (çalışan -> Ã§alÄ±ÅŸan). Prepending a BOM
# states the encoding explicitly, and the export opens correctly.
csv_bytes = "\ufeff".encode("utf-8") + csv_text.encode("utf-8")

A byte-order mark at the front of the file. That’s the whole fix. It isn’t clever and it isn’t interesting, and finding it took an absurdly long time — because every tool I used (a Mac, a terminal, a reasonable editor) rendered the file perfectly. The bug only existed in the one place I wasn’t testing: a Turkish operator’s Windows laptop, opening Excel. The lesson isn’t about BOMs. It’s that localization bugs live in the gap between your environment and your user’s, and if you only ever look at your own screen, you will never once see them.
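A minimal sketch of the same fix using only Python's standard library, assuming the export is built from rows held in memory. The function name `to_excel_csv` and the sample rows are hypothetical; `utf-8-sig` is the standard codec that prepends the BOM. The last line reproduces the shredding from the screenshot by decoding UTF-8 bytes as Windows-1252, which is one plausible locale fallback and not a claim about the operator's exact machine.

```python
import codecs
import csv
import io


def to_excel_csv(rows):
    """Serialise rows to CSV bytes that Excel on Windows will read as UTF-8."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    # "utf-8-sig" writes the 3-byte BOM (EF BB BF) before the payload.
    return buffer.getvalue().encode("utf-8-sig")


rows = [["şube", "çalışan"], ["Alsancak", "12"]]  # hypothetical sample data
csv_bytes = to_excel_csv(rows)

has_bom = csv_bytes.startswith(codecs.BOM_UTF8)
round_trip = csv_bytes.decode("utf-8-sig")  # the codec strips the BOM again

# What a reader shows when it treats each UTF-8 byte as a Windows-1252 character.
mojibake = "çalışan".encode("utf-8").decode("cp1252")
```

A check like `has_bom` is cheap to put in a unit test, which is about the only way this class of bug gets caught on a machine that renders UTF-8 correctly anyway.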

This is what I mean by ninety-five percent everything-else. “Thinks in Turkish” isn’t a capability you switch on at the model. It’s a property you have to carry through every layer: the prompts, the schema, the model’s output, the chart labels, the number and date formatting, the encoding on the file going out the door. Miss one layer and the whole thing feels broken even when the intelligence underneath is flawless — because users don’t grade your model, they grade the spreadsheet that opened wrong.

And the defaults fight you the whole way. The libraries default to English, the encodings default to whatever was convenient in California, every tutorial’s examples are pure ASCII. Building in a non-English language means noticing and overriding a long tail of these, each one individually trivial and collectively the actual job. I’ve started thinking of it as a tax you pay for operating outside the language the tools were designed for — invisible right up until a screenshot of Ã§alÄ±ÅŸan lands in your inbox.

If I were advising someone starting a non-English LLM product: budget for the tax, and test in your users’ environment from day one, not your own. The model will speak their language fine. It’s the plumbing around it that defaults to English, and the plumbing is where the experience is won or lost.¹

From a consulting project building natural-language analytics for restaurant businesses. Customer details and schema are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Turkish has a genuinely nasty trap for the kind of string matching I rely on elsewhere: the dotless ı and the dotted i are different letters, so under Turkish casing rules the lowercase of I is ı, not i, and the uppercase of i is İ. Python’s str.lower() is not locale-aware: "I".lower() is always "i", which is right for English and wrong for Turkish. Case-insensitive comparison that’s correct in English silently does the wrong thing on Turkish text. Anywhere you canonicalize or match strings, the casing rules behind your lower() call quietly decide whether your matching works.↩︎
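A small sketch of Turkish-aware casing in plain Python, assuming the text is known to be Turkish. The helper names `turkish_lower` and `turkish_upper` are hypothetical. They only handle the two i pairs and leave everything else to the built-in methods, so treat them as an illustration, not a full locale implementation.

```python
def turkish_lower(text):
    """Lowercase with Turkish rules: I -> ı and İ -> i."""
    # Map the Turkish-specific pairs first, then let str.lower() do the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text):
    """Uppercase with Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


default_lower = "ISPARTA".lower()        # locale-independent: dotted i
turkish = turkish_lower("ISPARTA")       # dotless ı, as a Turkish reader expects
dotted = turkish_lower("İZMİR")
back_up = turkish_upper("izmir")
```

Whichever rule is picked, the important part is that both sides of a comparison go through the same function.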

The eight silent seconds that made my app feel broken

Umut Altun — Mon, 20 Apr 2026 00:00:00 GMT

I watched three different people decide my app was broken. It wasn’t. It was just quiet.

Here’s what happened. A user types a question, and behind the scenes the agent does real work: it routes the question to the right specialist, writes SQL, runs it against the warehouse, then summarises the result into a sentence and a chart. End to end, that’s around eight seconds — a couple of LLM calls and a database query, none of it unreasonable. And the answer that comes back is correct.

But for those eight seconds, the screen showed nothing. A spinner, maybe, doing its little spin. And what I saw, over and over, was the user wait about four seconds, conclude it had hung, and refresh the page — which of course threw away the in-flight request and started the whole thing over, making it worse. They weren’t wrong to do it. Every signal the interface gave them said “nothing is happening here.” Silence reads as failure.

My first instinct was to make it faster. That’s the engineer’s reflex: latency too high, drive it down. I spent a little while there before admitting that eight seconds for two LLM calls and a query is already roughly where it’s going to be, and shaving it to six wouldn’t change anything — six silent seconds also reads as broken. I was optimising the wrong quantity. The problem was never the latency. It was the silence. Users don’t actually mind waiting if they can see that something is happening and that it’s happening for them.

So I stopped trying to make it faster and started making it legible. The pipeline already moves through distinct stages — routing, writing the query, running it, summarising — so instead of hiding those behind one spinner, I streamed them to the front end as they happened, over server-sent events:

routing your question…
writing the query…
running it against your data…
summarising the result…

Same eight seconds. Completely different experience. Now the wait has a narrative: the user watches the system think, each line proof that work is being done on their behalf, and nobody refreshes anymore because nothing ever looks stuck. I changed the perceived latency without touching the actual latency by a millisecond. For an interactive tool, perceived latency is the one that pays rent.
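A minimal sketch of what emitting those stages as server-sent events can look like, assuming a generator-based backend. The names `sse_event` and `stream_answer`, the event names `progress` and `answer`, and the no-op stage functions are all hypothetical stand-ins; only the wire format (an optional `event:` line, a `data:` line, then a blank line) comes from the SSE format itself.

```python
import json


def sse_event(data, event=None):
    """Format one server-sent event as text."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data, ensure_ascii=False)}")
    return "\n".join(lines) + "\n\n"  # the blank line terminates the event


def stream_answer(stages, final_answer):
    """Yield a progress event as each stage starts, then the answer itself."""
    for label, work in stages:
        yield sse_event({"message": label}, event="progress")
        work()  # the real routing / SQL / summarising call would run here
    yield sse_event(final_answer, event="answer")


# Hypothetical stand-ins for the real pipeline stages.
stages = [
    ("routing your question…", lambda: None),
    ("writing the query…", lambda: None),
    ("running it against your data…", lambda: None),
    ("summarising the result…", lambda: None),
]
events = list(stream_answer(stages, {"text": "ok"}))
```

The progress event is yielded before the stage's work runs, so the user sees the label while the slow part is still in flight.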

And then it didn’t work, which is the part worth writing down. I wired up the events, the backend emitted them in order, and the front end received… all of them at once, in a single clump, right at the end — exactly the silence I’d built this to kill, now with extra machinery. The backend was streaming correctly. Something between the backend and the browser was holding the whole response back and delivering it in one piece.

That something was the reverse proxy. By default it buffers responses — a perfectly sensible optimisation for normal request/response traffic, and the precise opposite of what streaming needs. It was collecting my carefully-streamed events into a buffer and flushing them together, defeating the entire point. The fix is two lines, and finding them took an embarrassing fraction of an afternoon:

location /api/ {
    proxy_buffering off;            # stream chunks through, don't accumulate
    add_header X-Accel-Buffering no; # belt and suspenders: tells any nginx layer in front of this one not to buffer either
}

That’s it. That’s the difference between a progress stream and a spinner that sits dead for eight seconds and then dumps everything at once.
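The same opt-out can also come from the application: nginx honours an `X-Accel-Buffering: no` header on the upstream response, which keeps the setting scoped to the streaming endpoint. Below is a sketch as a bare WSGI app, assuming the proxy is nginx (the config syntax above suggests so) and using only the standard library. `progress_app`, `STAGES` and `fake_start_response` are hypothetical names for illustration.

```python
SSE_HEADERS = [
    ("Content-Type", "text/event-stream; charset=utf-8"),
    ("Cache-Control", "no-cache"),
    ("X-Accel-Buffering", "no"),  # asks nginx not to buffer this one response
]

STAGES = ["routing", "writing the query", "running it", "summarising"]


def progress_app(environ, start_response):
    """WSGI app that sends one SSE chunk per stage instead of one big body."""
    start_response("200 OK", SSE_HEADERS)
    for stage in STAGES:
        yield f"data: {stage}\n\n".encode("utf-8")


# Drive the app by hand to see what would go over the wire.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


chunks = list(progress_app({}, fake_start_response))
```

Each yielded chunk is a complete event, so once nothing in between buffers, the browser receives them one at a time.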

I keep that snippet around partly because I’ll need it again and partly as a reminder of what “productionising an LLM app” actually consists of. The prompt engineering and the model choice get the attention, but a real share of the work that decides whether people keep using the thing is unglamorous plumbing exactly like this: streaming, buffering, timeouts, the difference between correct and correct-and-legible. None of it shows up in a demo, because a demo is one person who already knows it works and is willing to wait. Production is a stranger who assumes it’s broken the moment it goes quiet.¹

So now, when something feels slow, I ask whether the problem is the duration or the silence before I spend a week chasing the duration. Often the cheapest fix isn’t making the work faster — it’s making the work visible.

From a consulting project building natural-language analytics for restaurant businesses. Customer details and infrastructure specifics are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

I used server-sent events rather than WebSockets on purpose: the data only flows one way (server to client), SSE is just HTTP so it sails through proxies and load balancers with far less ceremony, and it reconnects on its own. WebSockets would have been a heavier answer to a one-directional question.↩︎

Letting the LLM fix its own SQL — but only twice

Umut Altun — Sun, 08 Mar 2026 00:00:00 GMT

The first time the agent fixed its own broken SQL, I was delighted. The query had failed on a type mismatch, I’d fed the error back to the model, and it came back with a corrected query that just worked. Self-healing! The third time it tried to “fix” the same query — producing a third distinct, confidently wrong variation — I realized I’d built an elegant way to set tokens on fire.

Back up. Even after grounding the model in real values and canonicalizing its filters, generated SQL still fails sometimes. A subtle type mismatch, a function used slightly wrong, an aggregation that doesn’t quite parse. The naïve response is to surface the database error to the user — but my users are non-technical restaurant operators, and Cannot GROUP BY an aggregate of type FLOAT64 means precisely nothing to them. A stack trace is not an answer.

The good response is self-correction. When a query fails, don’t give up — hand the model back its own query and the exact error the database returned, and ask it to fix it. This works far better than you’d expect, because the error message is genuinely informative: the model wrote bad SQL not knowing the column was a string, the database says so plainly, and the model corrects it. Most failures resolve on the first retry.

The trap is the word “retry,” with no number attached.

Because some queries don’t get fixed. The question is genuinely ambiguous, or the data can’t answer it, and the model doesn’t know that — so it keeps producing new wrong queries, each one different, each one looking like progress. That’s the insidious part: an unbounded correction loop doesn’t sit there obviously stuck. It looks busy. It’s generating, executing, failing, generating again, and every cycle costs another LLM call and another few seconds while the user stares at a spinner and the token meter ticks up. Left alone, it’s a loop that mistakes motion for progress and bills you for the privilege.

So the correction loop is bounded — a deliberately small number of attempts — and the bound is the whole point:

def run_with_correction(question, max_attempts=...):  # deliberately low
    sql = generate(question)
    for attempt in range(max_attempts):
        if estimate_cost(sql) > COST_CEILING:        # dry-run BEFORE spending anything
            return clarify("that looks very broad — can you narrow it down?")
        ok, result, error = execute(sql)
        if ok:
            return result
        if attempt + 1 < max_attempts:                # don't pay for a rewrite that will never run
            sql = regenerate(question, failed_sql=sql, error=error)  # feed the error back
    return clarify("I couldn't turn that into a query I trust — did you mean X or Y?")

Two guards, doing different jobs. The loop bound caps how many times the model is allowed to be wrong before the system stops and asks the user a clarifying question instead of spinning. And estimate_cost is a dry run — the warehouse will tell you how many bytes a query would process without executing it — so a pathological query that would scan an entire dataset gets caught and refused before it runs up a bill, not after. One guard bounds the model’s stubbornness; the other bounds its appetite.
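To make the two guards concrete, here is a runnable sketch with the model and the warehouse replaced by fakes. Everything prefixed `fake_` is hypothetical, as are the cost numbers and the choice of `max_attempts=2` for the demo; the real constants are not given here. The control flow is the part that matters.

```python
def run_with_correction(question, *, generate, regenerate, execute,
                        estimate_cost, cost_ceiling, max_attempts):
    """Bounded self-correction: returns ("answer", rows) or ("clarify", message)."""
    sql = generate(question)
    for attempt in range(max_attempts):
        if estimate_cost(sql) > cost_ceiling:      # dry run before spending anything
            return ("clarify", "that looks very broad; can you narrow it down?")
        ok, result, error = execute(sql)
        if ok:
            return ("answer", result)
        if attempt + 1 < max_attempts:             # only rewrite if we will run it
            sql = regenerate(question, failed_sql=sql, error=error)
    return ("clarify", "I couldn't turn that into a query I trust.")


def fake_generate(question):
    return "SELECT SUM(amount) FROM sales GROUP BY SUM(amount)"


def fake_regenerate(question, failed_sql, error):
    return "SELECT branch, SUM(amount) FROM sales GROUP BY branch"


def fake_execute(sql):
    if "GROUP BY branch" in sql:
        return True, [("Alsancak", 1200)], None
    return False, None, "cannot GROUP BY an aggregate"


executed = []


def never_works(sql):
    executed.append(sql)
    return False, None, "syntax error"


common = dict(generate=fake_generate, regenerate=fake_regenerate,
              cost_ceiling=1_000, max_attempts=2)

healed = run_with_correction("sales by branch", execute=fake_execute,
                             estimate_cost=lambda sql: 10, **common)
stuck = run_with_correction("how am I doing?", execute=never_works,
                            estimate_cost=lambda sql: 10, **common)
too_broad = run_with_correction("everything, ever", execute=fake_execute,
                                estimate_cost=lambda sql: 10**12, **common)
```

The `stuck` case executes exactly `max_attempts` queries and then asks the user, no matter how confident each rewrite looked.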

Picking the bound was, like most of these constants, empirical — I watched what the retries actually did. What I found: if a query isn’t fixed within the first attempt or two, it’s almost never fixed by attempt five either — those later attempts are the model thrashing, not converging. So the ceiling is low, and crossing it isn’t treated as failure, it’s treated as a signal: this question can’t be answered as asked, so stop guessing and ask the human. Falling back to a clarifying question is a much better experience than either a stack trace or a four-second silence that ends in one anyway.

It holds for any agent with a feedback loop, not just this one: giving a model the right to correct itself is powerful, but “fix yourself” without “…up to N times” is an unbounded loop wearing a helpful smile. Any time you let an LLM react to its own output — retries, critiques, multi-step plans that revise themselves — you need a hard stop that doesn’t depend on the model deciding it’s done. The model is not a reliable judge of whether it’s making progress; that’s exactly the faculty that’s missing. So the bound lives in the harness, in plain Python, where it can’t be talked out of stopping.¹

The honest open question: the right fallback when you hit the bound is probably smarter than “ask the user.” Sometimes the model’s second attempt was closer than its third, and I throw that away. A better system would keep the best partial result and offer it with a caveat, rather than discarding the whole loop. I haven’t built that yet — it’s on the list, somewhere below the things that were actually on fire.

From a consulting project building natural-language analytics for restaurant businesses. Customer details, schema, and constants are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Why feeding the error back works at all: the failure message is a high-signal, perfectly-targeted hint. The model didn’t write bad SQL out of stupidity; it wrote it missing one fact (a type, a column’s real name), and the database’s error supplies exactly that fact. It’s the cheapest in-context correction signal you’ll ever get, available for the cost of catching an exception.↩︎

Teaching an analytics agent when not to ask

Umut Altun — Mon, 12 Jan 2026 00:00:00 GMT

“How am I doing this week?” is a real question that real users ask, and it’s missing almost everything you’d need to answer it. Which metric — revenue, footfall, margin? Which location? Compared to what? The naïve agent answers anyway: it silently picks a metric and a scope and hands back a confident chart of something the user didn’t quite ask for. A guess dressed up as an answer.

So the first fix is obvious and correct: before writing SQL, check whether the question is actually answerable, and if it isn’t, ask. Is there a time range? A scope? A metric? If something essential is missing, the agent says “over what period — this week, this month?” instead of inventing one. The router could already refuse when it wasn’t sure which domain a question belonged to; this is the same instinct one level down — refuse to proceed on missing parameters rather than fabricate them.

I shipped that, felt good about it, and watched it become annoying within a day.

Because the completeness check, applied naïvely, asks every time. The user types “show me last 30 days of sales,” gets their answer, and then types “what about by branch?” — and the agent, evaluating that second question in isolation, sees no time range and asks “over what period?” The user already said. Thirty seconds ago. They said it. Now they’re answering the same question again, and the magic of “just ask in plain language” has curdled into a form with a chatbot’s manners. An agent that re-interrogates you on every turn isn’t careful, it’s exhausting, and exhausting tools get abandoned no matter how correct they are.

The real requirement, it turned out, wasn’t “ask when something’s missing.” It was “ask when something’s missing and can’t be recovered from what was already said.” That’s a much narrower trigger, and it’s the difference between an assistant and a bureaucrat.

So the agent carries a short conversational memory — the parameters from recent turns — and a missing parameter is inherited from context before it’s ever treated as missing:

# session-scoped, bounded: LRU over active sessions, TTL eviction
state.recent = [...]          # parameters resolved on previous turns

def resolve(question):
    params = extract(question)
    for slot in REQUIRED:                 # time range, scope, metric, ...
        if slot not in params:
            inherited = state.recent_value(slot)      # inherit before asking
            if inherited is not None:                 # nothing to inherit: leave the slot missing
                params[slot] = inherited
    missing = [s for s in REQUIRED if s not in params]
    return ask_user(missing) if missing else params

“What about by branch?” now inherits the 30-day window from the turn before and just answers. The agent only stops to ask when a slot is genuinely unrecoverable — when you’ve changed the subject, or when you never specified it in the first place and context offers no hint. Knowing when not to ask turned out to be as much of the design work as knowing when to.

The piece I’d flag for anyone building this: keep that state bounded, and treat that as a feature, not a limitation. The agents are otherwise stateless, which is exactly what lets them scale horizontally — any instance can handle any request. The moment you bolt on unbounded conversational memory, you’ve quietly introduced a coordination problem and a slow memory leak. So the session memory is an LRU with a TTL: it remembers enough recent turns to not be annoying, and forgets aggressively enough to stay cheap and stateless-ish. Bounded memory is the compromise between “useful in a conversation” and “doesn’t become a stateful service I have to operate.”¹
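A sketch of what an LRU-with-TTL session memory can look like in plain Python, assuming it lives in process memory on one instance. The class name `SessionMemory`, its method names, and the limits in the demo are hypothetical; the clock is injectable only so the expiry can be exercised without sleeping.

```python
import time
from collections import OrderedDict


class SessionMemory:
    """Bounded per-session parameter store: LRU over sessions, TTL per session."""

    def __init__(self, max_sessions, ttl_seconds, clock=time.monotonic):
        self._max = max_sessions
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions = OrderedDict()  # session_id -> (last_written, params)

    def _live_params(self, session_id):
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        last_written, params = entry
        if self._clock() - last_written > self._ttl:
            del self._sessions[session_id]  # idle too long: forget it
            return None
        return params

    def remember(self, session_id, params):
        merged = {**(self._live_params(session_id) or {}), **params}
        self._sessions[session_id] = (self._clock(), merged)
        self._sessions.move_to_end(session_id)
        while len(self._sessions) > self._max:
            self._sessions.popitem(last=False)  # evict least recently used

    def recall(self, session_id, slot):
        params = self._live_params(session_id)
        if params is None:
            return None
        self._sessions.move_to_end(session_id)  # reading counts as use
        return params.get(slot)


now = [0.0]  # fake clock for the demo
memory = SessionMemory(max_sessions=2, ttl_seconds=600, clock=lambda: now[0])
memory.remember("a", {"time_range": "last 30 days"})
memory.remember("b", {"metric": "sales"})
inherited = memory.recall("a", "time_range")   # touches "a", so "b" is now oldest
memory.remember("c", {"metric": "waste"})      # over capacity: evicts "b"
evicted = memory.recall("b", "metric")
still_there = memory.recall("a", "time_range")
now[0] = 601.0                                 # past the TTL for "a"
expired = memory.recall("a", "time_range")
```

In this sketch the TTL runs from the last write, not the last read, which is one of several reasonable choices.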

It’s not perfect, and the failure mode is instructive. Occasionally it inherits a parameter the user did mean to drop — they’ve moved on to a new question, but it’s phrased similarly enough that the old time range carries over silently. That’s the mirror image of the original bug: now it assumes by over-remembering instead of by guessing from nothing. The honest fix is to make the agent surface what it inherited — “for the same 30-day window:” as a quiet prefix on the answer — so a wrong inheritance is visible and correctable rather than silent. Visible-and-correctable beats silent — that keeps coming up across this whole project, and probably deserves its own post.
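One way to make inheritance visible is to have the resolver report which slots it carried over, so the answer can say so. This is a sketch of that idea and not the shipped code: the slot names, `resolve_with_provenance`, `inheritance_prefix`, and the wording of the prefix are all hypothetical.

```python
REQUIRED = ("time_range", "scope", "metric")  # hypothetical slot names


def resolve_with_provenance(explicit, recent):
    """Fill missing slots from recent turns and record which ones were inherited."""
    params, inherited, missing = dict(explicit), [], []
    for slot in REQUIRED:
        if slot in params:
            continue
        if recent.get(slot) is not None:
            params[slot] = recent[slot]
            inherited.append(slot)
        else:
            missing.append(slot)
    return params, inherited, missing


def inheritance_prefix(params, inherited):
    """Short note for the top of the answer; empty when nothing was inherited."""
    if not inherited:
        return ""
    carried = ", ".join(f"{slot}: {params[slot]}" for slot in inherited)
    return f"(carried over from earlier: {carried})"


params, inherited, missing = resolve_with_provenance(
    {"scope": "by branch"},
    {"time_range": "last 30 days", "metric": "sales"},
)
prefix = inheritance_prefix(params, inherited)
```

If the carried-over window is wrong, the user sees it in the first line of the answer and can correct it in one turn.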

What I keep relearning: the impressive-sounding behaviour — an agent that asks smart clarifying questions — is the easy 80%. The last 20% — knowing when to not exercise the impressive behaviour — is what separates something people demo from something people use.

From a consulting project building natural-language analytics for restaurant businesses. Customer details, schema, and constants are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Why LRU + TTL specifically: active sessions stay warm, idle ones evict themselves, and there’s a hard ceiling on how much conversational state exists at once. If I ever needed sessions to survive across instances I’d reach for an external store, but that’s a real operational cost and I didn’t have the problem — most conversations are short and bursty.↩︎

Two tenants don’t need multi-tenancy

Umut Altun — Sun, 09 Nov 2025 00:00:00 GMT

The second customer changed the shape of the problem.

With one customer, a Text2SQL agent over their data is just an app. With two — and a third in the pipeline — it’s a multi-tenant system, and the one truly non-negotiable requirement is that no customer ever sees another’s numbers. A wrong chart is embarrassing. One restaurant seeing another’s revenue is the end of the contract.

So I did the responsible thing and sketched the proper multi-tenant architecture. One shared schema, a tenant_id on every table, row-level security, every query scoped by tenant. It’s the textbook answer, it’s efficient, and it scales to thousands of tenants. I had it on the whiteboard before I noticed I was about to make a serious mistake.

Here’s what I’d glossed over. In this system the queries aren’t written by me. They’re written by an LLM, at runtime, from a user’s plain-language question. The shared-schema design makes cross-tenant isolation a property of every generated query getting its WHERE tenant_id = … exactly right. And I’d just spent a week teaching that same model not to invent branch names. I was now proposing to make data isolation — the one thing I could not afford to get wrong — depend on the model’s discipline on every query, forever. One dropped predicate, one creative join, and tenant A is reading tenant B’s books.

You can defend against that, of course. You wrap generation in a layer that force-injects the tenant filter, you audit, you write tests. But now the most important guarantee in the whole system lives in a predicate I have to enforce correctly on every single LLM-written query for the life of the product. That’s an enormous surface area for something whose spec is “must never happen.”

So I didn’t build it. For two tenants — for the realistic near future of a handful — I went the other way: physical isolation. Each tenant gets its own BigQuery dataset, its own auth, and its own deployment of the same codebase, with the configuration swapped per tenant.

import os

# one codebase, per-tenant config: isolation lives in the deployment,
# not in a WHERE clause the model has to remember every time
TENANT = load_config(os.environ["TENANT_ID"])
warehouse = Warehouse(project=TENANT.project, credentials=TENANT.creds)

# this instance's credentials can only reach this tenant's dataset.
# there is no cross-tenant query for the LLM to accidentally write —
# the data it shouldn't see isn't reachable from where it's running.

The difference is where the guarantee lives. In the shared-schema design, isolation is something the application has to do correctly on every request. In the per-tenant design, isolation is something the infrastructure makes impossible to violate. The model can write the worst query it likes; the credentials it runs under physically cannot reach another tenant’s data. I moved the guarantee from “we’ll get the filter right every time” to “there is nothing to get wrong.”
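A sketch of per-deployment configuration under one assumption worth stating loudly: each deployment only ever receives its own tenant's settings, here through environment variables. The variable names, the `TenantConfig` fields, and this version of `load_config` (which reads a mapping, not a tenant id) are hypothetical and differ from the snippet above.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; each deployment sets only its own tenant's values.
REQUIRED_VARS = ("TENANT_ID", "TENANT_PROJECT", "TENANT_DATASET", "TENANT_CREDENTIALS")


@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    project: str
    dataset: str
    credentials_path: str


def load_config(env=None):
    """Build the tenant config, failing at startup if anything is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("refusing to start, missing: " + ", ".join(missing))
    return TenantConfig(
        tenant_id=env["TENANT_ID"],
        project=env["TENANT_PROJECT"],
        dataset=env["TENANT_DATASET"],
        credentials_path=env["TENANT_CREDENTIALS"],
    )


example_env = {
    "TENANT_ID": "tenant-a",
    "TENANT_PROJECT": "analytics-tenant-a",
    "TENANT_DATASET": "restaurant_ops",
    "TENANT_CREDENTIALS": "/secrets/tenant-a.json",
}
config = load_config(example_env)

try:
    load_config({"TENANT_ID": "tenant-a"})
    startup_error = None
except RuntimeError as exc:
    startup_error = str(exc)
```

Failing at startup on a missing setting means a half-configured instance never gets as far as running a query.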

The trade-off is real, and it cuts the other way at scale. Physical isolation does not scale to thousands of tenants — past some point, standing up a deployment and a dataset per customer is the bottleneck, and the shared-schema model with its single efficient footprint wins decisively. It also costs more per tenant up front: a couple of managed containers and a dataset, tens of dollars a month each, multiplied out. Shared-schema amortizes all of that away.

But the failure modes are asymmetric, and that’s what settles it at my scale. The shared-schema model’s failure mode is cross-tenant leakage — the catastrophic one — and it sits one bug away at all times. The per-tenant model’s failure mode is “this got expensive and operationally annoying somewhere around N tenants” — a problem you watch approach for quarters and migrate against on your own schedule. One failure ends a contract; the other shows up on a cost dashboard. When the downsides are that lopsided, you don’t pick the architecture that’s elegant at a scale you don’t have. You pick the one whose worst case you can actually live with.

The mistake I almost made wasn’t technical. It was reaching for the architecture I’d be proud to have built — the one that handles thousands of tenants, the one that looks right on a whiteboard — instead of the one the problem in front of me actually needed. Two tenants don’t need multi-tenancy. They need to not see each other’s data, which is a different and much smaller requirement, and the smaller requirement has a much safer answer.

If this grows to fifty tenants I’ll revisit it, and shared-schema will probably win — by then the operational cost is real and the discipline to enforce tenant scoping properly is worth building. Recognizing which point on that curve you’re actually standing on, and resisting the pull toward the architecture that’s correct one curve over, is most of the job. I don’t always get it right. I got it right this time mostly because putting an LLM in the loop made the elegant option visibly scary.¹

From a consulting project building natural-language analytics for restaurant businesses. Customer details, infrastructure specifics, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

This isn’t an argument against row-level security in general — it’s excellent when humans or trusted application code write the queries. The specific thing that spooked me was putting an LLM-generated WHERE clause on the critical path of data isolation. Different threat model, different answer.↩︎

The obvious fix for a hallucinating SQL agent is the wrong one

Umut Altun — Wed, 15 Oct 2025 00:00:00 GMT

A few months into building a Text2SQL agent — natural language in, SQL out, for non-technical restaurant operators — I noticed the model kept inventing branch names.

It would write WHERE branch = 'Alsanck'. Close to a real branch, but not it: a dropped letter. The query didn’t error — it’s valid SQL, valid column, just no such value. It returned zero rows. And the operator on the other end, who couldn’t read SQL and had no reason to distrust the answer, saw zero sales and concluded one of their locations had flatlined.

That’s the kind of bug that scares me — the one that doesn’t crash but quietly hands a confident, wrong answer to someone who can’t tell it’s wrong.

The fix looks obvious. The model wrote 'Alsanck', the real value is 'Alsancak', the edit distance is one. Just snap it to the nearest real value before running the query. I wrote exactly that, felt clever about it, and it was a while before I realised that “snap to the nearest real value” was about the most dangerous thing I could have done.

Two ways to be wrong, and they don’t cost the same

Here’s what I missed at first. A bad filter value can fail in two directions, and the directions are not symmetric.

Under-correct. The typo slips through, the query returns zero rows, the user sees an empty result. Annoying — but visible. The user knows something’s off and rephrases. Recoverable.

Over-correct. The agent rewrites 'Alsanck' not into the branch the user meant, but into a different real branch that happens to be the closest string match. Now the user gets a complete, correct-looking report for the wrong location. They trust it. They order stock against it. This failure is invisible, and no rephrase recovers it, because nothing ever looked broken.

Framed that way the asymmetry is obvious: a visible miss is cheap, a silent swap is catastrophic. “Always correct to the nearest value” optimizes for the cheap failure and walks straight into the expensive one. (If you’re already nodding, this was probably obvious to you. It wasn’t to me — and I’d already shipped the eager version.)

So the rule I actually wanted wasn’t “fix typos.” It was: fix a typo only when you’re sure, and when you’re not, do nothing and let the miss stay visible.

Ground first, correct second

Two mechanisms — one before generation, one after.

Before: ground the model in real values. The schema I hand each agent isn’t just column names and types. For categorical columns it carries the actual distinct values from the table. The model isn’t asked to recall that a branch is called 'Alsancak' — it’s shown the set and told to pick from it. That turns a recall problem, which LLMs hallucinate their way through, into a selection problem, which they’re far better at. Most of the invented values died right here.
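A sketch of what that grounded schema can look like once rendered into the prompt, assuming the distinct values have already been fetched (for example with a `SELECT DISTINCT` per categorical column). The function name `render_schema`, the output layout, and the sample table are hypothetical.

```python
def render_schema(table, columns, distinct_values):
    """Render a table description that lists real values for categorical columns."""
    lines = [f"table: {table}"]
    for name, col_type in columns.items():
        line = f"- {name} ({col_type})"
        values = distinct_values.get(name)
        if values:
            quoted = ", ".join(f"'{v}'" for v in sorted(values))
            line += f"; allowed values: {quoted}"
        lines.append(line)
    return "\n".join(lines)


columns = {"branch": "STRING", "amount": "FLOAT64", "sold_at": "DATE"}
distinct_values = {"branch": {"Alsancak", "Bornova", "Karşıyaka"}}
schema_text = render_schema("sales", columns, distinct_values)
```

The model now picks `'Alsancak'` from a visible list, so it no longer has to spell the name from memory.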

After: canonicalize, but conservatively. Grounding reduces bad values; it doesn’t eliminate them. So before a query runs, each equality filter is checked against the column’s real values, and a clear near-miss gets rewritten — with a lot riding on that word clear:

def canonicalize_filters(sql, distinct_values):
    # distinct_values: {column -> the real values in that column}
    for column, literal in equality_filters(sql):
        if literal in distinct_values[column]:
            continue                                  # already exact — leave it
        match, score = closest(literal, distinct_values[column])  # difflib ratio
        if score >= THRESHOLD:                        # one unambiguous near-miss
            sql = rewrite(sql, column, literal, match)
        # else: do nothing. a visible "no results" beats
        # a silent swap into the wrong real value.
    return sql

The entire design lives in THRESHOLD and the else branch I didn’t write. I tuned the threshold deliberately high — conservative — against real query logs. I’d rather let ten genuine typos through and show “no results” than auto-correct one value into the wrong neighbour. Being too shy here costs a mild annoyance; being too eager costs a wrong business decision. When the two mistakes are that lopsided, you tune for the expensive one and accept looking dumb on the cheap one.

Two choices in there I’d defend if pushed. It’s post-generation and deterministic on purpose. I could have tried to push all of this into the prompt — “only use values from this list, fix typos carefully” — but prompt behaviour drifts across model versions, and you can’t unit-test a vibe. String similarity is boring, cheap, stable, and testable with a table of inputs and expected outputs. Boring is a feature in the one component whose entire job is to not corrupt data.¹
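Here is what "testable with a table of inputs and expected outputs" can look like for the value-matching step on its own. The threshold of 0.9 and the extra runner-up margin are assumptions made for this sketch (the real threshold is deliberately not given), and `canonical_value` is a hypothetical name. The scoring is `difflib.SequenceMatcher.ratio()` from the standard library.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.9  # hypothetical; tune against real logs
MARGIN = 0.05    # hypothetical: required gap between best match and runner-up


def canonical_value(literal, real_values, threshold=THRESHOLD, margin=MARGIN):
    """Return the real value to use, or None to leave the filter untouched."""
    if literal in real_values:
        return literal
    scored = sorted(
        ((SequenceMatcher(None, literal, value).ratio(), value) for value in real_values),
        reverse=True,
    )
    if not scored:
        return None
    best_score, best = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Rewrite only when the match is both close and clearly the only candidate.
    if best_score >= threshold and best_score - runner_up >= margin:
        return best
    return None


BRANCHES = ["Alsancak", "Bornova", "Karşıyaka"]
NUMBERED = ["Kadıköy 1", "Kadıköy 2"]

# (literal, real values, threshold, expected)
CASES = [
    ("Alsancak", BRANCHES, THRESHOLD, "Alsancak"),   # exact: left alone
    ("Alsanck", BRANCHES, THRESHOLD, "Alsancak"),    # one dropped letter
    ("Konak", BRANCHES, THRESHOLD, None),            # not close to anything
    ("Kadıköy 3", NUMBERED, 0.85, None),             # two equally close values
]
results = [canonical_value(lit, vals, threshold=t) for lit, vals, t, _ in CASES]
```

The last case is the one worth keeping in the table forever: two real values tie, so the function does nothing and the miss stays visible.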

What the benchmark misses

None of this shows up on a Text2SQL benchmark. Spider grades you on whether the SQL is correct, not on whether you avoided silently handing someone a confident lie. But in production, in front of a user who cannot check your work, “never wrong in a way they can’t see” is much closer to the metric that matters than exact-match accuracy.

I won’t pretend it’s solved. The threshold is hand-tuned, which is a generous way of saying I picked it by staring at logs until it felt right, and the honest next step is a small evaluation set so “did this change help?” has a number behind it instead of my gut. I knew that the whole time and shipped without it. Maybe that’s the next post.

From a consulting project building natural-language analytics over restaurant operations data. Customer details, schema, and the actual threshold are abstracted; the reasoning is as built.

Footnotes

The near-miss scoring is just difflib.SequenceMatcher ratio — Python standard library, nothing exotic. The interesting part was never the matching algorithm; it was deciding when not to trust it.↩︎

The router that’s allowed to say ‘I don’t know’

Umut Altun — Mon, 22 Sep 2025 00:00:00 GMT

My first version of the agent was one prompt to rule them all. The entire schema went into the context, the user’s question went at the bottom, and the model wrote SQL against all of it. It demoed fine. Then someone asked a question that touched sales and staffing, and the model wrote a join between two tables that shared a column name and absolutely nothing else. The result was a confident, beautifully formatted table of nonsense.

The instinct is to fix the prompt. Add instructions, add warnings, add examples of good joins. I did some of that, and it helped a little, and then I realised I was treating a structural problem as a wording problem.

The structural problem is schema linking — mapping a vague natural-language question onto the specific tables and columns that answer it. It’s the genuinely hard part of Text2SQL, and it gets worse as you add tables, not better. My data had several distinct domains — sales, inventory, waste, staffing, and so on — and stuffing all of them into one context turned every question into a needle-in-haystack search across tables that often used the same words to mean different things. More schema in the prompt meant more ways to be confidently wrong. No amount of prompt-polish fixes that; you’re asking one call to both figure out what the question is about and write correct SQL for it in a single shot.

So I split it. Before any SQL gets written, the question goes to a small classifier whose only job is to decide which domain the question belongs to, and dispatch it to a specialist that carries only that domain’s schema. Sales questions go to the sales specialist, which has never heard of the staffing tables and therefore cannot join to them. Decomposition shrinks each specialist’s schema-linking problem from “all tables” to “the handful that matter,” which is the difference between a search and a lookup.

The router is deliberately a different kind of component from the specialists. It runs at temperature zero — routing should be reproducible, not creative — and it returns a structured decision, not prose:

from pydantic import BaseModel  # assumption: BaseModel here is pydantic's

class RoutingDecision(BaseModel):
    specialist: str           # which domain agent should handle this
    confidence: float         # 0.0 to 1.0
    alternative: str | None   # the runner-up, if it was close

decision = router.classify(question)   # temp=0, JSON enforced at the model layer

Pulling routing out into its own deterministic step bought me two things I didn’t fully appreciate until later. The first is that it’s testable in isolation: I can run a fixed list of real questions through the router and assert where each one lands, without executing a single query. The creative step (writing SQL) and the step I need to be boringly predictable (deciding what the question is about) are now separable, and I can hold each to its own standard.
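A sketch of what "assert where each one lands" can look like. Everything here is illustrative: the Decision dataclass mirrors the structured output above, stub_classify is a keyword stand-in so the harness runs without a model, and the two questions are invented rather than taken from real logs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    specialist: str
    confidence: float
    alternative: Optional[str] = None


def misrouted(classify: Callable[[str], Decision], cases):
    """Return (question, expected, got) for every case the router gets wrong."""
    failures = []
    for question, expected in cases:
        got = classify(question).specialist
        if got != expected:
            failures.append((question, expected, got))
    return failures


def stub_classify(question: str) -> Decision:
    """Keyword stand-in for the real router, so the harness runs offline."""
    if "shift" in question or "staff" in question:
        return Decision("staffing", 0.9)
    return Decision("sales", 0.9, alternative="staffing")


CASES = [
    ("What was revenue last week?", "sales"),
    ("Who worked the Friday shift?", "staffing"),
]

failures = misrouted(stub_classify, CASES)
assert failures == [], failures
```

Swapping stub_classify for the real router call turns this into a regression test that never executes a query.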

The second is the part I’d actually put on a slide: confidence as a first-class output, and the right to refuse. A classifier that always returns its top guess is a classifier that’s confidently wrong on every ambiguous question. “How did the weekend go?” could be revenue or footfall or labour cost. The honest answer is “I’m not sure which you mean,” and the only way to give that answer is to look at how sure the router actually is:

if decision.confidence < CUTOFF:           # hand-tuned, deliberately cautious
    # "Did you mean sales or staffing?" One cheap question beats a wrong report.
    return ask_user(decision.specialist, decision.alternative)
return dispatch(decision.specialist, question)

That CUTOFF is the whole philosophy in one constant. Below it, the system stops and asks a one-line clarifying question instead of charging ahead. It costs the user a round-trip; it saves them a confident answer to a question they didn’t ask. For non-technical users who can’t read the SQL to catch the mistake, that trade is worth it every time.
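One edge the snippet above leaves open is a low-confidence decision with no runner-up, since alternative may be None. A minimal sketch of one way to handle it, with a placeholder cutoff and hypothetical wording for the questions:

```python
from typing import NamedTuple, Optional

CUTOFF = 0.7  # placeholder; the real value is hand-tuned and not published


class Decision(NamedTuple):
    specialist: str
    confidence: float
    alternative: Optional[str] = None


def next_step(decision: Decision, cutoff: float = CUTOFF):
    """Return ("dispatch", specialist) or ("ask", clarifying_question)."""
    if decision.confidence >= cutoff:
        return ("dispatch", decision.specialist)
    if decision.alternative is None:
        # No runner-up to offer, so fall back to an open question.
        return ("ask", "Which area is this about?")
    return ("ask", f"Did you mean {decision.specialist} or {decision.alternative}?")
```

Returning a tagged tuple instead of calling the user or the specialist directly keeps the decision itself free of side effects, so it can be tested the same way as the router.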

The honest cost of all this: it’s an extra LLM call on every single turn. More latency, more tokens, a second thing that can fail. I took that trade on purpose, because the alternative — folding routing back into generation to save the call — gives back the determinism, the testability, and the clean place to put the confidence check. I’d rather pay for a step I can reason about than save a call on a step I can’t.

If I were starting again I’d reach for the router on day one instead of discovering it the hard way. The single-prompt version isn’t a smaller version of the right design — it’s a different design that happens to look right until the schema grows past the size of a demo. The most useful thing I built into this system wasn’t a cleverer prompt. It was a component whose job includes knowing when it doesn’t know.¹

From a consulting project building natural-language analytics for restaurant businesses. Customer details, schema, and tuning constants are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Temperature zero on the router matters more than it sounds. You want the same question to route the same way every time — a router that occasionally changes its mind is a debugging nightmare, because now a bug reproduces only sometimes. Save the creativity for the step that writes SQL.↩︎

MMM wants two years of data; I had two months

Umut Altun — Tue, 24 Jun 2025 00:00:00 GMT

Every serious treatment of marketing mix modeling tells you the same thing: bring two or three years of data, ideally with real variation in spend, ideally a few natural experiments where you turned a channel off and watched. It’s sound advice. It also describes a situation I, and almost everyone actually doing this, rarely have. New games, new markets, a measurement need that’s urgent now — the request lands with a fraction of the history the textbook demands. The textbook answer is “then don’t run an MMM.” The job is to do something useful anyway, and the hard part is doing it without lying.

It helps to be precise about why MMM needs so much. You’re not fitting a line. You’re locating, per channel, an adstock decay and a saturation curve plus a coefficient, all at once, from channels that move together so the data can barely tell them apart. That’s a lot of curve to pin down, and the information to do it comes from variation: one channel’s spend going up while another’s goes down, a channel pausing, a budget shock. Calendar time isn’t really what you’re short of. Variation is. Sixty quiet days where every channel drifts up together contain almost no information about any individual channel’s curve, no matter how many rows that adds up to.

The wrong response — the tempting one, because it always produces output — is to let the model run regardless and report whatever it returns. It will return something. A maximum-likelihood fit always hands you numbers; Bayesian sampling always hands you a posterior. The numbers will be precise-looking and the dashboard will render them and a UA lead will, reasonably, treat them as real and move budget. That’s the actual danger of data-hunger: not that the model errors out, but that it returns a confident answer indistinguishable on the surface from a good one, built on data that couldn’t possibly support it. A model that fails loudly is safe. A model that fails quietly, with a clean-looking number, is how thin data becomes a bad decision.

So the design principle I leaned on is that the model has to know when it knows nothing, and say so. Concretely, two things.

First, a graceful fallback to the null answer. When a market is below the data it’d need — too few installs, too little spend movement — the system doesn’t fit a heroic model on noise. It returns the honest default: a k-factor of 1.0, “no measurable halo,” no claim of lift it can’t support.

# below the data floor, the model declines to invent a signal.
# k = 1.0 means "no halo detected", not "we measured zero halo".
if not enough_signal(market):
    return Result(k_factor=1.0, basis="insufficient_data")
return fit_mmm(market)

The distinction in that comment is the entire ethic. “No halo detected” is an admission of ignorance; “we measured zero halo” is a measurement. The fallback is making the first claim, never the second — and tagging the output with its basis, so downstream nobody mistakes a default for a finding.
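For a sense of what a data-floor check might contain, here is a hedged sketch. The footnote describes the real one only as a threshold on volume and on spend variation, so the signature, both floors, and the use of the coefficient of variation are assumptions made for this example.

```python
import numpy as np

# Placeholder floors; the real thresholds are not published.
MIN_INSTALLS = 5_000
MIN_SPEND_CV = 0.25  # coefficient of variation of daily spend


def enough_signal(daily_installs, daily_spend,
                  min_installs=MIN_INSTALLS, min_cv=MIN_SPEND_CV):
    """True only if there is both enough volume and enough spend movement."""
    installs = np.asarray(daily_installs, dtype=float)
    spend = np.asarray(daily_spend, dtype=float)
    mean_spend = spend.mean()
    if mean_spend <= 0:
        return False  # no spend at all: nothing to learn a curve from
    cv = spend.std() / mean_spend
    return bool(installs.sum() >= min_installs and cv >= min_cv)


flat_spend = [1000.0] * 60              # sixty quiet days
moving_spend = [500.0, 1500.0] * 30     # same total, real variation
installs = [200] * 60

assert not enough_signal(installs, flat_spend)
assert enough_signal(installs, moving_spend)
```

The two spend series have the same row count and the same total, which is the point: only the second one moves enough to say anything about a curve. A fuller check would also look at variation between channels, not just within one.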

Second, and this is what makes the fallback honest rather than just convenient: the priors do double duty. In the Bayesian setup, when the data is too thin to say much, the posterior simply stays near the prior — which is the correct behaviour, as long as you read it correctly. A posterior sitting on top of its prior isn’t the data confirming your belief. It’s the data having nothing to add, and the model honestly reporting your starting assumption back to you. The failure isn’t the model returning the prior. The failure is you reading “posterior ≈ prior” as a result. So the discipline is to always check how far the posterior actually moved, and treat “it didn’t move” as a signal that this market can’t support a model yet — not as confirmation that you were right all along.
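Checking how far the posterior moved can be as simple as comparing draws. The sketch below is one common way to do it, not necessarily the check used in this system: it reports how much the variance shrank and how far the mean travelled, using simulated draws in place of real sampler output.

```python
import numpy as np


def prior_to_posterior_shift(prior_samples, posterior_samples):
    """Return (contraction, mean_shift) for one parameter.

    contraction = 1 - var(posterior) / var(prior); near 0 means the data
    barely narrowed the belief. mean_shift is the move of the mean,
    measured in prior standard deviations.
    """
    prior = np.asarray(prior_samples, dtype=float)
    post = np.asarray(posterior_samples, dtype=float)
    contraction = 1.0 - post.var() / prior.var()
    mean_shift = abs(post.mean() - prior.mean()) / prior.std()
    return contraction, mean_shift


# Simulated draws standing in for real sampler output.
rng = np.random.default_rng(0)
prior = rng.normal(0.0, 1.0, size=20_000)
thin_market = rng.normal(0.0, 1.0, size=20_000)   # posterior sits on the prior
rich_market = rng.normal(0.8, 0.2, size=20_000)   # data moved and narrowed it

thin = prior_to_posterior_shift(prior, thin_market)
rich = prior_to_posterior_shift(prior, rich_market)
```

Both numbers near zero, as in the thin market, is the "it didn't move" case: the model is handing the starting assumption back.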

And the maturity of a modeling system isn’t in how well it performs when data is rich — it’s in how it behaves when data is poor. Anyone can fit a clean model to two years of varied history. The professional question is what your system does on the market with sixty flat days, because that’s where the quiet, confident, wrong numbers come from. A system that degrades to “I don’t have enough to say” is worth more than one that always produces a number, because the always-a-number system is indistinguishable from the honest one exactly when it’s lying. Building the model was the easy part. Building it to know the edge of its own competence, and stop at it, was the part that made it safe to put in front of someone spending real money.¹

From work on a marketing-analytics system for a mobile-gaming portfolio. Thresholds, channels, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

What counts as “enough signal” is itself a judgment I tuned rather than derived, and I won’t pretend otherwise — it’s a threshold on volume and on spend variation, set conservative, checked against markets where I had enough data to know the right answer and could see where the thin-data version started diverging from it. A more principled version would lean on the posterior’s own width — if the credible interval on a channel’s effect spans everything from “useless” to “incredible”, that is the model telling you it doesn’t know, in a language more honest than any threshold I’d hard-code.↩︎

One model per country, and the tax I paid for it

Umut Altun — Tue, 08 Apr 2025 00:00:00 GMT

My first marketing mix model was one model for the whole portfolio, all countries pooled. It fit fine, the diagnostics looked reasonable, and it was quietly describing a country that doesn’t exist.

Because a single global MMM estimates one set of channel effects — one TikTok coefficient, one saturation curve per channel — averaged across every market at once. And the markets are nothing alike. TikTok might be the dominant channel in one country and an afterthought in another; a dollar buys wildly different things in the US versus Brazil versus Indonesia; the saturation points differ by an order of magnitude because the addressable audiences do. Pool all of that and the model hands you the average channel effect across markets that share almost nothing — a blended number that’s correct for nowhere. Worse, it’ll confidently recommend shifting budget toward a channel that’s saturated in your biggest market just because it’s still cheap in a small one, because it can’t see the markets separately to know the difference.

So I split it: one MMM per country. Each market gets its own model, its own coefficients, its own adstock and saturation curves. Now the US model speaks for the US, and when it says TikTok is saturating, it means in the US, where you actually spend the money. The recommendations finally apply to a real place.

And the instant I did that, every model was starving.

This is the tax, and it’s unavoidable, so it’s worth stating plainly. MMM is data-hungry to begin with — you’re locating several nonlinear curves at once, and that needs a lot of history. Split your data by country and you’ve sliced that same history into N thinner piles, each feeding a model that’s just as hungry as the global one was. Your biggest markets have enough to fit on; your long tail of smaller countries have a handful of noisy days each, nowhere near enough to identify a curve. You traded one model that was confidently wrong for thirty models, half of which are individually too data-starved to trust. That’s not obviously a good trade, and pretending it is would be the easy lie.

What makes it a good trade is refusing to treat all thirty the same. Two things carry it.

First, a data threshold: a country only gets its own model if it clears a minimum volume. Below that line you don’t fit a desperate model on noise and present its output with a straight face — you fall back. Pool it into a regional or global model, or carry a portfolio-level prior. The small market gets a more pooled, more conservative estimate, the big market gets its own specific one, and you’re matching model granularity to the data each market can actually support instead of forcing one resolution on all of them.

# granularity follows the data, not the org chart
if country_volume(c) >= THRESHOLD:
    fit_country_model(c)          # enough signal to stand alone
else:
    fall_back_to_pooled(c)        # too thin: borrow strength, don't fake it

Second, and this is the bigger idea I only half-appreciated at the time: the right answer isn’t fully pooled or fully split — it’s partial pooling, and per-country-with-a-threshold is the poor man’s version of it. A proper hierarchical Bayesian model would let countries share a common prior and pull each country’s estimate toward the global mean in proportion to how little data it has — big markets dominated by their own data, small markets gracefully shrunk toward the portfolio average, all on one continuous dial instead of my hard in-or-out cutoff. I implemented the discrete version (own model above the line, pooled below) because it was simpler to build and operate and reason about, and it captures most of the benefit. But the hard threshold is a step function approximating a smooth one, with all the usual ugliness at the boundary — two nearly-identical small countries landing on opposite sides of the line and getting visibly different treatment. The hierarchical model is the right tool and it’s the upgrade I’d prioritize.
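The continuous dial can be illustrated without a full hierarchical model. The sketch below is the textbook shrinkage weight from a normal-normal setup, not the system described here; k is a hypothetical pseudo-count and the numbers are invented.

```python
def shrunk_estimate(country_mean, n_obs, global_mean, k=30.0):
    """Pull a country's estimate toward the global mean; less data, more pull.

    k is a hypothetical pseudo-count: at n_obs == k the two means get equal
    weight. In a normal-normal model it plays the role of the noise variance
    divided by the between-country variance.
    """
    weight = n_obs / (n_obs + k)
    return weight * country_mean + (1.0 - weight) * global_mean


GLOBAL_MEAN = 1.10

big_market = shrunk_estimate(1.40, n_obs=600, global_mean=GLOBAL_MEAN)
small_market = shrunk_estimate(1.40, n_obs=10, global_mean=GLOBAL_MEAN)
no_data = shrunk_estimate(1.40, n_obs=0, global_mean=GLOBAL_MEAN)
```

With the same raw estimate, the large market stays close to its own number, the small one lands near the portfolio average, and a market with no data falls back to the average entirely. The hard threshold is this same function with the weight rounded to 0 or 1.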

The way I think about it now: aggregation level is a bias-variance knob, and the extremes are almost never right. Fully pooled is maximum bias — one answer for places that have nothing in common. Fully split is maximum variance — every segment fits its own noise. The interesting work is always in the middle: how much should this segment speak for itself, and how much should it borrow from its neighbours, given how much data it actually has. Per-country MMM was my first real encounter with that question. It wasn’t the last — it’s the same question as cohort sizing, as geographic A/B analysis, as basically every problem where you have to decide how finely to slice before the slices turn to noise.¹

From work on a marketing-analytics system for a mobile-gaming portfolio. Markets, thresholds, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

The reason I’d push hard for the hierarchical version next: it makes the borrowing automatic and proportional instead of manual and binary. With a hard threshold I’m implicitly deciding how much a sub-threshold country should trust the global mean (answer: entirely) and a supra-threshold one should (answer: not at all), and both of those are wrong — a medium country should borrow somewhat. Partial pooling derives that “somewhat” from the data instead of making me pick a cliff, which is exactly the kind of judgment you want the model making rather than the config file.↩︎

iOS broke attribution, so I stopped attributing

Umut Altun — Tue, 21 Jan 2025 00:00:00 GMT

For years, mobile UA ran on a comfortable assumption: when someone installs your game, you know which ad they came from. A device identifier tied the tap on a Meta ad to the install that followed, deterministically, per user. Whole optimization stacks were built on that link. Then Apple’s App Tracking Transparency arrived, most users declined to be tracked, the identifier went dark, and that link quietly became fiction on iOS — replaced by SKAdNetwork, which hands back delayed, aggregated, deliberately privacy-fuzzed install counts instead of clean per-user trails.

The first instinct — mine included — was to treat this as a signal-quality problem and patch it. Model the missing conversions, reconstruct the probable attribution, stitch SKAdNetwork postbacks back into something resembling the old per-user view. A lot of very smart engineering went into this across the industry, and it always felt like what it was: building an ever-more-elaborate prosthesis for a limb that wasn’t coming back. You were estimating the per-user link, then building your decisions on the estimate as if it were the measurement, compounding one layer of uncertainty on top of another.

What actually helped was to stop asking “how do I recover attribution” and start asking “what was attribution ever for.” It was for deciding where the budget goes. And the marketing mix model I’d already built for organic lift answered that question without ever needing a per-user link — because MMM never looked at individual users in the first place. It works at the aggregate level: total spend per channel per day, total installs per day, the relationship between the two over time. It was, almost by accident, already privacy-proof. The thing ATT broke was a thing MMM never depended on.

That changed the posture completely. Instead of a degraded attribution signal propped up by modeling, I had a method whose required inputs — aggregate spend, aggregate installs — Apple’s changes didn’t touch at all. The privacy wall that demolished per-user attribution is no obstacle to a model that only ever read the totals.

But it forced one real change, and it’s the part worth writing down. On iOS, installs per channel are now the unreliable, privacy-fuzzed quantity — that’s exactly what SKAdNetwork degraded. So on iOS I stopped feeding the model paid installs and fed it the one number Apple can’t obscure: spend. You always know what you spent. Your own billing is ground truth, immune to anyone’s privacy policy. Android, where attribution still largely works, keeps using installs as the input. The platforms diverge on purpose:

# the input variable depends on what each platform can still measure honestly
if platform == "ios":
    media = paid_spend       # SKAdNetwork fogged installs; billing is ground truth
else:  # android
    media = paid_installs    # per-install attribution still reliable here

That looks like a small config branch, but the rule behind it matters: feed the model the most reliable observable, and let that differ by platform. On iOS the trustworthy observable is what left your bank account; on Android it’s still the install count. Rather than force both platforms through one pipeline and pretend iOS installs are as solid as Android’s — quietly poisoning the model with a number you know is fogged — you let each platform contribute the signal it can actually stand behind. The model on iOS answers “how do organic installs respond to spend,” the Android model answers “how do organic installs respond to paid installs,” and both are honest about their own inputs instead of one of them laundering a broken number.

There’s a real cost here. Spend and installs aren’t interchangeable inputs — spend folds in price (CPI moves with auction dynamics and seasonality, so a spend-based model partly measures cost fluctuation, not just volume), and the two platforms’ results no longer sit on the same axis, which makes a clean cross-platform total something you assemble carefully rather than read off. I accepted that. A coherent measurement built on a number that’s actually true beats an apples-to-apples comparison built on a number I know is fiction on one side.

The wider point: when a measurement breaks, your best move is often not to repair the measurement but to find a decision-making method that never required it. Reconstructing the old per-user signal was the locally obvious move and a strategic dead end — pouring effort into restoring a capability the platform had deliberately removed and would keep removing. Switching to a method that operates on data the privacy changes don’t touch was the move that aged well, because it stopped fighting the direction the whole ecosystem was visibly heading. The future of measurement in a privacy-first world looks a lot more like aggregate modeling and a lot less like following individual users around, and ATT was just the early, loud announcement of it.¹

From work on a marketing-analytics system for a mobile-gaming portfolio. Platform specifics, channels, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

This isn’t “attribution is dead, only model.” Deterministic attribution is still genuinely useful where it works (Android, web, logged-in surfaces), and incrementality tests remain the causal gold standard. The argument is narrower: don’t build your core budget decisions on a per-user signal the platform is actively dismantling. Use attribution where it’s honest, and make sure the decisions that really matter rest on something privacy changes can’t take away.↩︎

The installs you can’t attribute

Umut Altun — Tue, 12 Nov 2024 00:00:00 GMT

Here’s a number that should bother any UA team more than it does: a meaningful share of your “organic” installs aren’t organic. They were caused by your paid spend — just not in a way any attribution tool can see.

The mechanism is obvious once you say it out loud. You run a big Meta campaign. It drives paid installs, which attribution dutifully records. But it also pushes the game up the store charts, where new people discover it organically. Some of those paid users tell a friend, or just get seen playing. The campaign manufactured installs that nobody clicked an ad for — and your attribution tool, which can only credit an install it directly touched, files every one of them under “organic” and moves on. The spend gets none of the credit for the demand it actually created.

This is the halo effect, and it’s not a rounding error. If you optimize your UA purely on attributed paid ROI — which is what almost everyone does, because it’s what the dashboards show — you systematically underspend, because you’re crediting each channel only with the installs it directly touched and ignoring the wave of organic demand it set off behind them. You’re flying on an instrument that can’t see a chunk of the thing you’re trying to maximize.

So I stopped trying to measure paid ROI better and changed the question the model was answering. Instead of “how many paid installs did each channel get,” I pointed a marketing mix model at a different target entirely: organic installs. Not paid. Organic — the very installs attribution calls free — as the response variable, with paid spend per channel as the inputs.

That inversion is the whole idea, and it took me a while to be comfortable with how strange it looks. You’re regressing the thing you supposedly can’t buy onto the things you did buy, and asking: when paid spend moves, how much does the organic baseline move with it? Whatever the model can explain — the portion of organic installs that systematically rises and falls with paid spend, after adstock and saturation — is the halo. It’s paid-driven demand hiding in the organic numbers, and MMM can see it precisely because it works at the aggregate level, on correlations over time, instead of trying to trace individual clicks the way attribution does. Attribution asks “did this person touch an ad.” MMM asks “does organic move when spend moves.” Only the second question can find an install that nobody clicked.

The model decomposes organic installs into two parts. The baseline — the intercept, plus trend and seasonality — is the genuinely organic demand: what you’d get at zero paid spend, the brand, the back-catalogue, the season. The media contributions are the halo: the slice of organic installs that each channel’s spend is driving up on top of that baseline.

# target is ORGANIC installs. the decomposition splits them into the
# true baseline (what you'd get at zero spend) and the paid-driven halo.
contributions = create_media_baseline_contribution_df(
    media_mix_model=mmm,
    target_scaler=target_scaler,
    channel_names=channels,
)
# baseline  -> genuinely organic
# per-channel -> organic installs that paid spend manufactured

And then the number that actually changes the conversation — collapse it into a k-factor per channel, a virality multiplier:

# k = 1.15 means: every 100 paid installs from this channel come with
# ~15 organic installs in their wake that attribution credited to nobody.
k_factor = (paid_installs + halo_installs) / paid_installs

A k-factor of 1.0 means a channel buys exactly the installs it’s credited with and nothing spills over. Above 1.0 means it manufactures organic demand on top — and now you can compare channels on their true pull, paid plus halo, instead of the attributed-paid number that flatters the channels with no halo and punishes the ones quietly driving your charts. Two channels with identical attributed ROI can have very different real value once you count what they set off, and the team that knows that allocates budget differently from the team that doesn’t.
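A small worked version of that comparison, with invented totals and hypothetical channel names. The only addition to the formula above is a guard for a channel with no paid installs, where the ratio is undefined.

```python
def k_factor(paid_installs, halo_installs):
    """(paid + halo) / paid; None when the channel bought no installs."""
    if paid_installs <= 0:
        return None  # the ratio is undefined without paid installs
    return (paid_installs + halo_installs) / paid_installs


# Hypothetical per-channel totals over the same window.
channels = {
    "channel_a": {"paid": 10_000, "halo": 1_500},
    "channel_b": {"paid": 10_000, "halo": 0},
}

k_by_channel = {
    name: k_factor(totals["paid"], totals["halo"])
    for name, totals in channels.items()
}
```

Both channels bought the same number of installs, so attributed reporting would rank them equally; the k-factors of about 1.15 and 1.0 are what separate them.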

One caveat, and it’s a big one. This is a correlational decomposition, not a clean causal experiment. The model attributes the co-movement of organic installs and paid spend to the halo, but co-movement isn’t proof — a confounder that drives both (a seasonal surge, a press hit that coincided with a planned spend ramp) gets quietly absorbed into a channel’s contribution, and the model can’t tell that apart from genuine halo on its own. The gold standard for incrementality is a geo holdout or a proper lift test, where you actually withhold spend and watch what happens. MMM is the always-on, every-channel estimate you run when you can’t afford to stop spending everywhere to find out — and the right way to use it is to validate it against the occasional real lift test, lean hard on the priors, and read the k-factors as well-reasoned estimates rather than measured facts. I’d rather an honest estimate of the right quantity than a precise measurement of the wrong one, and attributed paid ROI is a precise measurement of the wrong one.

So your “organic” bucket is partly a measurement artifact — it’s where attribution files the demand it couldn’t trace, including the demand your own spend created. The most valuable thing a channel does might be the installs it doesn’t get credited for, and the only tools that can see those are the ones that stopped trying to follow the click.¹

From work on a marketing-analytics system for a mobile-gaming portfolio. Channels, k-factors, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Why not just run lift tests everywhere and skip the modeling? Because a clean geo holdout means deliberately turning off spend in real markets and eating the lost installs to measure the counterfactual — expensive, slow, and politically hard to do on every channel every quarter. The pragmatic stack is both: occasional lift tests as ground truth, MMM as the continuous estimate calibrated against them. Neither alone is enough; the lift test is right but rare, the model is always-on but assumption-dependent.↩︎

Marketing isn’t linear, and neither is my model

Umut Altun — Tue, 17 Sep 2024 00:00:00 GMT

A plain regression of installs on spend makes two assumptions so obviously false that it’s a small miracle it works at all. It assumes a dollar spent on Tuesday affects Tuesday’s installs and nothing else. And it assumes the ten-thousandth dollar of the day drives exactly as many installs as the first. Both are wrong, in opposite directions, and getting marketing mix modeling to be useful is mostly about replacing each one with something that matches how advertising actually behaves.

Take the timing one first. You run a burst of TikTok spend today. Some installs land today — but some land tomorrow, and the day after, as people who saw the ad get around to acting on it, as the creative circulates, as the algorithm keeps serving it. The effect of today’s spend is smeared forward over the following days, decaying as it goes. A model that credits today’s spend only with today’s installs misreads this completely: it sees spend, then a lagged bump in installs it can’t connect to the cause, and it either misses the effect or pins it on whatever else happened to move that day.

The fix is adstock (carryover): before the spend ever enters the model, you transform it so each day inherits a decaying echo of the days before it.

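import numpy as np

# geometric adstock: today carries a fading memory of prior spend.
# lam ~ 0  -> effect is almost entirely same-day
# lam ~ 0.8 -> a long tail; today's spend still matters a week later
def adstock(spend, lam):
    # work in floats: with integer spend, zeros_like would give an integer
    # array and the decayed values would be silently truncated
    spend = np.asarray(spend, dtype=float)
    out = np.zeros_like(spend)
    out[0] = spend[0]
    for t in range(1, len(spend)):
        out[t] = spend[t] + lam * out[t - 1]
    return out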

The decay rate isn’t something you set — it’s something you learn, and that’s the point. A channel with a long carryover (brand-ish, slow-burn) and one with an instant, all-same-day response get different decay rates, and the data tells you which is which. In the Bayesian setup this lam is just another parameter with a prior, inferred alongside everything else.
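It can help to see what lam does to a single day of spend. The sketch below writes the same geometric recurrence as a convolution with weights lam**k, which is an equivalent formulation and not necessarily how any particular library implements it; the function name is hypothetical.

```python
import numpy as np


def adstock_conv(spend, lam):
    """Geometric adstock as a convolution: out[t] = sum_k lam**k * spend[t-k]."""
    spend = np.asarray(spend, dtype=float)
    weights = lam ** np.arange(len(spend))
    return np.convolve(spend, weights)[: len(spend)]


pulse = [100, 0, 0, 0, 0]                  # one day of spend, then nothing
short_memory = adstock_conv(pulse, 0.1)    # 100, 10, 1, 0.1, 0.01
long_memory = adstock_conv(pulse, 0.8)     # 100, 80, 64, 51.2, 40.96
```

Over an infinite horizon a pulse of 100 carries a total of 100 / (1 - lam), so about 111 at lam = 0.1 and 500 at lam = 0.8. That total is why the decay rate matters so much for how much credit a channel ends up with.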

Now the second assumption, which is worse, because it’s the one that makes the model’s advice dangerous. Advertising saturates. The first slice of budget on a channel hits the cheapest, most responsive users; as you pour in more, you’re reaching down into people who are progressively less interested, and each additional dollar buys fewer installs than the last. The response of installs to spend isn’t a line, it’s a curve that bends over — steep at first, flattening toward a ceiling. Diminishing returns, the most reliable empirical fact in all of performance marketing.

A linear model cannot represent this, and the failure isn’t academic. If installs-per-dollar is a constant slope, the model thinks the channel never saturates, so its honest recommendation is “put infinite budget here,” because the marginal return never drops. Every linear MMM, asked where to spend more, will eventually tell you to bet everything on one channel, because it has no concept of “full.” That’s not a minor glitch. It’s the model confidently recommending the one thing every marketer knows is wrong.

So spend goes through a saturation curve, and only then does it hit the linear part of the model. The curve here is a Hill curve, a transform with parameters for where the bend sits and how sharp it is (S-shaped when the steepness is above 1, plain concave otherwise):

# Hill saturation: turns spend into effective spend, with a ceiling.
# half-saturation point K and steepness s are LEARNED per channel.
def hill(spend, K, s):
    return spend**s / (spend**s + K**s)

The pipeline per channel is the composition of the two: raw spend → adstock (smear it forward in time) → saturation (bend it for diminishing returns) → then a linear coefficient. Which is the thing that finally made MMM click for me: it isn’t a regression with marketing-flavoured variables. It’s a structured curve-fitting problem where the shape of each curve is the marketing knowledge. The adstock decay encodes “how long does this channel’s effect linger,” the saturation curve encodes “how fast does this channel get tired,” and the linear coefficient on top is almost the least interesting parameter in the whole stack.
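The composition can be written out in a few lines. This is a sketch with fixed, made-up parameter values; in the real model lam, K, s, and the coefficient are inferred, and channel_effect is a hypothetical name.

```python
import numpy as np


def adstock(spend, lam):
    """Geometric carryover: each day inherits lam times the previous total."""
    spend = np.asarray(spend, dtype=float)
    out = np.zeros_like(spend)
    out[0] = spend[0]
    for t in range(1, len(spend)):
        out[t] = spend[t] + lam * out[t - 1]
    return out


def hill(x, K, s):
    """Hill saturation: 0 at zero spend, 0.5 at x == K, approaching 1."""
    x = np.asarray(x, dtype=float)
    return x**s / (x**s + K**s)


def channel_effect(spend, lam, K, s, beta):
    """raw spend -> adstock -> Hill saturation -> linear coefficient."""
    return beta * hill(adstock(spend, lam), K, s)


# Diminishing returns with carryover switched off, to isolate the bend.
low = channel_effect([100.0] * 5, lam=0.0, K=100.0, s=1.0, beta=1000.0)
high = channel_effect([200.0] * 5, lam=0.0, K=100.0, s=1.0, beta=1000.0)
```

With these values, 100 a day yields 500 and 200 a day yields about 667: doubling the budget adds a third, not a double. A straight line has no way to say that.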

The cost is that you’ve added two nonlinear parameters per channel, and they trade off against each other in ways that make the fit harder and the identifiability problem sharper — a strong-carryover-low-saturation curve and a weak-carryover-high-saturation one can mimic each other over a short window. This is exactly why the priors matter and why MMM is data-hungry; you’re asking the data to locate several curves at once. But there’s no shortcut worth taking. The linearity assumptions aren’t a simplification you can defend as “good enough” — one of them blinds the model to lagged effects and the other makes it recommend infinite spend. The curves aren’t sophistication for its own sake. They’re the minimum required to not be actively wrong.

I’ve kept one habit from this: when a model gives obviously broken advice, look at the shape it’s assuming before anything else. A linear model recommending infinite budget isn’t mis-tuned. It’s faithfully reporting that a straight line has no top — and the fix is to give it a curve that does.¹

From work on a marketing-analytics system for a mobile-gaming portfolio. Channels, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Adstock and saturation interact with order, and the order is a genuine modeling choice, not a detail. Adstock-then-saturation (smear time first, then bend) says the carryover accumulates as raw attention and then saturates; saturation-then-adstock says each day saturates on its own and the saturated effects carry forward. They give different curves, the libraries pick a convention, and it’s worth knowing which one you’ve signed up for rather than discovering it in a posterior you can’t explain.↩︎

Why my marketing model has opinions before it sees data

Umut Altun — Tue, 30 Jul 2024 00:00:00 GMT

The first marketing mix model I built was an ordinary regression: installs as the response, weekly spend per channel as the predictors, fit by least squares. It produced coefficients, the coefficients implied a return per channel, and the returns were nonsense — one channel came back with a negative effect, as if spending money on it actively destroyed installs.

I assumed I’d made a mistake. I hadn’t, not really. I’d just run face-first into the thing nobody tells you about marketing mix modeling: the data does not identify the answer. Several completely different stories about which channel drives what will fit your data about equally well, and ordinary regression picks one of them essentially at random: whichever one happens to shave the last sliver off the squared error, including the ones that route a coefficient negative to cancel out a correlated neighbour.

The reason is structural. Marketing channels move together. When a game is doing well the team scales spend up across Meta and TikTok and Google all at once; when it’s cutting back, everything comes down together. So the spend columns are heavily correlated, and correlated predictors are poison for regression — the model can’t tell whether installs followed Meta or TikTok, because the two are nearly the same column. It splits the credit arbitrarily, and “arbitrarily” includes giving one channel a huge positive coefficient and its correlated twin a negative one. This is multicollinearity, and in MMM it isn’t an edge case, it’s the default condition. You can’t regularize your way out of it with ridge or lasso either — those stabilize the fit, but they shrink toward zero, which is its own unjustified opinion (“all channels are probably ineffective”) wearing the costume of neutrality.
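A small simulation can make the credit-splitting concrete. The sketch below is synthetic and not taken from the system described here: two made-up channels share the same true effect, their spend differs only by a little noise, and ordinary least squares is refit on many fresh samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_refits = 60, 200
true_effect = np.array([2.0, 2.0])       # both channels genuinely help

coefs = np.empty((n_refits, 2))
for i in range(n_refits):
    base = rng.uniform(50, 150, size=n_days)                 # shared budget level
    spend = np.column_stack([base, base + rng.normal(0, 1, n_days)])
    installs = spend @ true_effect + rng.normal(0, 20, n_days)
    X = np.column_stack([np.ones(n_days), spend])             # intercept + two channels
    coefs[i] = np.linalg.lstsq(X, installs, rcond=None)[0][1:]

spread_single = coefs[:, 0].std()         # how much one channel's coefficient wanders
spread_total = coefs.sum(axis=1).std()    # how much the combined effect wanders
share_negative = (coefs < 0).any(axis=1).mean()

print(f"spread of one coefficient: {spread_single:.2f}")
print(f"spread of their sum:       {spread_total:.2f}")
print(f"refits with a negative coefficient: {share_negative:.0%}")
```

The sum of the two coefficients is pinned down tightly while each one on its own swings widely. The data knows what the pair did together and has very little to say about who did it.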

What finally fixed it for me: the problem was never that I lacked an opinion about these channels. I had plenty. I knew roughly what a reasonable return looks like for paid UA in this space. I knew a channel can’t have a negative causal effect on installs — at worst it does nothing. I knew, within a factor, how the big networks tend to rank. I had all of this in my head, and I was running a method that threw it away and demanded the 60 days of data carry the entire load by themselves. No wonder it buckled.

So I switched to Bayesian MMM — in practice, Google’s LightweightMMM, which puts the whole thing in a probabilistic model and samples the posterior with NUTS via numpyro. And the single most important thing that buys you isn’t the fancy sampler or the credible intervals. It’s that the framework has a designed slot for the opinions you already hold. They’re called priors, and they turn “I have a hunch the model keeps ignoring” into a formal input the math has to respect.

# the model is told, before it sees a single day of data, roughly where
# each channel's effect should sit — and that effects are non-negative.
# the data updates these beliefs; it doesn't start from a blank slate.
mmm.fit(
    media=spend,                 # (n_days, n_channels), correlated columns
    media_prior=channel_priors,  # my domain belief about each channel's pull
    target=installs,
    number_warmup=1000, number_samples=1000,
)

The priors do the work that the data structurally can’t. Where the likelihood is flat — where the data genuinely can’t distinguish Meta from TikTok because they moved together — the posterior leans on the prior, and you get a sane, non-negative attribution near what you believed going in. Where the data is informative — a week where one channel moved and the others didn’t, a natural experiment the team didn’t know it ran — the likelihood dominates and the posterior moves off the prior toward what actually happened. The model spends its limited evidence updating the beliefs the data can actually speak to, instead of flailing on the ones it can’t. That’s exactly the behaviour you want, and it’s the behaviour OLS can’t give you because OLS has nowhere to put a belief.
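The lean-on-the-prior behaviour is easiest to see in a one-parameter toy. The sketch below uses the textbook conjugate Normal update, which is an assumption chosen for readability and not the model LightweightMMM fits; `posterior_normal` and all the numbers are hypothetical.

```python
def posterior_normal(prior_mean, prior_sd, data_mean, data_se):
    """Conjugate Normal update: a precision-weighted average of prior and data."""
    w_prior = 1.0 / prior_sd**2
    w_data = 1.0 / data_se**2
    mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    sd = (w_prior + w_data) ** -0.5
    return mean, sd

# Flat likelihood: the data barely constrains this channel (huge standard error),
# so even a nonsensical negative estimate hardly moves the posterior.
flat_mean, flat_sd = posterior_normal(prior_mean=2.0, prior_sd=0.5,
                                      data_mean=-3.0, data_se=5.0)

# Informative likelihood: a clean natural experiment (small standard error),
# so the posterior leaves the prior and follows the data.
sharp_mean, sharp_sd = posterior_normal(prior_mean=2.0, prior_sd=0.5,
                                        data_mean=3.5, data_se=0.1)

print(round(flat_mean, 2), round(sharp_mean, 2))
```

The same arithmetic doubles as the check on over-tight priors: if the posterior mean never leaves the prior mean however the data comes in, either the prior is too strong or the data is too weak.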

Setting those priors well is its own craft, and an honest one — I’ve written separately about how I anchored them and the judgment that takes, because a prior is a claim and you should be able to defend it. But every MMM encodes prior beliefs. OLS encodes “I believe nothing, and I’m comfortable with a negative coefficient if it fits.” Ridge encodes “I believe every effect is probably near zero.” Neither of those is actually neutral — they’re just opinions held by accident, by people who didn’t realize they were choosing them. Bayesian MMM’s only real difference is that it makes you say your opinion out loud, in a place where someone can challenge it, instead of smuggling it in through your choice of regularizer.

I used to think putting priors on a model was cheating — tilting the result toward what I wanted. I had it backwards. The priors weren’t the bias. Pretending I didn’t have any was the bias, and the negative coefficient was what that pretense cost me. The model that states its opinions before it sees the data is the more honest one, because at least you can argue with it.¹

From work on a marketing-analytics system for a mobile-gaming portfolio. Channels, numbers, and priors are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

There’s a real failure mode on the other side: priors so tight the data can never overrule them, at which point you’re not modeling, you’re just reading your assumptions back out with extra steps. The discipline is to set priors you’d defend as a starting belief and then check how far the posterior actually moved — if it never moves, your priors are too strong or your data is too weak to be running an MMM at all. That second possibility is worth taking seriously more often than people do.↩︎

A thousand simulations per cohort, and never a loop in sight

Umut Altun — Tue, 18 Jun 2024 00:00:00 GMT

The requirement: a thousand Monte Carlo draws per cohort, each draw a full horizon of retention and revenue, across thousands of cohorts, refreshed on a schedule, running on serverless functions that have a timeout and bill by the millisecond. Written the obvious way, that’s a non-starter. The reason it works anyway is a single discipline — never write the loop.

Monte Carlo carries a reputation for being the honest-but-too-slow option, the thing you’d love to use for proper confidence intervals if only you could afford it. That reputation comes entirely from the naive implementation, the one that reads like the textbook description: for each cohort, for each of a thousand simulations, for each day of the horizon, draw and accumulate. Three nested Python loops. It’s correct, it’s readable, and it is unusably slow — millions of Python-level iterations per cohort, interpreter overhead on every one, and you’ve blown the Lambda timeout on a single game.

The shift is to stop picturing a thousand simulations happening one after another, and start treating the whole batch as one object. The samplers already hand you the entire block — np.random.beta with a size of (simulations × days) returns all of it in one call — and every operation after that is array-at-a-time, executed in NumPy’s C internals instead of the Python interpreter:

# the sampler already returns the whole (N_SIMS, horizon) block at once
retention = np.random.beta(a, b, size=(N_SIMS, horizon))
arpdau    = np.random.gamma(shape, scale, size=(N_SIMS, horizon))

ltv = (retention * arpdau).sum(axis=1)          # N_SIMS lifetime values, zero loops
p10, p50, p90 = np.percentile(ltv, [10, 50, 90])

There is no for over simulations and no for over days. The element-wise multiply happens across the entire (simulations × horizon) array in one vectorized operation; the .sum(axis=1) collapses each simulation’s horizon into a single LTV, leaving a thousand of them; the percentiles read the confidence band straight off that. The thousand simulations don’t run in sequence — they run as one block of arithmetic the CPU is built to chew through. Per cohort, it drops from “blows the timeout” to comfortably sub-second, which is what lets thousands of cohorts fan out across serverless workers and finish on schedule.

The performance isn’t a vanity metric, which is the part that justifies caring. Sub-second per cohort is what makes the what-if tool feel alive — a UA lead asks “what happens to portfolio ROAS if I move budget from this channel to that one?” and the answer comes back while they’re still looking at the screen, because re-running the simulation across the affected cohorts is fast enough to be interactive. Had I left the loops in, that feature couldn’t exist; you don’t build an interactive simulator on top of a computation that takes a minute. Speed changed what the system was for, not just how fast it did the same thing.

There’s a real cost to vectorizing, and I’d be lying to pretend otherwise: the code gets harder to read. A loop with named variables tells you what it’s doing at each step; a stack of array operations asks you to carry the shapes in your head — is this (sims, days) or (days, sims), what does axis=1 collapse here — and a transposed axis is a silent bug that produces plausible, wrong numbers rather than an error. I leaned on shape comments and a fast test that diffs the vectorized output against the dead-simple looped version on a tiny input, so I’d know immediately if a refactor quietly broke the math. The loop version earns its keep as the oracle even though it never runs in production.
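An oracle test of that kind can be very small. The sketch below is one possible shape for it, with hypothetical function names and made-up distribution parameters; the input is deliberately non-square so that a transposed axis shows up as a shape mismatch instead of a plausible number.

```python
import numpy as np

def ltv_vectorized(retention, arpdau):
    # retention, arpdau: (n_sims, horizon)
    return (retention * arpdau).sum(axis=1)      # -> (n_sims,)

def ltv_looped(retention, arpdau):
    # the dead-simple version: one simulation, one day at a time
    n_sims, horizon = retention.shape
    out = []
    for sim in range(n_sims):
        total = 0.0
        for day in range(horizon):
            total += retention[sim, day] * arpdau[sim, day]
        out.append(total)
    return np.array(out)

rng = np.random.default_rng(42)
retention = rng.beta(2.0, 5.0, size=(4, 6))      # tiny input: 4 sims, 6 days
arpdau = rng.gamma(2.0, 0.1, size=(4, 6))

fast = ltv_vectorized(retention, arpdau)
slow = ltv_looped(retention, arpdau)
np.testing.assert_allclose(fast, slow)
```

Both versions consume the same pre-drawn arrays, so the comparison checks the arithmetic and the axis handling without depending on random-number order.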

The rule of thumb I’ve leaned on ever since: if you’re writing a Python for loop over your data points, you’re usually leaving a one-to-two-orders-of-magnitude speedup on the table. The fix isn’t a faster language or a bigger machine — it’s expressing the computation as operations over whole arrays, so the work drops into compiled code. “Monte Carlo is too slow for production” almost always means “my Monte Carlo has a Python loop in it.” Take the loop out and the technique you wanted to use all along is suddenly affordable.¹

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Vectorization trades memory for time: a (sims × horizon) array per cohort is materialized all at once, and if you scaled simulations or horizon by 100× you’d hit memory limits before time ones. At this size it’s a non-issue and I provisioned the workers for the larger cohorts. But “vectorize everything” stops being free once the arrays get big enough to page, and then you’re back to batching, which is looping, but over big blocks instead of single elements.↩︎
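The batching described in that footnote might look like the sketch below: the loop runs over blocks of simulations, and each block is still fully vectorized. The function name, the batch size and the distribution parameters are all hypothetical.

```python
import numpy as np

def ltv_percentiles_batched(a, b, shape, scale, horizon, n_sims, batch_size, rng):
    """Simulate in blocks so only (batch_size, horizon) arrays exist at once."""
    ltv = np.empty(n_sims)
    for start in range(0, n_sims, batch_size):
        stop = min(start + batch_size, n_sims)
        size = (stop - start, horizon)
        retention = rng.beta(a, b, size=size)
        arpdau = rng.gamma(shape, scale, size=size)
        ltv[start:stop] = (retention * arpdau).sum(axis=1)
    return np.percentile(ltv, [10, 50, 90])

rng = np.random.default_rng(0)
p10, p50, p90 = ltv_percentiles_batched(
    a=2.0, b=5.0, shape=2.0, scale=0.1,
    horizon=365, n_sims=1000, batch_size=256, rng=rng,
)
```

Peak memory now scales with `batch_size × horizon` instead of `n_sims × horizon`, at the cost of a handful of Python-level iterations.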

Two LTV models that disagree, and the rule for which to believe

Umut Altun — Tue, 12 Mar 2024 00:00:00 GMT

I built two LTV models that routinely disagree with each other, and the most important code in the system is the dozen lines that decide which one to trust for a given cohort.

The reason there are two comes down to a fact about the input I couldn’t engineer away: cohorts arrive at wildly different levels of maturity, and a model that’s excellent for one is bad for the other. A cohort that installed this morning has almost no data — a day or two of retention, a trickle of revenue. A cohort from two months ago has a rich, fully-shaped curve. You’re asked to predict twelve-month LTV for both, from the same pipeline, and the honest truth is that no single model is good across that whole range.

The first model — call it the AR model — works the way the retention-curve approach suggests: fit retention, fit ARPDAU, integrate their product out to the horizon. When a cohort has enough data to actually fit those curves, this is the one you want — it’s mechanically faithful to how revenue accrues, and it’s precise. Starve it of data, though, and it’s worse than useless: you cannot fit a power law to three noisy points, and if you try, it’ll hand you a confident curve fit to nothing.

The second — the coefficient model — never fits a curve. It learns historical coefficients that map “revenue accumulated through day k” onto “revenue at the horizon,” normalized across cohorts to strip out scale. It’s cruder; it can’t capture a specific cohort’s curvature. But it degrades gracefully, because it needs almost nothing to produce a sane answer. On a day-old cohort it’s the robust choice precisely because it doesn’t try to be clever.

So they trade off exactly against each other along the maturity axis: AR is precise-but-fragile, coefficient is robust-but-blunt. The whole problem reduces to one question asked per cohort — which regime is this cohort in? — and the answer is just how much data it’s actually given me:

def choose_model(cohort):
    if cohort.avg_size_last_7d > VOLUME_CUTOFF:
        return AR_MODEL      # rich data: fit the curve, take the precision
    return COEF_MODEL        # sparse data: normalized coefficients, take the robustness

def predict(cohort):
    primary = choose_model(cohort)
    try:
        return primary.fit_predict(cohort)
    except FitError:
        return fallback_of(primary).fit_predict(cohort)   # never crash; degrade

High-volume cohorts get routed to precision; sparse ones get routed to robustness. And there’s a second safety net under the first: if the chosen model fails anyway — the AR fit doesn’t converge on a cohort that looked rich but was pathological — it falls back to the other one rather than throwing. The system’s contract is that it always returns a sane prediction, so “the preferred model didn’t work” can never become “the user gets an error.” Degrade, don’t die.

Which is why I think “what’s the best model?” is usually the wrong question — asking it keeps you tuning one model forever, trying to make it cover a range it structurally can’t. The better question is “what does this input let me get away with?” Here the input variable is data volume, and the answer is two specialists plus a cheap, legible router — and the router, the least glamorous part, is what makes the whole thing work in production.

One real weakness, and a sharp reader will spot it straight away: a hard threshold means two nearly-identical cohorts that land on opposite sides of the cutoff get handled by different models and can get visibly different predictions — a discontinuity right where the two models are least sure of each other. The better design is a blend: weight the two predictions by maturity and slide smoothly from coefficient to AR as a cohort ripens, instead of flipping. I know that, and I shipped the hard switch first because it was simple, debuggable, and good enough to be useful — the discontinuity sits in a region where both models are uncertain anyway, so it’s papered over by the confidence intervals. The smooth blend is the obvious next iteration. It’s just never quite outranked the things that were more broken.¹
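For what the smooth version could look like, here is a minimal sketch that was not part of the shipped system. It assumes a linear ramp on the same volume signal the router uses; `ar_weight`, `blend`, the band edges and the hard cutoff of 500 are all made up for illustration.

```python
def ar_weight(avg_size, lo, hi):
    """Weight on the AR model: 0 below `lo`, 1 above `hi`, linear in between."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (avg_size - lo) / (hi - lo)))

def blend(ar_pred, coef_pred, avg_size, lo=400.0, hi=600.0):
    w = ar_weight(avg_size, lo, hi)
    return w * ar_pred + (1.0 - w) * coef_pred

# two nearly identical cohorts straddling a hypothetical hard cutoff of 500
just_below = blend(ar_pred=1.30, coef_pred=1.10, avg_size=499)
just_above = blend(ar_pred=1.30, coef_pred=1.10, avg_size=501)
print(round(just_below, 3), round(just_above, 3))
```

Outside the band this is still pure selection, so a sparse cohort never gets contaminated by an AR fit it cannot support; only the transition zone is mixed.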

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

This isn’t a classic ensemble, and the difference matters. An ensemble averages models to reduce variance, assuming they’re all roughly valid. Here the models aren’t both valid at once — one is appropriate and the other is actively wrong for the cohort’s data regime. Averaging a good prediction with a known-bad one just contaminates it. Selection, not averaging, is the right tool when your models have disjoint domains of competence.↩︎

Old cohorts are lying to you

Umut Altun — Tue, 05 Dec 2023 00:00:00 GMT

For a while I had a bug I couldn’t explain: the more historical data I fed the model, the worse its predictions got. I’d widen the training window to include more past cohorts — more data, more signal, more better, surely — and the held-out error would creep up. I spent a day assuming I’d broken the aggregation somewhere before I accepted the data was telling me something true and uncomfortable.

To fit the curves, you pool past cohorts together — the retention shapes and revenue trajectories of cohorts that have already had time to mature, used as the template for cohorts that are still young. The instinct, drilled into all of us, is that more history makes that template more stable. More samples, tighter estimates. It’s true when the thing you’re measuring sits still.

A mobile game does not sit still. The developers are shipping into it every week — new monetization events, a balance patch, a LiveOps calendar, a reworked onboarding. A cohort from three months ago doesn’t describe the game — it describes the game as it was three months ago, which has since been patched out of existence. Pooling that cohort in with equal weight doesn’t add signal. It drags the model toward a reality that no longer exists, and the further back you reach, the more confidently wrong the template becomes. The data wasn’t lying about its own moment. It was lying about now, because I was treating a snapshot of the past as evidence about the present.

The assumption I’d never examined was stationarity — that the process generating the data is the same today as last quarter. For a live game it’s just false, and once you name it the fix is almost annoyingly simple: weight cohorts by how recent they are, so the present dominates and the past fades without being thrown away.

lag = (cohort_date - latest_date).days       # ≤ 0: how many days into the past
weight = np.exp(lag / CUTOFF_DAYS)           # recent ≈ 1, older decays toward 0
# e.g. with CUTOFF_DAYS = 7, a 7-day-old cohort carries e ≈ 2.7x the weight of a 14-day-old one
fit_weighted(curves, weights=weight)

Exponential decay on age. Recent cohorts count for nearly their full weight; older ones fade smoothly toward zero. The model now learns mostly from the game as it is this week, with older cohorts contributing a faint, fast-dimming echo. Widening the window stopped hurting, because reaching further back now adds vanishing weight instead of equal weight, and the predictions snapped back in line with reality.

The decay rate is the one real knob, and it’s a genuine tradeoff I tuned by hand rather than derived. Decay too aggressively and you’re effectively fitting on the last few days only: you overreact to every weekend blip, and a single noisy cohort can yank the whole curve. Decay too gently and you’re back to letting stale cohorts vote. I picked a rate that felt right against the portfolio’s actual patch cadence, fast enough to track real changes and slow enough to ignore noise, and adjusted it until the curves behaved. A more principled version would detect when a game actually changed and weight around those breakpoints, instead of assuming a smooth global forgetting rate for every title. I didn’t build that; smooth decay was robust, cheap, and good enough, and “good enough and predictable” kept winning over “clever and fragile” on this project.
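Two numbers help when reasoning about that knob: the half-life a given cutoff implies, and how many cohorts are effectively voting. The sketch below computes both; it uses Kish's effective sample size, which is a choice made here for illustration, and the cutoff values and 90-day window are hypothetical.

```python
import numpy as np

def decay_weights(ages_days, cutoff_days):
    """Exponential weight by cohort age in days (age 0 = newest)."""
    return np.exp(-np.asarray(ages_days, dtype=float) / cutoff_days)

def half_life(cutoff_days):
    """Age at which a cohort's weight has dropped to one half."""
    return cutoff_days * np.log(2.0)

def effective_cohorts(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w**2).sum()

ages = np.arange(0, 90)                  # one cohort per day, 90 days of history
for cutoff in (3, 7, 30):
    w = decay_weights(ages, cutoff)
    print(cutoff, round(half_life(cutoff), 1), round(effective_cohorts(w), 1))
```

A small effective count is the "single noisy cohort can yank the whole curve" regime; a count close to the window length means stale cohorts are voting almost as loudly as fresh ones.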

Now, whenever a model gets worse as I feed it more history, the first thing I check is whether I’ve assumed stationarity in a world that isn’t. We’re trained that more data is always better. In a non-stationary process it isn’t — old data isn’t just less informative, it’s actively misleading, faithful evidence about a regime that’s gone. Sometimes the most useful thing you can do for a model is teach it to forget on purpose.¹

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Why decay old cohorts toward zero instead of hard-cutting them at some age: the fade is graceful where a cutoff is a cliff. A hard window means a cohort is fully trusted one day and fully discarded the next, which makes the model lurch every time an influential cohort ages out. Exponential weighting has no edge to lurch at — yesterday’s most important cohort is today’s slightly-less-important one, and nothing ever falls off a shelf.↩︎

Why my LTV confidence intervals sample from two different distributions

Umut Altun — Tue, 19 Sep 2023 00:00:00 GMT

A point estimate is the wrong output. When a UA team asks what a cohort’s twelve-month LTV will be, “1.2× ROAS” is not an answer they can use — because it matters enormously whether that means “probably 1.2×, I’d bet on it” or “somewhere between 0.7× and 1.8×, genuinely no idea.” The first is a green light. The second is a coin flip wearing a green light’s clothes. The single number hides the one thing the decision actually turns on.

So the real output is a distribution, and the honest way to get one is Monte Carlo: sample the model’s parameters many times, compute the LTV under each draw, and read the spread off the resulting pile of outcomes. Simple enough in outline. The interesting part is the question that simulation forces you to answer and a point estimate lets you dodge — what do you sample from?

Because there are two quantities feeding an LTV estimate, and they are different kinds of thing, and pretending they’re the same is how you get confidence intervals that are technically computed and quietly meaningless.

The first is retention, the fraction of the cohort still around on day d. That’s a rate: out of N users, some stuck and some didn’t. The natural distribution for “a rate, given counts” is the Beta, which is the conjugate prior for exactly that binomial process. And it has a property I got for free and came to love: it encodes sample size automatically.

# retention is a rate. Beta is the conjugate prior for "k stuck out of N".
# small cohort -> few counts -> wide Beta -> honestly uncertain
# big cohort   -> many counts -> tight Beta -> confidently precise
retention = np.random.beta(retained + EPS, churned + 1, size=(N_SIMS, horizon))

A cohort of fifty users and a cohort of fifty thousand might show the same observed retention, but the Beta drawn from fifty is wide and the one drawn from fifty thousand is narrow — so the uncertainty in the final LTV widens for small cohorts on its own, without a single rule telling it to. The distribution does the calibration. That’s the whole appeal: I’m not bolting uncertainty on afterward, I’m sampling from a thing whose shape already knows how sure it should be.
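That self-calibration is easy to check numerically. The sketch below compares two hypothetical cohorts with the same observed 40% retention; it assumes a uniform Beta(1, 1) prior (the `+ 1` on both counts), which is a simplification and not the exact parameterization in the snippet above.

```python
import numpy as np

rng = np.random.default_rng(1)

def retention_interval(retained, churned, n_draws=100_000):
    """P10 and P90 of the retention rate under a Beta posterior."""
    draws = rng.beta(retained + 1, churned + 1, size=n_draws)
    lo, hi = np.percentile(draws, [10, 90])
    return lo, hi

small = retention_interval(retained=20, churned=30)            # 50 users
large = retention_interval(retained=20_000, churned=30_000)    # 50,000 users

small_width = small[1] - small[0]
large_width = large[1] - large[0]
print(f"50 users:     {small[0]:.3f} to {small[1]:.3f}")
print(f"50,000 users: {large[0]:.3f} to {large[1]:.3f}")
```

Both intervals sit around 0.40, but the small cohort's band is far wider, and nothing in the code had to be told that fifty users deserve less confidence.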

The second quantity is ARPDAU (average revenue per daily active user), and it is emphatically not a rate. It’s money: strictly positive, and savagely fat-tailed, because a handful of whales generate most of it while everyone else spends nothing. Sampling that from a Normal, the reflexive choice, is wrong twice over: a Normal will cheerfully hand you negative revenue, which is nonsense, and its symmetric, thin tails will systematically understate the whales, who are the entire game. So ARPDAU comes from a Gamma: positive support, a right-skewed tail heavier than the Normal’s, parameterized from the revenue and the user count.

# ARPDAU is positive and fat-tailed (whales). Gamma fits; a Normal would
# emit negative revenue and amputate the tail that matters most.
arpdau = np.random.gamma(shape=rev_sum + EPS, scale=1/(users + 1), size=(N_SIMS, horizon))

revenue = retention * arpdau                         # per-user revenue, N_SIMS draws
p10, p50, p90 = np.percentile(revenue.sum(axis=1), [10, 50, 90])

Multiply the two sampled arrays, sum over the horizon, take percentiles, and you have a P10/P50/P90 that means something — the dashboards show that band, not a lone number, and the UA team reads “this cohort is probably fine but the floor is low” straight off the width.

So: the distribution you sample from is a modeling statement, not a default. Reaching for a Normal because it’s the one everybody remembers is a decision about the quantity — you’re asserting it’s symmetric, unbounded, thin-tailed — and for a retention rate or a revenue figure that assertion is just false. Choosing Beta for the rate and Gamma for the money isn’t statistical fussiness; it’s matching the shape of the randomness to the shape of the thing, and the confidence interval is only honest if you do. A wrong distribution won’t throw an error — it just hands you a number that looks like an uncertainty and isn’t.

I’ll be honest about the limit: “fat-tailed, so Gamma” is a defensible call, not a proven one. Whale spending might be better described by a lognormal or something heavier still, and I didn’t run that down as rigorously as I’d like — Gamma was principled enough, cheap to sample in bulk, and close enough that the intervals were useful. If I were hardening this I’d actually fit the tail and check. But the bones are right, and they were right because I started from “what kind of quantity is this?” instead of from whatever np.random function came to mind first.¹
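A first pass at that tail check needs very little code. The sketch below is synthetic and hypothetical throughout: it matches a Gamma and a lognormal on mean and variance (made-up moments with a coefficient of variation of 2) and compares their upper quantiles. It is not a fit to real revenue data.

```python
import numpy as np

rng = np.random.default_rng(7)
mean, var = 1.0, 4.0                 # made-up per-user revenue moments
n = 1_000_000

# Gamma with the target mean and variance
gamma_draws = rng.gamma(shape=mean**2 / var, scale=var / mean, size=n)

# Lognormal with the same mean and variance
sigma2 = np.log(1.0 + var / mean**2)
mu = np.log(mean) - sigma2 / 2.0
lognormal_draws = rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=n)

gamma_q = np.percentile(gamma_draws, [99.0, 99.9, 99.99])
lognormal_q = np.percentile(lognormal_draws, [99.0, 99.9, 99.99])
print("gamma    ", np.round(gamma_q, 1))
print("lognormal", np.round(lognormal_q, 1))
```

The two agree on mean and variance by construction, so any gap lives in the quantiles, and in this toy setup it is the most extreme ones where they part ways. On real data the equivalent step would be to compare those quantiles against the empirical whale spend.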

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

The Beta-binomial conjugacy isn’t just aesthetic. It means the posterior over the retention rate is a Beta in closed form, so I can sample it directly without MCMC or any fitting step — which is what keeps the whole Monte Carlo path cheap enough to run over thousands of cohorts on a schedule. The right distributional choice paid for itself in compute, not just in correctness.↩︎

Every retention curve has a kink, and one power law can’t fit it

Umut Altun — Tue, 20 Jun 2023 00:00:00 GMT

I’d settled on a two-parameter power law for retention, and for a while I was happy with it. Then I started overlaying the fitted curve on the actual data, cohort by cohort, and noticed it was good everywhere except two places: the first week, and the long tail. Which is an awkward way of saying it was good nowhere that mattered, because those two regions are exactly what a twelve-month LTV extrapolation hangs on.

The fit was splitting the difference. It ran a little high through the brutal early drop and a little low through the slow tail, landing a compromise curve that was wrong in both directions at once and right mainly in the middle, where I needed it least.

Once I stopped staring at the residuals and started thinking about the players, the reason was obvious. A retention curve isn’t one process, it’s two stuck together:

Days 1–8: the funnel flushing out. A lot of the install cohort was never going to stick — mis-targeted ads, curiosity installs, people who bounced off the tutorial. They churn fast and steep. This early region is governed by acquisition quality, not by the game.
Day 8 onward: the real curve. What’s left is the engaged core, and they decay slowly along a long, fat tail. This region is governed by the game, and it’s where almost all the lifetime value accumulates.

Those are two different decay regimes with two different exponents, and forcing one power law across both makes it average them — fitting through the kink between them instead of respecting it. The early steepness drags the tail estimate down; the tail’s flatness drags the early estimate up. You can’t win with one curve because there isn’t one curve.

So I fit piecewise — a separate power law in each regime — and made the number of pieces adapt to how much data the cohort actually has, because the failure mode at the other end is just as real: split a sparse cohort into three segments and you’re fitting noise three times instead of once.

n = len(observed_days)
if n < 21:
    segments = [(1, horizon)]                     # sparse: one fit, don't overfit
elif n < 45:
    segments = [(1, 8), (8, horizon)]             # split at the early/late knee
else:
    segments = [(1, 8), (8, 31), (31, horizon)]   # early drop, mid, long tail

# fit a power law within each segment; if a segment's fit fails to
# converge, fall back to a single full-span fit rather than emit garbage

A young cohort gets one honest fit. A mature one gets up to three, carving the early flush, the mid settling, and the long tail into their own curves. The kink at day 8 stops being an error the model fights and becomes a boundary the model respects.
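Put together, the adaptive fit might look like the sketch below. It is a simplified stand-in: it fits each segment by least squares in log-log space with `np.polyfit`, not with the nonlinear fit used elsewhere in these posts, and treats "too few usable points" as the failure that triggers the fallback. The function names and the synthetic two-regime cohort are hypothetical.

```python
import numpy as np

def choose_segments(n_observed, horizon):
    if n_observed < 21:
        return [(1, horizon)]
    if n_observed < 45:
        return [(1, 8), (8, horizon)]
    return [(1, 8), (8, 31), (31, horizon)]

def fit_power_law(days, retention):
    """Least squares in log-log space; returns (a, b) for a * day**b."""
    if len(days) < 2 or np.any(retention <= 0):
        raise ValueError("not enough usable points")
    b, log_a = np.polyfit(np.log(days), np.log(retention), 1)
    return np.exp(log_a), b

def fit_piecewise(days, retention, horizon):
    try:
        fits = []
        for start, end in choose_segments(len(days), horizon):
            mask = (days >= start) & (days <= end)
            fits.append(((start, end), fit_power_law(days[mask], retention[mask])))
        return fits
    except ValueError:
        # any segment failing -> one full-span fit, never a partial result
        return [((1, horizon), fit_power_law(days, retention))]

# synthetic cohort: steep flush to day 8, slow tail afterwards
days = np.arange(1, 61, dtype=float)
retention = np.where(days <= 8, 0.6 * days**-1.0, 0.075 * (days / 8.0) ** -0.3)

fits = fit_piecewise(days, retention, horizon=365)
single_a, single_b = fit_power_law(days, retention)
for span, (a, b) in fits:
    print(span, round(b, 2))
print("single fit exponent:", round(single_b, 2))
```

On this synthetic cohort the segments recover the two exponents that generated the data, while the single fit lands on a compromise exponent between them that matches neither regime.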

Two things I’d flag as judgment calls rather than truths. The knee at day 8 I found by eye — overlaying fits across many games and noticing the early regime consistently gave out around there. It’s not sacred; a more principled system would detect the knee per cohort instead of hard-coding it, and a game with an unusual onboarding would want a different boundary. I used a fixed split because it was robust across the portfolio and I could reason about it, and a wrong-but-predictable boundary beats a clever knee-detector that occasionally finds a knee in noise. The other call is the fallback: a segment fit that doesn’t converge doesn’t get to emit nonsense, it collapses back to the single full-span fit. Degrade to the simpler model, never to a confident wrong one.

I’ve since seen the same thing well outside retention curves. When a model is wrong in a structured, repeatable way — high here, low there, every time — that’s not noise to tune away. It’s the data telling you your functional form is too simple for the process. The fix usually isn’t a fancier model. It’s looking at what’s actually generating the data, noticing it’s two things wearing a trenchcoat, and giving each one its own simple model.¹

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

The adaptive piece count matters more than the exact thresholds. The principle — more data earns more flexibility, less data forces more humility — is what generalizes; the specific cutoffs are just where it landed for this portfolio after staring at a lot of fits. If you lifted this wholesale onto a different retention shape without re-checking the knee, you’d deserve what you got.↩︎

I trained a neural net to predict retention, then threw it away

Umut Altun — Tue, 14 Mar 2023 00:00:00 GMT

At one point I had a neural network that predicted retention curves measurably better than my two-parameter power law. I deleted it, kept the power law, and I’d make the same call again.

Here’s the setup. A mobile-gaming UA team has to decide, on day one of an install cohort, whether that cohort will be worth what they paid for it by month twelve. They cannot wait ninety days for the cohort to ripen — the budget decision is today, and the channel either gets more money tomorrow or it doesn’t. So you take the few days of retention you’ve actually observed, fit a curve, and extrapolate it out to the horizon. The model’s entire job is to extrapolate a tail it cannot see yet, from very little data.

I tried a neural net at this because of course I did — more capacity, more features, and on held-out cohorts it was, genuinely, a bit more accurate. That’s seductive. Then I tried to put it into production and three things killed it.

The first was data. Per game you don’t have millions of cohorts, you have a modest number, and a hungry model overfits them happily and confidently. The second was that it was off-distribution on exactly the cases I cared about most — a brand-new game, a country with no history — and it would produce something self-assured and wrong precisely when there was no ground truth to catch it. The third killed it for good: I couldn’t explain it. The UA lead would point at a cohort and ask “why is this one predicted low?” and the honest answer was “the network said so,” which is not an answer anyone can act on.

The power law has none of that capacity, and that turns out to be the feature.

from scipy.optimize import curve_fit

def power_law(day, a, b):
    return a * day ** b          # b < 0 → decay; b is the whole story

# days: observed day numbers, starting at 1 (day 0 would blow up for b < 0)
# observed_retention: fraction of the cohort still active on each of those days
(a, b), _ = curve_fit(power_law, days, observed_retention, maxfev=1000)
# b ≈ -0.5  → a slow-burn game that holds its tail
# b ≈ -1.2  → a fast-drop game
# an analyst can read the shape of the game straight off the fit

Two parameters, and b is the entire personality of the curve. A gentle exponent is a slow-burn title that holds its tail; a steep one is a game that bleeds users fast. a is the level, b is the shape, and both mean something a human can name. When the UA lead asks why a cohort is predicted low, the answer is “its retention exponent is steep — look, it’s dropping faster than your portfolio average,” and you can point at it on a chart. The model is arguable. You can disagree with it, which means you can trust it.
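To make the fit-then-extrapolate step concrete, here is a minimal runnable sketch. The cohort is synthetic and invented for illustration (generated from a known curve plus a little noise), and the helper name `fit_and_extrapolate`, the starting guess `p0`, and the day-365 horizon are my assumptions for the example, not the production code.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(day, a, b):
    """Retention on `day` (1-based): a is the day-1 level, b the decay exponent."""
    return a * np.power(day, b)

def fit_and_extrapolate(days, retention, horizon_day):
    """Fit the two-parameter curve to the observed days, then project to horizon_day."""
    days = np.asarray(days, dtype=float)
    retention = np.asarray(retention, dtype=float)
    # Assumed starting guess: level = first observation, exponent = -1 (a decaying curve).
    (a, b), _ = curve_fit(power_law, days, retention,
                          p0=(retention[0], -1.0), maxfev=1000)
    return a, b, power_law(horizon_day, a, b)

# Synthetic cohort (illustrative only): a known power law with 1% multiplicative noise.
rng = np.random.default_rng(0)
observed_days = np.arange(1, 8)                      # days 1..7
true_a, true_b = 0.40, -0.60
noise = 1 + rng.normal(0, 0.01, observed_days.size)
observed = power_law(observed_days, true_a, true_b) * noise

a, b, day_365 = fit_and_extrapolate(observed_days, observed, horizon_day=365)
print(f"a={a:.3f}  b={b:.3f}  projected day-365 retention={day_365:.4f}")
```

Seven observed days go in, two named numbers and a horizon projection come out, and each of the three can be put on a chart next to the portfolio average.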

(Why a power law and not an exponential, since both decay: retention curves are power-law shaped, with a brutal early drop and then a long fat tail of the most engaged users. An exponential decays too fast and throws that tail away, and the tail is exactly where the lifetime value lives. Fitting the wrong functional form is a mistake that no amount of hyperparameter tuning will fix.)
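A small sketch of what "throws the tail away" looks like in numbers. The data here are synthetic and power-law shaped by construction, so this does not show that real retention follows a power law; it only shows how far an exponential fitted to the same 30 days undershoots at a long horizon. The `exponential` function, the starting guesses, and the day-365 horizon are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(day, a, b):
    return a * np.power(day, b)

def exponential(day, a, k):
    return a * np.exp(-k * day)

# Synthetic cohort (illustrative only): noise-free and power-law shaped by construction.
days = np.arange(1, 31, dtype=float)                 # days 1..30
retention = 0.40 * days ** -0.60

# Fit both functional forms to the same 30 observed days.
(pa, pb), _ = curve_fit(power_law, days, retention, p0=(0.5, -1.0), maxfev=5000)
(ea, ek), _ = curve_fit(exponential, days, retention, p0=(0.5, 0.1), maxfev=5000)

horizon = 365.0
true_tail = 0.40 * horizon ** -0.60
power_tail = power_law(horizon, pa, pb)
exp_tail = exponential(horizon, ea, ek)

print(f"true day-365 retention:    {true_tail:.5f}")
print(f"power-law extrapolation:   {power_tail:.5f}")
print(f"exponential extrapolation: {exp_tail:.2e}")
```

Both curves can look passable over the observed window; the gap opens up at the horizon, which is the part the budget decision depends on.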

In production, feeding a human decision, an interpretable model that’s accurate enough beats an opaque one that’s a bit more accurate. Almost every time. Because the prediction isn’t the product — the decision is. A UA lead who can’t interrogate a number won’t bet a budget on it, and a prediction nobody bets on doesn’t change anything, no matter how good its backtest looked. Accuracy on held-out data is one axis. “Will a smart, skeptical operator actually act on this” is a different axis, and it’s usually the one that decides whether a model earns its keep or quietly gets ignored.

The neural net wasn’t wrong. It was wrong for this. Hand me orders of magnitude more data per game and a context where nobody needs the model to explain itself, and I’d reach for it without hesitation. But “won on the backtest” answered a question nobody on the UA team was actually asking. The power law is still in production years later. The net is in a git history somewhere, a little more accurate and completely unused.¹

From work on a cohort-LTV system for a mobile-gaming portfolio. Specifics, parameters, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

curve_fit does nonlinear least squares under the hood, and it’s reliable here precisely because there are only two parameters and the function is well-behaved, so there’s almost nothing for the optimizer to get lost in. That stability is itself an argument for simple functional forms: a fit that converges the same way every time is one you can run unattended across thousands of cohorts without babysitting it.

Why I’m starting this blog

Umut Altun — Tue, 07 Feb 2023 00:00:00 GMT

I’ve been a senior data scientist for a few years now and I’ve avoided writing about the work publicly the entire time. That ends here. The reasons I avoided it were the usual ones — too busy, too proprietary, not enough new to say — and they were all weak. So this is the first post, and a brief note on what’s coming.

What I’ll write about

The work I do clusters around three areas: LTV and subscription prediction, marketing analytics (MMM, bid optimization, attribution), and agentic AI / LLM systems for data work. Posts will mostly be about one of those, occasionally about where they overlap.

Some things I’d actually like to write:

How a cohort-based LTV system handles the gap between “we have 30 days of data” and “we need a prediction for month 12.”
Why power-law fits keep beating neural nets on retention curves, and where they finally break down.
The honest version of when Marketing Mix Modeling is useful and when it’s hand-waving.
Building a Text2SQL agent that’s actually used in production — what works, what’s still broken.
The places where agentic AI is genuinely changing the shape of data work, and the places where it’s still cosplay.

What I won’t write

Generic “here’s how to do A/B testing” posts — there are already a thousand of those, written by people who explain it better than I will.
Anything that would expose proprietary data, architectures, or numbers from work I’ve done for employers or consulting clients. The systems I write about are real; the specifics will be abstracted enough to be safe.
Listicles, trend-chasing, AI hype. The half-life on that is about three weeks and I’d rather write things that hold up.

Why a writing habit matters more later in your career

This is the part I think about most. When I was junior, the case for blogging was straightforward — it builds your reputation, it’s a forcing function for learning, it gets you noticed. Fine.

But for a senior practitioner the case gets stronger, not weaker:

The thinking is the bottleneck, not the doing. When you stop being graded on whether your code runs and start being graded on whether your choices were the right ones, writing them down is the only way to check yourself.
You accumulate strong opinions and never sanity-check them. Twenty colleagues nodding along in a meeting is not the same as a stranger telling you you’re wrong.
Your defaults become invisible to you. When you’ve made the same kind of decision a hundred times, you stop noticing you’re making it. Writing forces you to surface the defaults so you can check whether they still apply.

That’s the bet. We’ll see if I cash it in.

If you have thoughts on any of this — or if there’s something specific from the three areas above you’d want me to write about first — email me.