One model per country, and the tax I paid for it

marketing analytics
MMM
production ML
A single global marketing mix model averages the US and Brazil into a country that doesn’t exist. Splitting it per country fixes that and immediately starves each model of data. On choosing which problem you’d rather have.
Author

Umut Altun

Published

April 8, 2025

My first marketing mix model was one model for the whole portfolio, all countries pooled. It fit fine, the diagnostics looked reasonable, and it was quietly describing a country that doesn’t exist.

Because a single global MMM estimates one set of channel effects — one TikTok coefficient, one saturation curve per channel — averaged across every market at once. And the markets are nothing alike. TikTok might be the dominant channel in one country and an afterthought in another; a dollar buys wildly different things in the US versus Brazil versus Indonesia; the saturation points differ by an order of magnitude because the addressable audiences do. Pool all of that and the model hands you the average channel effect across markets that share almost nothing — a blended number that’s correct for nowhere. Worse, it’ll confidently recommend shifting budget toward a channel that’s saturated in your biggest market just because it’s still cheap in a small one, because it can’t see the markets separately to know the difference.
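A toy sketch makes the averaging concrete. It assumes a plain linear response with no adstock or saturation, which is far simpler than a real MMM, and the spend and response numbers are invented purely for illustration. The names `ols_slope`, `market_a`, and `market_b` are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of a simple least-squares line fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented, noise-free data: the same spend levels in two markets
# whose true response per unit of spend differs by a factor of five.
spend = [1.0, 2.0, 3.0, 4.0]
market_a = [5.0 * s for s in spend]  # true effect: 5 per unit of spend
market_b = [1.0 * s for s in spend]  # true effect: 1 per unit of spend

slope_a = ols_slope(spend, market_a)
slope_b = ols_slope(spend, market_b)

# Pooling stacks both markets and fits a single coefficient.
slope_pooled = ols_slope(spend + spend, market_a + market_b)

print(slope_a, slope_b, slope_pooled)
```

With equal spend in both markets the pooled slope lands halfway between the two true effects, so it matches neither market. In real data the blend would also be weighted by how much each market spends, which is one way a large market and a small one can pull the shared coefficient in unhelpful directions.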

So I split it: one MMM per country. Each market gets its own model, its own coefficients, its own adstock and saturation curves. Now the US model speaks for the US, and when it says TikTok is saturating, it means in the US, where you actually spend the money. The recommendations finally apply to a real place.

And the instant I did that, every model was starving.

This is the tax, and it’s unavoidable, so it’s worth stating plainly. MMM is data-hungry to begin with: you’re locating several nonlinear curves at once, and that needs a lot of history. Split your data by country and you’ve sliced that same history into N thinner piles, each feeding a model that’s just as hungry as the global one was. Your biggest markets have enough to fit on; the smaller countries in your long tail have a handful of noisy days each, nowhere near enough to identify a curve. You traded one model that was confidently wrong for thirty models, half of which are individually too data-starved to trust. That’s not obviously a good trade, and pretending it is would be the easy lie.

What makes it a good trade is refusing to treat all thirty the same. Two things carry it.

First, a data threshold: a country only gets its own model if it clears a minimum volume. Below that line you don’t fit a desperate model on noise and present its output with a straight face — you fall back. Pool it into a regional or global model, or carry a portfolio-level prior. The small market gets a more pooled, more conservative estimate, the big market gets its own specific one, and you’re matching model granularity to the data each market can actually support instead of forcing one resolution on all of them.

# granularity follows the data, not the org chart
if country_volume(c) >= THRESHOLD:
    fit_country_model(c)          # enough signal to stand alone
else:
    fall_back_to_pooled(c)        # too thin: borrow strength, don't fake it
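The same routing can be written as a small runnable sketch. Everything specific here is an assumption: the cutoff value, the unit of "volume", and the per-country numbers are placeholders, and `assign_granularity` is a hypothetical helper rather than a function from the system described.

```python
from typing import Dict

# Hypothetical cutoff. The real threshold and what "volume" measures
# (rows, spend, installs) are not given in the text.
THRESHOLD = 1_000


def assign_granularity(volumes: Dict[str, float],
                       threshold: float = THRESHOLD) -> Dict[str, str]:
    """Label each country 'country' (own model) or 'pooled' (fallback)."""
    return {
        country: "country" if volume >= threshold else "pooled"
        for country, volume in volumes.items()
    }


# Made-up volumes; "XX" stands in for a long-tail market.
volumes = {"US": 50_000, "BR": 8_000, "ID": 1_000, "XX": 999}

plan = assign_granularity(volumes)
pooled_group = sorted(c for c, level in plan.items() if level == "pooled")

print(plan)
print(pooled_group)
```

Keeping the routing decision separate from the fitting step means the plan can be inspected and logged before any model runs. The made-up "ID" and "XX" rows differ by a single unit of volume and still land on opposite sides of the line.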

Second, and this is the bigger idea I only half-appreciated at the time: the right answer isn’t fully pooled or fully split — it’s partial pooling, and per-country-with-a-threshold is the poor man’s version of it. A proper hierarchical Bayesian model would let countries share a common prior and pull each country’s estimate toward the global mean in proportion to how little data it has — big markets dominated by their own data, small markets gracefully shrunk toward the portfolio average, all on one continuous dial instead of my hard in-or-out cutoff. I implemented the discrete version (own model above the line, pooled below) because it was simpler to build and operate and reason about, and it captures most of the benefit. But the hard threshold is a step function approximating a smooth one, with all the usual ugliness at the boundary — two nearly-identical small countries landing on opposite sides of the line and getting visibly different treatment. The hierarchical model is the right tool and it’s the upgrade I’d prioritize.
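To see the step function next to the smooth one, here is a minimal sketch of the shrinkage rule that falls out of the simplest textbook case: a normal model with known variances, where the weight on a country's own estimate is n / (n + k) and k is the ratio of within-country noise variance to between-country variance. That setup is an assumption made for illustration, not the hierarchical model the post has in mind, and every number and function name below is hypothetical.

```python
def shrinkage_weight(n_obs: float, k: float) -> float:
    """Weight on the country's own estimate: n / (n + k)."""
    return n_obs / (n_obs + k)


def partially_pooled(local: float, global_mean: float,
                     n_obs: float, k: float) -> float:
    """Blend the local estimate with the global mean by data volume."""
    w = shrinkage_weight(n_obs, k)
    return w * local + (1.0 - w) * global_mean


def hard_threshold(local: float, global_mean: float,
                   n_obs: float, threshold: float) -> float:
    """The discrete version: all local above the line, all global below."""
    return local if n_obs >= threshold else global_mean


# Invented numbers: two near-identical small countries, 99 vs 101 observations.
# K and THRESHOLD share a value here only to keep the comparison easy to read.
LOCAL, GLOBAL_MEAN, K, THRESHOLD = 4.0, 2.0, 100.0, 100.0

hard_gap = (hard_threshold(LOCAL, GLOBAL_MEAN, 101, THRESHOLD)
            - hard_threshold(LOCAL, GLOBAL_MEAN, 99, THRESHOLD))
smooth_gap = (partially_pooled(LOCAL, GLOBAL_MEAN, 101, K)
              - partially_pooled(LOCAL, GLOBAL_MEAN, 99, K))

print(hard_gap, smooth_gap)
```

Under these assumed numbers the hard cutoff moves the estimate by the full distance between local and global for a two-observation difference, while the smooth rule barely moves it. A full hierarchical Bayesian model would estimate the equivalent of `k` from the data rather than taking it as an input.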

The way I think about it now: aggregation level is a bias-variance knob, and the extremes are almost never right. Fully pooled is maximum bias: one answer for places that have nothing in common. Fully split is maximum variance: every segment fits its own noise. The interesting work is always in the middle: how much should this segment speak for itself, and how much should it borrow from its neighbours, given how much data it actually has. Per-country MMM was my first real encounter with that question. It wasn’t the last. It’s the same question as cohort sizing, as geographic A/B analysis, as basically every problem where you have to decide how finely to slice before the slices turn to noise. [1]


From work on a marketing-analytics system for a mobile-gaming portfolio. Markets, thresholds, and numbers are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

  1. The reason I’d push hard for the hierarchical version next: it makes the borrowing automatic and proportional instead of manual and binary. With a hard threshold I’m implicitly deciding how much a sub-threshold country should trust the global mean (answer: entirely) and how much a supra-threshold one should (answer: not at all), and both of those are wrong: a medium country should borrow somewhat. Partial pooling derives that “somewhat” from the data instead of making me pick a cliff, which is exactly the kind of judgment you want the model making rather than the config file.