Why my marketing model has opinions before it sees data
The first marketing mix model I built was an ordinary regression: installs as the response, daily spend per channel as the predictors, fit by least squares. It produced coefficients, the coefficients implied a return per channel, and the returns were nonsense: one channel came back with a negative effect, as if spending money on it actively destroyed installs.
I assumed I’d made a mistake. I hadn’t, not really. I’d just run face-first into the thing nobody tells you about marketing mix modeling: the data does not identify the answer. Several completely different stories about which channel drives what will fit your data about equally well, and ordinary regression picks between them on the strength of noise. It returns whichever one happens to give the smallest sum of squared residuals in this particular sample, including the ones that push a coefficient negative to cancel out a correlated neighbour.
The reason is structural. Marketing channels move together. When a game is doing well the team scales spend up across Meta and TikTok and Google all at once; when it’s cutting back, everything comes down together. So the spend columns are heavily correlated, and correlated predictors are poison for regression — the model can’t tell whether installs followed Meta or TikTok, because the two are nearly the same column. It splits the credit arbitrarily, and “arbitrarily” includes giving one channel a huge positive coefficient and its correlated twin a negative one. This is multicollinearity, and in MMM it isn’t an edge case, it’s the default condition. You can’t regularize your way out of it with ridge or lasso either — those stabilize the fit, but they shrink toward zero, which is its own unjustified opinion (“all channels are probably ineffective”) wearing the costume of neutrality.
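A minimal sketch of that failure, using synthetic numbers rather than anything from the real system: two hypothetical spend columns that both track one shared budget level, a ground truth in which both channels have positive effects, and OLS refit on datasets that differ only in their noise. The channel names, effect sizes, and noise levels are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 60

# Hypothetical spend: both channels follow one shared budget level,
# plus a little independent wiggle each.
budget = rng.uniform(50, 150, n_days)
meta = budget + rng.normal(0, 2, n_days)
tiktok = 0.5 * budget + rng.normal(0, 2, n_days)

X = np.column_stack([np.ones(n_days), meta, tiktok])
true_coefs = np.array([100.0, 2.0, 1.0])  # intercept, meta, tiktok: all positive

corr = np.corrcoef(meta, tiktok)[0, 1]
vif = 1.0 / (1.0 - corr**2)  # variance inflation factor for two predictors

# Refit OLS on 200 datasets that differ only in their noise draw.
estimates = []
for _ in range(200):
    installs = X @ true_coefs + rng.normal(0, 50, n_days)
    coefs, *_ = np.linalg.lstsq(X, installs, rcond=None)
    estimates.append(coefs)
estimates = np.array(estimates)

# Fraction of refits in which at least one channel coefficient is negative.
share_negative = (estimates[:, 1:] < 0).any(axis=1).mean()

print(f"spend correlation: {corr:.3f}, VIF: {vif:.1f}")
print(f"std of tiktok coefficient across refits: {estimates[:, 2].std():.2f}")
print(f"refits with a negative channel coefficient: {share_negative:.0%}")
```

Nothing in the design matrix changes between refits. Only the noise does, and that alone is enough to flip the sign of a channel whose true effect is positive.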
What finally fixed it for me was realizing that the problem was never that I lacked an opinion about these channels. I had plenty. I knew roughly what a reasonable return looks like for paid UA in this space. I knew a channel can’t have a negative causal effect on installs; at worst it does nothing. I knew, within a factor, how the big networks tend to rank. I had all of this in my head, and I was running a method that threw it away and demanded the 60 days of data carry the entire load by themselves. No wonder it buckled.
So I switched to Bayesian MMM. In practice that meant Google’s LightweightMMM, which puts the whole thing in a probabilistic model and samples the posterior with NUTS via NumPyro. The single most important thing that buys you isn’t the fancy sampler or the credible intervals. It’s that the framework has a designed slot for the opinions you already hold. They’re called priors, and they turn “I have a hunch the model keeps ignoring” into a formal input the math has to respect.
```python
from lightweight_mmm import lightweight_mmm

mmm = lightweight_mmm.LightweightMMM()

# the model is told, before it sees a single day of data, roughly where
# each channel's effect should sit, and that effects are non-negative.
# the data updates these beliefs; it doesn't start from a blank slate.
mmm.fit(
    media=spend,                 # (n_days, n_channels), correlated columns
    media_prior=channel_priors,  # my domain belief about each channel's pull
    target=installs,
    number_warmup=1000,
    number_samples=1000,
)
```

The priors do the work that the data structurally can’t. Where the likelihood is flat (where the data genuinely can’t distinguish Meta from TikTok because they moved together), the posterior leans on the prior, and you get a sane, non-negative attribution near what you believed going in. Where the data is informative (a week where one channel moved and the others didn’t, a natural experiment the team didn’t know it ran), the likelihood dominates and the posterior moves off the prior toward what actually happened. The model spends its limited evidence updating the beliefs the data can actually speak to, instead of flailing on the ones it can’t. That’s exactly the behaviour you want, and it’s the behaviour OLS can’t give you because OLS has nowhere to put a belief.
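The prior-versus-likelihood tradeoff can be seen in the simplest closed-form case. The sketch below is not how LightweightMMM computes anything: it assumes a normal prior and a normal likelihood for a single channel effect, purely because that pair has an exact answer, and the numbers are invented. A normal prior also allows negative values, so it does not enforce the non-negativity described above.

```python
def normal_update(prior_mean, prior_sd, data_estimate, data_se):
    """Posterior mean and sd for a normal prior combined with a normal likelihood."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / data_se**2
    post_precision = prior_precision + data_precision
    # The posterior mean is a precision-weighted average of prior and data.
    post_mean = (
        prior_mean * prior_precision + data_estimate * data_precision
    ) / post_precision
    return post_mean, post_precision**-0.5


# Hypothetical prior belief about one channel's effect.
prior_mean, prior_sd = 1.5, 0.5

# Flat likelihood: the data "says" -0.8, but with a huge standard error.
flat_mean, flat_sd = normal_update(prior_mean, prior_sd, data_estimate=-0.8, data_se=3.0)

# Informative likelihood: the data says 0.6 with a tight standard error.
sharp_mean, sharp_sd = normal_update(prior_mean, prior_sd, data_estimate=0.6, data_se=0.1)

print(f"flat data:  posterior {flat_mean:.2f} +/- {flat_sd:.2f}")
print(f"sharp data: posterior {sharp_mean:.2f} +/- {sharp_sd:.2f}")
```

With the noisy estimate the posterior barely leaves the prior and stays positive. With the precise one it lands close to the data and tightens considerably.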
Setting those priors well is its own craft, and an honest one — I’ve written separately about how I anchored them and the judgment that takes, because a prior is a claim and you should be able to defend it. But every MMM encodes prior beliefs. OLS encodes “I believe nothing, and I’m comfortable with a negative coefficient if it fits.” Ridge encodes “I believe every effect is probably near zero.” Neither of those is actually neutral — they’re just opinions held by accident, by people who didn’t realize they were choosing them. Bayesian MMM’s only real difference is that it makes you say your opinion out loud, in a place where someone can challenge it, instead of smuggling it in through your choice of regularizer.
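The claim about ridge can be checked numerically. The sketch below assumes a linear model with known noise variance and a zero-mean Gaussian prior on every coefficient, and uses random synthetic data. Under those assumptions the ridge solution with penalty `lam = noise_var / prior_var` coincides with the Bayesian posterior mean. The variable names and values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days, n_channels = 60, 3

# Synthetic predictors and response (no intercept, to keep the algebra short).
X = rng.normal(size=(n_days, n_channels))
y = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(0, 1.0, n_days)

noise_var = 1.0   # sigma^2, assumed known
prior_var = 0.25  # tau^2 in the prior N(0, tau^2 * I): "effects are near zero"
lam = noise_var / prior_var  # the ridge penalty this prior implies

identity = np.eye(n_channels)

# Ridge regression, closed form.
ridge = np.linalg.solve(X.T @ X + lam * identity, X.T @ y)

# Posterior mean under the Gaussian prior, written in Bayesian terms.
posterior_mean = np.linalg.solve(
    X.T @ X / noise_var + identity / prior_var,
    X.T @ y / noise_var,
)

# OLS is the limit where the prior variance is infinite (no penalty).
ols = np.linalg.solve(X.T @ X, X.T @ y)

print("ridge == posterior mean:", np.allclose(ridge, posterior_mean))
print("coefficient norm, OLS vs ridge:", np.linalg.norm(ols), np.linalg.norm(ridge))
```

Picking a penalty is picking a prior variance, whether or not anyone wrote it down as one.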
I used to think putting priors on a model was cheating: tilting the result toward what I wanted. I had it backwards. The priors weren’t the bias. Pretending I didn’t have any was the bias, and the negative coefficient was what that pretense cost me. The model that states its opinions before it sees the data is the more honest one, because at least you can argue with it. [1]
From work on a marketing-analytics system for a mobile-gaming portfolio. Channels, numbers, and priors are abstracted; the reasoning is as built. Code is illustrative.
Footnotes
[1] There’s a real failure mode on the other side: priors so tight the data can never overrule them, at which point you’re not modeling, you’re just reading your assumptions back out with extra steps. The discipline is to set priors you’d defend as a starting belief and then check how far the posterior actually moved. If it never moves, your priors are too strong or your data is too weak to be running an MMM at all. That second possibility is worth taking seriously more often than people do.
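One way to run that check, sketched under assumptions: compare prior draws and posterior draws for a single channel coefficient, and report how far the mean shifted and how much the variance shrank (a quantity sometimes called posterior contraction). The arrays below are simulated stand-ins. In a real run they would be draws pulled from the prior and from the fitted model, by whatever means your tooling provides.

```python
import numpy as np


def prior_to_posterior(prior_samples, posterior_samples):
    """Return (shift in prior sds, contraction) for one parameter."""
    shift = (np.mean(posterior_samples) - np.mean(prior_samples)) / np.std(prior_samples)
    # 0 means the data taught the model nothing; near 1 means it taught a lot.
    contraction = 1.0 - np.var(posterior_samples) / np.var(prior_samples)
    return shift, contraction


rng = np.random.default_rng(2)

# Simulated stand-ins for draws of one channel coefficient.
prior = rng.normal(1.5, 0.5, 4000)
stuck_posterior = rng.normal(1.48, 0.49, 4000)  # posterior that never left the prior
moved_posterior = rng.normal(0.65, 0.10, 4000)  # posterior the data actually updated

stuck_shift, stuck_contraction = prior_to_posterior(prior, stuck_posterior)
moved_shift, moved_contraction = prior_to_posterior(prior, moved_posterior)

print(f"stuck: shift {stuck_shift:+.2f} sd, contraction {stuck_contraction:.2f}")
print(f"moved: shift {moved_shift:+.2f} sd, contraction {moved_contraction:.2f}")
```

A channel that shows near-zero shift and near-zero contraction is one where the output is the prior restated, which is the case the footnote warns about.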