Catching an LLM’s mistakes without asking another LLM

agentic AI

LLM

EdTech

The model that writes your content is the wrong tool to check it — especially for the things models are reliably bad at, like counting. A lot of LLM quality control is just plain code, run before you spend a second API call.

Author

Umut Altun

Published

August 26, 2025

The exam questions my generator produced had rules. A reading passage had to be roughly sixty to eighty words — long enough to hold a question, short enough for a ten-year-old. The stem had to be one or two sentences, not a paragraph. Exactly four options. No option dramatically longer than the others, because length is a tell that hands away the answer. Simple, mechanical rules, the kind a teacher applies without thinking.

The model broke them constantly. Not the hard stuff — the distractors were genuinely good — but the counting. A passage that was ninety words when I asked for seventy. Five options instead of four. A stem that ran on for four sentences. These are exactly the things language models are worst at, because they don’t actually count; they generate text that feels about the right length, and “feels like seventy words” is regularly ninety.

My first instinct was the obvious one: add a second LLM call to check the first. A validator prompt — “here’s a question, does it follow these rules, reply with the violations.” It’s the natural move, everyone reaches for it, and for this particular job it’s slightly absurd. I’d be asking a model that can’t count to seventy to verify whether another model counted to seventy. Same weakness, doubled, plus another API call, plus latency, plus a new way to be wrong. You don’t fix “the model can’t count” by adding more model.

Because here’s the thing: I can count. Or rather, ten lines of Python can, perfectly, instantly, for free. Every rule that broke was a rule a deterministic check enforces flawlessly — word counts, option counts, sentence counts, length ratios. These aren’t judgment calls that need a model’s understanding. They’re arithmetic, and arithmetic is the one thing the LLM in the loop is structurally bad at and plain code is structurally perfect at. The validation belonged in code, not in another prompt.

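import re

def wordcount(text):
    return len(text.split())  # assumed helper: whitespace-separated words

def sentences(text):
    return len(re.findall(r"[.!?]+", text))  # assumed helper: rough count of end-punctuation runs

def check(question):
    violations = []
    n = wordcount(question.passage)
    if not 60 <= n <= 80:
        violations.append(f"passage is {n} words; must be 60-80")
    if len(question.options) != 4:
        violations.append(f"{len(question.options)} options; must be exactly 4")
    if sentences(question.stem) > 2:
        violations.append("stem is more than 2 sentences")
    return violations   # empty == valid. no model, no tokens, no latency.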
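
The length-ratio rule (no option dramatically longer than the others) fits the same pattern. Here is a minimal sketch of one way to check it. The function name, the comparison against the average of the other options, and the 1.5 threshold are all illustrative assumptions, not values from the project.

```python
def length_ratio_violations(options, max_ratio=1.5):
    """Flag any option much longer than the average of the others.

    max_ratio=1.5 is an illustrative threshold, not a project value.
    """
    violations = []
    lengths = [len(opt.split()) for opt in options]
    for i, n in enumerate(lengths):
        others = lengths[:i] + lengths[i + 1:]
        if not others:
            continue  # a single option has nothing to compare against
        avg = sum(others) / len(others)
        if avg > 0 and n > max_ratio * avg:
            violations.append(
                f"option {i + 1} is {n} words; the others average {avg:.1f}"
            )
    return violations


balanced = ["a red ball", "a blue kite", "a green hat", "a small dog"]
lopsided = ["a red ball", "a blue kite", "a green hat",
            "a very long option that goes on and on and gives the game away"]
print(length_ratio_violations(balanced))  # []
print(len(length_ratio_violations(lopsided)))  # 1
```

Comparing each option against the average of the others, rather than against the shortest one, keeps a single terse option from flagging everything else.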

Run that before you call any model again, and most of the failures are caught at no cost. But catching them is only half of it. The half that made this actually work is what you do with the violations: you don’t discard the question and regenerate from scratch, and you don’t just retry and hope. You feed the specific violations back to the generator as targeted feedback (“the passage is 92 words; cut it to 80 or fewer”) and let it fix that exact thing. A retry that knows precisely what was wrong succeeds far more often than a blind one, because you’ve turned a vague “try again” into a concrete instruction, and you’ve spent zero model calls figuring out what to say.
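
A minimal sketch of that loop, under stated assumptions: `generate(previous, violations)` is a hypothetical callable wrapping the model call, and the stand-in generator and single-rule checker below exist only so the example runs without a model. The real prompt wording and retry limit are not shown in this post.

```python
def generate_with_feedback(generate, check, max_attempts=3):
    """Generate, validate in plain code, and retry with the exact violations.

    generate(previous, violations) is a hypothetical wrapper around the model:
    on the first attempt it receives (None, []); on retries it receives the
    previous draft and the list of violation strings to fix.
    """
    previous, violations = None, []
    for attempt in range(1, max_attempts + 1):
        question = generate(previous, violations)
        violations = check(question)
        if not violations:
            return question, attempt
        previous = question  # keep the draft; only the violations need fixing
    raise ValueError(f"still invalid after {max_attempts} attempts: {violations}")


def fake_generate(previous, violations):
    # Stand-in for the model: the first draft has five options, the fix has four.
    if previous is None:
        return {"options": ["a", "b", "c", "d", "e"]}
    return {"options": previous["options"][:4]}


def check_options(question):
    n = len(question["options"])
    return [] if n == 4 else [f"{n} options; must be exactly 4"]


question, attempts = generate_with_feedback(fake_generate, check_options)
print(attempts)  # 2
```

The only model calls here are the ones inside `generate`; deciding what to say on the retry costs nothing.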

The division of labour is what makes this work, and it goes well past exam questions: let the model do the part that needs understanding, and let plain code do the part that needs precision. Generating a plausible reading passage with good distractors needs understanding — that’s irreducibly the model’s job, there’s no Python for “write something a fifth-grader would find interesting.” Verifying it’s the right length needs precision — that’s irreducibly code’s job, and routing it through a model just launders a deterministic fact through a probabilistic process and makes it worse.

There’s a cost-shaped reason this matters too, beyond correctness. Deterministic checks are free and instant; LLM calls are neither. In a system generating thousands of questions, putting the cheap, reliable check first — and only spending a model call on the targeted fix when something genuinely fails — is the difference between a pipeline that’s affordable and one that quietly triples your bill validating things a regex could have settled.
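
To make the cost argument concrete, here is a back-of-envelope sketch. The numbers are made up for illustration: 1,000 questions, a 25% first-pass failure rate, and the simplifying assumption that one retry fixes each failure. None of these figures come from the project.

```python
def model_calls(n_questions, failure_rate, llm_validator):
    """Estimate total model calls, assuming one retry fixes each failure."""
    retries = round(n_questions * failure_rate)
    generations = n_questions + retries
    # With an LLM validator, every generation also costs a validation call.
    validations = generations if llm_validator else 0
    return generations + validations


print(model_calls(1000, 0.25, llm_validator=True))   # 2500
print(model_calls(1000, 0.25, llm_validator=False))  # 1250
```

Under these assumptions, validating in code halves the number of model calls. The gap widens if the validator prompt is longer than the generation prompt or if retries need more than one round.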

I think the reflex to solve every LLM problem with more LLM is one of the easy mistakes of building with these models right now. They’re so capable at the hard parts that it’s tempting to hand them the easy parts too. But the easy parts — counting, formatting, structure — are often precisely where they’re weakest, and where the boring old tools are not just cheaper but flatly better.¹

From an EdTech project building an LLM-based exam-question generator. Curriculum, client, and prompt specifics are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

¹ Why models can’t count reliably, briefly: they operate on tokens, not characters or words, and a word might be one token or several, so “how many words” isn’t even cleanly visible to the model in the way it is to you. Asking an LLM for an exact count is asking it to do arithmetic on a representation it can’t see clearly. Tool use, letting the model call a counting function, is the principled fix when the count has to happen mid-reasoning; for post-hoc validation, just check it yourself.