A multiple-choice question is only as good as its wrong answers

agentic AI
LLM
EdTech
Getting an LLM to write the correct answer to an exam question is trivial. Getting it to write three wrong answers that a half-prepared student would actually fall for is the entire problem — and it’s a problem about how students think, not how models do.
Author: Umut Altun

Published: July 29, 2025

I spent a while building a system that generates exam questions — multiple-choice, curriculum-aligned, the kind a teacher would actually hand out. And the thing that surprised me, the thing I’d have gotten wrong if you’d asked me beforehand, is which part is hard. The correct answer is trivial. Any competent model writes the right answer to a fifth-grade science question without breaking a sweat. The hard part — the part that decides whether the question is any good — is the three wrong answers.

Those wrong answers have a name in assessment design: distractors. And a multiple-choice question is, almost entirely, its distractors. The correct option is fixed by the curriculum; there’s one right answer and everyone agrees what it is. The distractors are where all the craft lives, because a question with bad distractors isn’t a question at all — it’s a giveaway with extra steps.

Watch what goes wrong when you let a model generate distractors naively. You get three failure modes, over and over. The distractors are too obviously wrong: the question asks what plants need to grow and the wrong options include “a bicycle”, so any student eliminates them on sight and the question tests nothing. Or they’re subtly also correct: phrased loosely enough that a smart student can argue one of the “wrong” answers is defensible, which turns the question into a trap and makes the teacher look careless. Or they’re wrong but untempting: not absurd, just so unrelated to the concept that no real misconception would ever lead a student there.

What all three have in common is that the model is generating wrong answers from the answer’s point of view — “give me three things that aren’t this” — when good distractors come from the student’s point of view: what does a kid who half-understands this actually believe?

That reframe is the whole job. A good distractor is a wrong answer that a specific, plausible misunderstanding leads to. It’s the number you get if you add when you should have multiplied. It’s the definition that’s right for the neighbouring concept the student confused this one with. It’s the answer that’s correct for grams when the question asked for kilograms. Each good distractor is a little theory about how a student goes wrong — and a student who almost knows the material should feel genuine pull toward exactly one of them.
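To make the "little theory" idea concrete, here is a minimal Python sketch for one invented arithmetic question: a crate holds 12 boxes of 250 grams each, and the question asks for the total weight in kilograms. The question, the `Distractor` type, and `build_distractors` are all illustrative assumptions, not part of the system described here. The point is only that each wrong value is computed from a named slip rather than picked at random.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Distractor:
    value: float  # the wrong answer, in the unit the question asks for
    slip: str     # the student reasoning that produces it


def build_distractors(boxes: int, grams_per_box: int) -> tuple[float, list[Distractor]]:
    """Return the correct answer in kg plus one distractor per named slip."""
    total_grams = boxes * grams_per_box
    correct_kg = total_grams / 1000
    distractors = [
        # Right number, wrong unit: the student never converts to kilograms.
        Distractor(float(total_grams), "unit error: correct for grams, not kilograms"),
        # Right method, one realistic slip: wrong conversion factor.
        Distractor(total_grams / 100, "wrong step: divided by 100 instead of 1000"),
        # Wrong operation: adds the two numbers instead of multiplying them.
        Distractor((boxes + grams_per_box) / 1000, "added when they should have multiplied"),
    ]
    return correct_kg, distractors


correct, distractors = build_distractors(boxes=12, grams_per_box=250)
for d in distractors:
    print(f"{d.value:>8} kg  <- {d.slip}")
```

Each entry carries its own explanation, so a reviewer can judge whether the slip is one a real student would make instead of judging the bare number.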

So the fix wasn’t a better prompt asking for “plausible wrong answers.” It was giving the generator an explicit menu of distractor strategies — ways students reliably go wrong — and having it build each distractor from one:

For each distractor, choose a strategy and apply it:
  - common misconception   : the wrong belief students actually hold here
  - right method, wrong step: correct approach, one realistic slip
  - adjacent concept        : the definition of a nearby idea they confuse this with
  - unit / scale error      : right number, wrong unit or magnitude
Reject any distractor that is defensibly correct, or that no real
student reasoning would produce.
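One way to keep a menu like this from drifting is to hold it as data and render the prompt from it. The sketch below is an assumption about how that could look, not the prompt as built: the `Strategy` enum, the `build_distractor_prompt` helper, and the exact wording around the menu are all hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    """The four distractor strategies, each with its one-line description."""
    COMMON_MISCONCEPTION = "the wrong belief students actually hold here"
    RIGHT_METHOD_WRONG_STEP = "correct approach, one realistic slip"
    ADJACENT_CONCEPT = "the definition of a nearby idea they confuse this with"
    UNIT_OR_SCALE_ERROR = "right number, wrong unit or magnitude"


def build_distractor_prompt(question: str, correct_answer: str, n: int = 3) -> str:
    """Render a generation prompt that lists every strategy in the menu."""
    menu = "\n".join(f"  - {s.name.lower()}: {s.value}" for s in Strategy)
    return (
        f"Question: {question}\n"
        f"Correct answer: {correct_answer}\n"
        f"Write {n} distractors. For each distractor, choose a strategy and apply it:\n"
        f"{menu}\n"
        "Reject any distractor that is defensibly correct, or that no real "
        "student reasoning would produce.\n"
    )


prompt = build_distractor_prompt(
    question="A crate holds 12 boxes of 250 g each. What is the total weight in kg?",
    correct_answer="3 kg",
)
print(prompt)
```

Because the strategy names live in one place, the same enum can later label each generated distractor, which makes it possible to check which strategy an option came from.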

Naming the strategies did two things. It made the distractors good — each one now corresponds to a real way a student fails, so the question discriminates between kids who understand and kids who don’t, which is the entire point of an assessment. And it made them diverse: three distractors built from three different strategies cover three different misconceptions, so the question probes the concept from several angles instead of testing the same slip three times.

The reject rule in there earns its place, because it guards the one failure that’s worse than an easy question: the ambiguous one. A distractor that’s defensibly correct doesn’t just make the question easy, it makes it wrong — the student who knows the most might pick it for a good reason and get marked down, which is the exact opposite of what an assessment is for. So “is this option arguably also right?” is a hard gate every distractor has to pass, and it’s worth spending a check on.
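The gate and the one-strategy-per-distractor rule can both be sketched as a small filter over candidates. This is a hedged illustration under stated assumptions: `Candidate`, `passes_gate`, and `select_distractors` are hypothetical names, and `is_defensible` stands in for whatever check answers "is this option arguably also right?" (a second model call, a rubric, a human reviewer). Here it is stubbed with a fixed lookup so the example runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    text: str
    strategy: str  # which strategy from the menu produced this option


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def passes_gate(c: Candidate, correct: str, is_defensible: Callable[[str], bool]) -> bool:
    """Hard gate: reject the correct answer restated, or anything arguably right."""
    if normalize(c.text) == normalize(correct):
        return False
    return not is_defensible(c.text)


def select_distractors(candidates: list[Candidate], correct: str,
                       is_defensible: Callable[[str], bool], n: int = 3) -> list[Candidate]:
    """Keep the first n candidates that pass the gate, one per strategy."""
    chosen: list[Candidate] = []
    used_strategies: set[str] = set()
    for c in candidates:
        if c.strategy in used_strategies:
            continue  # this strategy is already covered by an earlier pick
        if not passes_gate(c, correct, is_defensible):
            continue
        chosen.append(c)
        used_strategies.add(c.strategy)
        if len(chosen) == n:
            break
    return chosen


# Stub for the ambiguity check (assumption): "3.0 kg" is arguably the same answer.
arguably_correct = {"3.0 kg"}
candidates = [
    Candidate("3000 kg", "unit_or_scale_error"),
    Candidate("3 KG", "common_misconception"),       # the correct answer restated
    Candidate("3.0 kg", "right_method_wrong_step"),  # defensibly correct
    Candidate("30 kg", "right_method_wrong_step"),
    Candidate("300 kg", "unit_or_scale_error"),      # strategy already used
    Candidate("0.262 kg", "common_misconception"),
]
picked = select_distractors(candidates, "3 kg", lambda t: t in arguably_correct)
print([c.text for c in picked])
```

A rejected candidate does not use up its strategy, so a later option built from the same strategy can still fill the slot.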

The thing I keep turning over from this project is that the interesting problem wasn’t really about language models at all. It was about assessment design, about how children misunderstand fractions and photosynthesis, and the model was just the tool that needed to be taught that domain. I went in thinking I was building a text generator. I spent most of my time learning what makes a wrong answer worth offering.[1]


From an EdTech project building an LLM-based exam-question generator. Curriculum, client, and prompt specifics are abstracted; the reasoning is as built.

Footnotes

  1. There’s a depth axis underneath this that distractor strategies interact with: a question testing simple recall has a narrow space of plausible wrong answers, while one testing application or analysis has a rich space, because there are more steps to slip on. The harder the cognitive level you’re targeting, the more the distractor strategies have to offer, which means the same machinery that makes wrong answers plausible also, usefully, makes it visible when a question is shallower than it pretends to be.