Grounding an agent in documents it isn’t allowed to get wrong
My MSc thesis put an LLM to work on telecom network configuration — taking a goal expressed in plain language and producing the parameter settings to achieve it, grounded in the operator’s own documentation. It’s a domain I’d worked in before the degree, and the thing that shaped the entire design is something I’d internalised there: in a live network, a wrong configuration value isn’t an inconvenience. A misconfigured radio parameter degrades a cell, and that’s real people losing signal in a real place. Confidently wrong is the worst possible failure mode, and a language model’s default behaviour is confidently anything.
So the central problem was never “can the model produce a plausible configuration.” Of course it can — it’ll produce a beautifully formatted, authoritative-sounding parameter set whether or not a single value is correct. The problem was making sure every value it produced came from the documentation rather than from the model’s own half-remembered training data. The specs are dense, version-specific, and full of values that look interchangeable but aren’t. A model running on parametric memory will cheerfully give you a parameter that was right in some other release, for some other vendor, with total confidence. The fluency is exactly what makes it dangerous.
This is what retrieval-augmented generation is for, and I think it’s often explained backwards — as a way to give the model “more knowledge.” That’s not the point that matters here. The point is to change where the answer comes from. Without retrieval, the model answers from its weights — an averaged, lossy compression of everything it read in training, with no notion of which source said what or whether it’s current. With retrieval, you fetch the actual relevant passages from the actual documentation at question time, put them in front of the model, and constrain it to build its answer from those passages. You’re not topping up its memory. You’re moving the source of truth out of the weights, where you can’t inspect or trust it, and into a set of documents you control.
question -> retrieve the relevant spec passages from the real docs
         -> answer USING ONLY those passages, and cite which one for each value
         -> if the passages don't contain the answer, say so; don't fill the gap from memory
That last line is the one that does the real work, and it’s the discipline most RAG systems are too lax about. Retrieval gets you the right pages; it doesn’t automatically stop the model from also leaning on its parametric memory when the retrieved text is thin. In a low-stakes setting, a model that quietly supplements the documents with its own guesses is fine, even helpful. In network configuration it’s the whole danger reintroduced through the side door: a value that looks retrieved but was actually invented. So a large part of the work was forcing the failure to be honest: when the documentation doesn’t contain the answer, the correct output is “the docs don’t specify this,” not a confident fabrication dressed up as a retrieved value. An agent that admits the gap is safe. An agent that papers over it with a plausible number is exactly the thing you were trying to prevent.
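To make the abstention rule concrete, here is a minimal sketch of the contract rather than the thesis implementation. It swaps the model for a simple regex extractor so the example runs on its own. The parameter names, passage IDs, the "name = value" convention, and every identifier (Passage, Answer, grounded_lookup) are assumptions made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

ABSTAIN = "the docs don't specify this"


@dataclass
class Passage:
    passage_id: str
    text: str


@dataclass
class Answer:
    parameter: str
    value: Optional[str]       # None means the system abstained
    passage_id: Optional[str]  # which retrieved passage the value came from
    message: str


def grounded_lookup(parameter: str, passages: list[Passage]) -> Answer:
    """Return a value only if a retrieved passage states it; otherwise abstain."""
    # Hypothetical convention: the docs state settings as "name = value".
    pattern = re.compile(rf"\b{re.escape(parameter)}\s*=\s*(\S+)")
    for passage in passages:
        match = pattern.search(passage.text)
        if match:
            value = match.group(1).rstrip(".,;")
            return Answer(parameter, value, passage.passage_id,
                          f"{parameter} = {value} (source: {passage.passage_id})")
    # No default and no remembered value: the gap is reported as a gap.
    return Answer(parameter, None, None, f"{parameter}: {ABSTAIN}")


# Invented passages standing in for whatever the retriever returned.
retrieved = [
    Passage("spec-4.2", "For dense urban cells, set handover_margin = 3."),
    Passage("spec-7.1", "The inactivity_timer = 10 applies to this release."),
]

print(grounded_lookup("handover_margin", retrieved).message)
print(grounded_lookup("max_tx_power", retrieved).message)
```

The design choice worth noticing is that the function has no fallback branch. There is simply no code path that returns a value without a passage ID attached, so an unanswerable question can only come back as an explicit gap.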
The other half was making the grounding auditable, which falls out naturally once the answer comes from retrieved passages. Every value the system proposed could point back to the specific passage it came from. That traceability isn’t a nicety in this domain — it’s what lets a network engineer trust the output enough to act on it, because they can check the source rather than taking the model’s word. A configuration you can’t trace back to a document is a configuration nobody responsible will deploy, no matter how confident the model sounded. The citation is the difference between a suggestion an expert can verify and a black box they have to either blindly trust or ignore.
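The audit step can be sketched in a few lines too. This is an illustrative check under my own assumptions, not the code from the thesis: it takes the values a model proposed, each with the passage it claims as its source, and flags any whose citation was never retrieved or whose cited passage does not actually contain that parameter and value. The token matching is deliberately crude, and ProposedValue, audit, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedValue:
    parameter: str
    value: str
    passage_id: str  # the citation the model attached to this value


def audit(proposals: list[ProposedValue], passages: dict[str, str]) -> list[str]:
    """Return one finding per proposal that fails the citation check."""
    findings = []
    for p in proposals:
        source = passages.get(p.passage_id)
        if source is None:
            findings.append(
                f"{p.parameter}: cited passage {p.passage_id!r} was not retrieved")
            continue
        # Crude whole-token match, so "3" does not match inside "30".
        tokens = {t.strip(".,;:()") for t in source.split()}
        if p.parameter not in tokens or p.value not in tokens:
            findings.append(
                f"{p.parameter}: value {p.value!r} not found in {p.passage_id!r}")
    return findings


# Invented data: one retrieved passage and three model proposals.
passages = {"spec-4.2": "For dense urban cells, set handover_margin = 3."}
proposals = [
    ProposedValue("handover_margin", "3", "spec-4.2"),    # matches its source
    ProposedValue("handover_margin", "6", "spec-4.2"),    # value not in the passage
    ProposedValue("inactivity_timer", "10", "spec-9.9"),  # citation never retrieved
]

findings = audit(proposals, passages)
for line in findings:
    print(line)
```

A check like this only confirms that the cited text contains the value. Whether that value is the right one for the goal is still the engineer's call, which is the reason to keep the citation visible in the first place.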
It’s the same pattern across everything I’ve built with these models since: an LLM gets useful in serious settings not when you trust it more, but when you arrange things so you have to trust it less. Retrieval grounds it in sources you can check; forcing it to abstain when the sources are silent stops it filling gaps with confident noise; citations make every claim verifiable. None of that makes the model smarter. It makes the output checkable, and checkable is the bar for anything where being wrong has a cost.[1]
From my MSc thesis on agentic AI for telecom network configuration. The domain specifics are generalised; the design reasoning is as built.
Footnotes
[1] The “agentic” part of the thesis (multiple steps, the model planning and calling tools rather than answering in one shot) raises the stakes on all of this rather than easing it. A single ungrounded value early in a multi-step configuration propagates: later steps build on it, and the final output is confidently, elaborately wrong in a way that’s harder to spot than a single bad answer. The more autonomous the pipeline, the more the grounding and the abstention have to hold at every step, because there’s no human reading each intermediate result and catching the one that drifted.