The eight silent seconds that made my app feel broken
I watched three different people decide my app was broken. It wasn’t. It was just quiet.
Here’s what happened. A user types a question, and behind the scenes the agent does real work: it routes the question to the right specialist, writes SQL, runs it against the warehouse, then summarises the result into a sentence and a chart. End to end, that’s around eight seconds — a couple of LLM calls and a database query, none of it unreasonable. And the answer that comes back is correct.
But for those eight seconds, the screen showed nothing. A spinner, maybe, doing its little spin. And what I saw, over and over, was the user wait about four seconds, conclude it had hung, and refresh the page — which of course threw away the in-flight request and started the whole thing over, making it worse. They weren’t wrong to do it. Every signal the interface gave them said “nothing is happening here.” Silence reads as failure.
My first instinct was to make it faster. That’s the engineer’s reflex: latency too high, drive it down. I spent a little while there before admitting that eight seconds for two LLM calls and a query is already roughly where it’s going to be, and shaving it to six wouldn’t change anything — six silent seconds also reads as broken. I was optimising the wrong quantity. The problem was never the latency. It was the silence. Users don’t actually mind waiting if they can see that something is happening and that it’s happening for them.
So I stopped trying to make it faster and started making it legible. The pipeline already moves through distinct stages — routing, writing the query, running it, summarising — so instead of hiding those behind one spinner, I streamed them to the front end as they happened, over server-sent events:
routing your question…
writing the query…
running it against your data…
summarising the result…
Same eight seconds. Completely different experience. Now the wait has a narrative: the user watches the system think, each line proof that work is being done on their behalf, and nobody refreshes anymore because nothing ever looks stuck. I changed the perceived latency without touching the actual latency by a millisecond. For an interactive tool, perceived latency is the one that pays rent.
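A minimal sketch of what that stage-by-stage streaming can look like on the server side, assuming a Python backend (this post doesn't depend on the language). The names `format_sse` and `stream_pipeline`, and the stand-in stage functions, are hypothetical and only for illustration. The one fixed part is the wire format: `event:` and `data:` fields, with a blank line ending each event.

```python
from typing import Callable, Iterable, Iterator, Tuple


def format_sse(data: str, event: str = "message") -> str:
    """Encode one server-sent event; a blank line marks the end of the event."""
    lines = [f"event: {event}"]
    # The wire format wants one 'data:' field per line of payload.
    lines.extend(f"data: {part}" for part in data.split("\n"))
    return "\n".join(lines) + "\n\n"


def stream_pipeline(
    question: str,
    stages: Iterable[Tuple[str, Callable[[str], str]]],
) -> Iterator[str]:
    """Yield a progress event before each stage runs, then the final result."""
    value = question
    for label, step in stages:
        # Announce the stage first, so the user sees it while the work runs.
        yield format_sse(label, event="progress")
        value = step(value)
    yield format_sse(value, event="result")


if __name__ == "__main__":
    # Stand-in stages (assumptions): the real ones call an LLM and a database.
    demo_stages = [
        ("routing your question…", lambda q: q),
        ("writing the query…", lambda q: f"SELECT 1 /* {q} */"),
        ("running it against your data…", lambda sql: "1 row"),
        ("summarising the result…", lambda rows: f"Summary of {rows}"),
    ]
    for chunk in stream_pipeline("how were sales last week?", demo_stages):
        print(chunk, end="")
```

The point of the generator shape is that each progress line leaves the function before the slow step starts, so whatever framework serves it can flush the line right away.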
And then it didn’t work, which is the part worth writing down. I wired up the events, the backend emitted them in order, and the front end received… all of them at once, in a single clump, right at the end — exactly the silence I’d built this to kill, now with extra machinery. The backend was streaming correctly. Something between the backend and the browser was holding the whole response back and delivering it in one piece.
That something was the reverse proxy. By default it buffers responses — a perfectly sensible optimisation for normal request/response traffic, and the precise opposite of what streaming needs. It was collecting my carefully-streamed events into a buffer and flushing them together, defeating the entire point. The fix is two lines, and finding them took an embarrassing fraction of an afternoon:
location /api/ {
    proxy_buffering off;              # stream chunks through, don't accumulate
    add_header X-Accel-Buffering no;  # same hint for any nginx sitting in front of this one
}
That’s it. That’s the difference between a progress stream and a spinner that sits dead for eight seconds and then dumps everything at once.
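The directives above are nginx's, and nginx also honours an `X-Accel-Buffering: no` response header when the proxied application itself sends it. So the same intent can be stated from the app side, which helps when the proxy config isn't yours to edit. Below is a hedged sketch as a plain WSGI app, assuming a Python backend. `progress_app`, `SSE_HEADERS`, and `STAGE_LABELS` are hypothetical names, and the `Cache-Control` header is a customary addition for event streams rather than something this post's setup is known to use.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Labels copied from the status lines in this post.
STAGE_LABELS: List[str] = [
    "routing your question…",
    "writing the query…",
    "running it against your data…",
    "summarising the result…",
]

SSE_HEADERS: List[Tuple[str, str]] = [
    ("Content-Type", "text/event-stream"),  # marks the body as an event stream
    ("Cache-Control", "no-cache"),          # customary for SSE (assumption here)
    ("X-Accel-Buffering", "no"),            # asks nginx not to buffer this response
]


def progress_app(environ: Dict, start_response: Callable) -> Iterator[bytes]:
    """WSGI app that sends one SSE event per pipeline stage."""
    start_response("200 OK", SSE_HEADERS)

    def body() -> Iterator[bytes]:
        for label in STAGE_LABELS:
            # The real app would run the stage here; this sketch only
            # emits the label. The blank line ends each event.
            yield f"data: {label}\n\n".encode("utf-8")

    return body()
```

Setting the header per response also keeps buffering on for every other endpoint behind the same proxy, where it is still a sensible default.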
I keep that snippet around partly because I’ll need it again and partly as a reminder of what “productionising an LLM app” actually consists of. The prompt engineering and the model choice get the attention, but a real share of the work that decides whether people keep using the thing is unglamorous plumbing exactly like this: streaming, buffering, timeouts, the difference between correct and correct-and-legible. None of it shows up in a demo, because a demo is one person who already knows it works and is willing to wait. Production is a stranger who assumes it’s broken the moment it goes quiet.[1]
So now, when something feels slow, I ask whether the problem is the duration or the silence before I spend a week chasing the duration. Often the cheapest fix isn’t making the work faster — it’s making the work visible.
From a consulting project building natural-language analytics for restaurant businesses. Customer details and infrastructure specifics are abstracted; the reasoning is as built. Code is illustrative.
Footnotes
[1] I used server-sent events rather than WebSockets on purpose: the data only flows one way (server to client), SSE is just HTTP so it sails through proxies and load balancers with far less ceremony, and it reconnects on its own. WebSockets would have been a heavier answer to a one-directional question.