Designing an LLM system that thinks in Turkish

LLM

Text2SQL

localization

Building a Text2SQL agent in Turkish turned out to be five percent about the model and ninety-five percent about everything around it. A note on staying in-language end to end, the missing byte-order mark that shredded every export, and the English-by-default tax.

Author

Umut Altun

Published

May 25, 2026

The bug report was a screenshot of a spreadsheet where every Turkish character had been shredded — çalışan showing up as Ã§alÄ±ÅŸan. The data was correct. The numbers were right. It just looked like garbage, and to the operator who’d downloaded it, “looks broken” and “is broken” are the same thing.

That screenshot is my favourite example of something I badly underestimated: building an LLM system that works in Turkish is about five percent the model and ninety-five percent everything around it.

The model part is the easy part, which genuinely surprised me. The users ask in Turkish, the schema is in Turkish — column names, and categorical values like şube (branch) or zayi (waste) — and the answers need to come back in Turkish. I’d braced for this to be the hard problem and it mostly wasn’t: modern models handle Turkish comfortably, and I write the system prompts in Turkish too, so the whole pipeline stays in-language instead of translating in and out at the edges. That’s the one model-level decision I’d insist on — stay in Turkish end to end, never round-trip through English — because every translation hop is a place for a nuance or a proper noun to get quietly mangled.

The hard part is that the entire ecosystem around the model assumes English, and every one of those assumptions is a small landmine.

The CSV export is the perfect specimen, because it’s so dumb and it bit real users. An operator asks a question, likes the answer, clicks download, opens the file in Excel — and Excel, especially on Windows, does not assume a CSV is UTF-8. With no explicit marker it falls back to the system locale’s encoding, reads the multi-byte Turkish characters as if each byte were its own character, and produces the shredded mess from the screenshot. The file was always valid UTF-8. Excel just refused to believe it.
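The garbling is easy to reproduce without Excel. A minimal sketch, assuming the fallback code page is Windows-1252 (the real one depends on the machine's locale settings): encode the word as UTF-8, then decode the same bytes as a single-byte encoding.

```python
# Reproduce the garbling: valid UTF-8 bytes read as a single-byte code page.
# Assumption: the reader falls back to Windows-1252 ("cp1252" in Python).
word = "çalışan"
utf8_bytes = word.encode("utf-8")

# ç, ı and ş each take two bytes in UTF-8, so 7 characters become 10 bytes.
print(len(word), len(utf8_bytes))  # 7 10

# A single-byte decoder turns every byte into its own character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # Ã§alÄ±ÅŸan

# Nothing was lost: undoing the wrong decode recovers the original text.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == word)  # True
```

The last step is the reassuring part: the bytes on disk were never damaged, only misread.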

The fix is three bytes:

# Excel on Windows won't assume UTF-8 without a byte-order mark, so it
# misreads Turkish characters (çalışan -> Ã§alÄ±ÅŸan). Prepending a BOM
# states the encoding explicitly, and the export opens correctly.
# "\ufeff" is the BOM code point; in UTF-8 it encodes to EF BB BF.
csv_bytes = "\ufeff".encode("utf-8") + csv_text.encode("utf-8")

A byte-order mark at the front of the file. That’s the whole fix. It isn’t clever and it isn’t interesting, and finding it took an absurdly long time — because every tool I used (a Mac, a terminal, a reasonable editor) rendered the file perfectly. The bug only existed in the one place I wasn’t testing: a Turkish operator’s Windows laptop, opening Excel. The lesson isn’t about BOMs. It’s that localization bugs live in the gap between your environment and your user’s, and if you only ever look at your own screen, you will never once see them.
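For a slightly fuller picture, here is a sketch of an export helper plus the kind of check that would have caught this without a Windows laptop. The function names and the sample row are hypothetical, not the project's code. The only library behaviour it relies on is Python's built-in `utf-8-sig` codec, which writes the BOM on encode and strips it on decode.

```python
import codecs
import csv
import io


def export_csv_bytes(header, rows):
    """Serialize rows to CSV bytes that start with a UTF-8 BOM."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    # "utf-8-sig" prepends the bytes EF BB BF when encoding.
    return buffer.getvalue().encode("utf-8-sig")


def starts_with_bom(data):
    """Cheap regression check to run against every export."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical sample data with Turkish column names and values.
payload = export_csv_bytes(["şube", "çalışan"], [["Kadıköy", "Ayşe"]])
print(starts_with_bom(payload))

# Decoding with "utf-8-sig" strips the BOM again, so the round trip is clean.
print(payload.decode("utf-8-sig").splitlines()[0])
```

A check like `starts_with_bom` belongs in the test suite precisely because nothing on a Mac will ever fail visibly without it.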

This is what I mean by ninety-five percent everything-else. “Thinks in Turkish” isn’t a capability you switch on at the model. It’s a property you have to carry through every layer: the prompts, the schema, the model’s output, the chart labels, the number and date formatting, the encoding on the file going out the door. Miss one layer and the whole thing feels broken even when the intelligence underneath is flawless — because users don’t grade your model, they grade the spreadsheet that opened wrong.
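Number and date formatting is a good example of one such layer. The sketch below is illustrative only and assumes the usual Turkish conventions: a dot for thousands, a comma for decimals, and spelled-out month names. The helper names are made up for this example, and a real project might prefer a dedicated localization library over hand-rolled formatting.

```python
from datetime import date

# Turkish month names, hard-coded so the result does not depend on which
# locales happen to be installed on the server.
TR_MONTHS = ["Ocak", "Şubat", "Mart", "Nisan", "Mayıs", "Haziran",
             "Temmuz", "Ağustos", "Eylül", "Ekim", "Kasım", "Aralık"]

# Swap the separators of Python's default "1,234.56" style in one pass.
_SWAP_SEPARATORS = str.maketrans({",": ".", ".": ","})


def format_tr_number(value, decimals=2):
    """Format with '.' as thousands separator and ',' as decimal separator."""
    return f"{value:,.{decimals}f}".translate(_SWAP_SEPARATORS)


def format_tr_date(d):
    """Format a date as day, Turkish month name, year."""
    return f"{d.day} {TR_MONTHS[d.month - 1]} {d.year}"


print(format_tr_number(1234567.891))      # 1.234.567,89
print(format_tr_date(date(2026, 5, 25)))  # 25 Mayıs 2026
```

Hard-coding the month names trades flexibility for predictability: the output is the same on every machine, which matters more here than supporting a second language.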

And the defaults fight you the whole way. The libraries default to English, the encodings default to whatever was convenient in California, every tutorial’s examples are pure ASCII. Building in a non-English language means noticing and overriding a long tail of these, each one individually trivial and collectively the actual job. I’ve started thinking of it as a tax you pay for operating outside the language the tools were designed for — invisible right up until a screenshot of Ã§alÄ±ÅŸan lands in your inbox.

If I were advising someone starting a non-English LLM product: budget for the tax, and test in your users’ environment from day one, not your own. The model will speak their language fine. It’s the plumbing around it that defaults to English, and the plumbing is where the experience is won or lost.¹

From a consulting project building natural-language analytics for restaurant businesses. Customer details and schema are abstracted; the reasoning is as built. Code is illustrative.

Footnotes

Turkish has a genuinely nasty trap for the kind of string matching I rely on elsewhere: the dotless ı and the dotted i are different letters, so the correct lowercase of "I" is "ı", not "i". Python’s str.lower() ignores locale and returns "i" anyway, while locale-sensitive routines such as Java’s toLowerCase() switch to "ı" whenever the default locale is Turkish. Either way, case-insensitive comparison that’s correct in English silently does the wrong thing on Turkish text. Anywhere you canonicalize or match strings, the casing rules behind your lower() call quietly decide whether your matching works.
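A minimal sketch of Turkish-aware casing in Python, assuming the text is known to be Turkish. The helper names are hypothetical. The approach is to rewrite the two special I/i pairs by hand and then let the built-in methods handle every other letter.

```python
def turkish_lower(text):
    """Lowercase using Turkish rules for the two I/i pairs."""
    # Map the Turkish-specific pairs first, then let str.lower() do the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_upper(text):
    """Uppercase using Turkish rules for the two I/i pairs."""
    return text.replace("i", "İ").replace("ı", "I").upper()


print("IŞIK".lower())          # işik  (built-in, wrong for Turkish)
print(turkish_lower("IŞIK"))   # ışık
print(turkish_upper("izmir"))  # İZMİR
```

These helpers are only right for Turkish text. Applied to SQL keywords or ASCII identifiers they do damage of their own, turning "ID" into "ıd", so the choice of casing rule has to follow the kind of string being matched.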