Natural Language to SQL in 2026: what actually works
An opinionated look at the current state of NL-to-SQL: what schema introspection, dialect awareness, and execution loops actually need to deliver in production.
Two years ago, every demo of natural-language-to-SQL looked the same: a clean schema, three tables, a 'show me revenue by month' question, and a chart popping out. By 2026 the question is different. Real production data is messy — hundreds of tables, vague column names, JSONB fields, partitioned warehouses, and analysts who have learned to distrust anything that doesn't show its work. So which NL-to-SQL approaches actually hold up, and which ones still don't?
Schema introspection is non-negotiable
The single largest predictor of accuracy is how well the system reads your schema. Foreign keys, materialized views, JSONB shapes, partition columns — all of it. Systems that rely on the LLM's pretraining alone fail in production. Systems that pre-index every column, every relationship, every distinct value sample, succeed.
Practically, this means a good NL-to-SQL agent in 2026 should run a schema crawl on connect, store column statistics, and re-crawl on demand. Not a one-shot prompt with the schema dumped into context.
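As a rough illustration of what that crawl involves, here is a minimal sketch against Postgres using psycopg2 and the standard information_schema views. The crawl_schema function, its return shape, and the choice to sample only text columns are assumptions made for the example, not any particular product's API.

```python
# Minimal sketch of a schema crawl on connect, targeting Postgres via psycopg2.
# crawl_schema, its return shape, and the text-only sampling are illustrative.
import psycopg2

def crawl_schema(dsn: str, sample_limit: int = 20) -> dict:
    """Index columns, foreign keys, and a small distinct-value sample per column."""
    index = {"columns": [], "foreign_keys": [], "samples": {}}
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Every user-facing column, with its declared type.
        cur.execute("""
            SELECT table_schema, table_name, column_name, data_type
            FROM information_schema.columns
            WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
        """)
        index["columns"] = cur.fetchall()

        # Foreign-key relationships, so the model can see how tables join.
        cur.execute("""
            SELECT tc.table_name, kcu.column_name,
                   ccu.table_name AS foreign_table, ccu.column_name AS foreign_column
            FROM information_schema.table_constraints tc
            JOIN information_schema.key_column_usage kcu
              ON tc.constraint_name = kcu.constraint_name
            JOIN information_schema.constraint_column_usage ccu
              ON tc.constraint_name = ccu.constraint_name
            WHERE tc.constraint_type = 'FOREIGN KEY'
        """)
        index["foreign_keys"] = cur.fetchall()

        # A small distinct-value sample for text columns, to ground vague names
        # like 'status' or 'type' in real values.
        for schema, table, column, data_type in index["columns"]:
            if data_type not in ("text", "character varying"):
                continue
            cur.execute(
                f'SELECT DISTINCT "{column}" FROM "{schema}"."{table}" LIMIT %s',
                (sample_limit,),
            )
            index["samples"][(schema, table, column)] = [row[0] for row in cur.fetchall()]
    return index
```

The point is that all of this is queryable up front and cacheable, so the model never has to guess what a column is called or how two tables join.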
Dialect awareness is where most products quietly fail
Postgres' DATE_TRUNC isn't BigQuery's DATE_TRUNC. SQL Server's STRING_AGG isn't MySQL's GROUP_CONCAT. The vast majority of 'prompt the LLM and pray' implementations generate ANSI SQL that breaks at runtime. The fix is mechanical: tell the model the dialect explicitly, validate the SQL with the database's own parser before executing, and re-prompt if it fails.
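Both halves of that fix are a few lines each. The sketch below assumes a Postgres target and psycopg2; the DIALECT_HINT prompt line and the validate_sql helper are illustrative names, and EXPLAIN is used because it makes the database parse and plan a statement without executing it.

```python
# Sketch of the two mechanical fixes, assuming a Postgres target and psycopg2.
# DIALECT_HINT and validate_sql are illustrative, not a specific product's API.
import psycopg2

# 1. Tell the model the dialect explicitly, in the system prompt.
DIALECT_HINT = "Target dialect: PostgreSQL. Use only functions that exist in this dialect."

# 2. Validate with the database's own parser before executing.
def validate_sql(conn, sql: str) -> str | None:
    """Return None if the SQL parses and plans cleanly, else the database's error text."""
    try:
        with conn.cursor() as cur:
            cur.execute("EXPLAIN " + sql)  # parses and plans, does not execute
        return None
    except psycopg2.Error as exc:
        conn.rollback()  # a failed statement would otherwise poison the transaction
        return str(exc)
```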
The retry loop matters more than the model
GPT-class models in 2026 are good enough that the bottleneck isn't 'can the model write SQL' — it's 'what happens when the SQL fails'. The systems that work in production catch the error, feed it back to the model with the offending query, and retry. The systems that fail just throw the error at the user.
A good loop has three checkpoints: (1) a syntax check via the database itself, (2) a sample-row sanity check (does the result look like what was asked?), and (3) an optional human-in-the-loop confirmation for write operations.
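Put together, the loop is short. In the sketch below, generate_sql, validate_sql (the EXPLAIN-based check from the previous section), and looks_reasonable are passed in as stand-ins for your own LLM call and sanity heuristics; none of these names refer to a real library.

```python
# Minimal sketch of the execute/retry loop. The three callables are stand-ins
# for your own LLM call, syntax check, and sanity heuristics.
def answer(conn, question, generate_sql, validate_sql, looks_reasonable,
           dialect="postgres", max_retries=3):
    feedback = None
    for _ in range(max_retries):
        sql = generate_sql(question, dialect=dialect, error_feedback=feedback)

        # Checkpoint 1: syntax check via the database itself.
        error = validate_sql(conn, sql)
        if error is not None:
            feedback = f"Previous query:\n{sql}\nDatabase error:\n{error}"
            continue  # re-prompt with the offending query and the error

        with conn.cursor() as cur:
            cur.execute(sql)
            rows = cur.fetchmany(10)

        # Checkpoint 2: sample-row sanity check against the original question.
        if not looks_reasonable(question, sql, rows):
            feedback = f"The query ran but the sample rows look wrong:\n{sql}"
            continue

        # Checkpoint 3 (write operations only, not shown): require a human confirm.
        return sql, rows

    raise RuntimeError("could not produce SQL that passes validation")
```

The design choice that matters is that the model always sees its own failed query alongside the database's error text, rather than getting a fresh prompt with no memory of the failure.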
Where it still doesn't work
- Vague schemas: a column named 'amount' with no comments, no foreign keys, no consistent format. The model guesses.
- Window functions over irregular partitions. Still error-prone.
- Multi-step business questions that humans would model as a CTE chain. The model often skips intermediate aggregations.
- Anything requiring tribal knowledge ('exclude internal accounts': which accounts are 'internal'?). Without a semantic layer, the model can't know; a sketch of what such a definition could look like follows below.
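On that last point, the fix is less about a smarter model and more about writing the tribal knowledge down somewhere the agent can read it. A hypothetical semantic-layer entry might look like the snippet below; the structure and field names are invented for illustration and don't follow any particular tool's format.

```python
# Hypothetical semantic-layer entry: tribal knowledge captured as data the agent
# can read. The structure and names are invented for illustration only.
SEMANTIC_LAYER = {
    "definitions": {
        # 'exclude internal accounts' becomes an explicit, auditable filter
        "internal account": {
            "filter": "customers.is_internal = TRUE",
            "note": "test and employee accounts, flagged by finance",
        },
    },
    "metrics": {
        "revenue": {
            "expression": "SUM(invoices.amount)",
            "default_filters": ["NOT customers.is_internal"],
        },
    },
}
```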
What we ship at Data Talks
We treat every connector as a first-class consumer of dialect rules. The agent reads schema on connect, statistics on demand, and runs a syntax-validate loop against the actual database before returning a result. Every answer ships with the raw SQL so analysts can audit it. Where we don't yet match a tool like Wren AI is on the semantic-layer side; that's on the roadmap.
The honest take in 2026: NL-to-SQL is production-ready for analytical questions over well-modeled data, with the right scaffolding around the LLM. It is not a replacement for analysts on poorly-modeled data. The gap closes every quarter, but it isn't zero yet.