Schema-aware validation: how to keep an LLM from hallucinating SQL

The first time it happened, the query ran.

I had asked a natural-language question against a real warehouse — not the Chinook sample, a catalog with dozens of tables — and the model returned SQL that referenced Invoice.BillingAddress. Plausible name. The table exists; the column does not. The database did not throw. The result set was wrong in a quiet way: fewer rows than expected, shaped like an answer, useless for the decision I was trying to make.

That failure mode is what schema-aware validation is for. Not prettier error messages from Postgres. Not asking the model to try harder. A deterministic pass over the generated SQL before anything executes, grounded in the same introspected catalog the runtime already trusts.

Why “trust the model” and “catch DB errors” both fail

The naive pipeline is familiar: prompt the model, run the SQL, show errors if the database complains.

It breaks in two different directions.

Plausible hallucinations. Models are good at names that look like they belong. Customer.Email when the column is EmailAddress. A join key that exists on a different table in this schema. Syntax is valid; identifiers are wrong. Many engines will still execute something.

Silent wrong answers. When the hallucinated column does not appear in the query at all — wrong filter, wrong join — you may get rows back. No exception. The UI looks fine. The analyst loses an afternoon.

Demo databases hide this. Five tables, clean foreign keys, every example in the training distribution. Production catalogs have naming collisions, legacy tables, and fifty entities that never appeared in a blog post about text-to-SQL. The gap between a slick chat interface and something you would attach to a real warehouse is mostly engineering around the model, not a larger context window.

Asking the model to self-check has the same problem in a different costume. Another stochastic pass, no guarantee it uses the same catalog snapshot, no structured issue list you can test in CI.

Executing and catching errors is better than nothing, but it optimizes for the failure mode you can see. The one you cannot see is worse.

Parse, walk, check — before execution

The approach in Lumen is boring on purpose.

Parse the candidate SQL with sqlglot into an AST.
Walk scopes with sqlglot’s scope analysis — each SELECT, including CTEs and subqueries, is its own validation unit.
For each scope, check physical table names against the introspected schema.
For each column reference, check against the columns available in that scope (including columns projected from subqueries and CTEs when they can be inferred).
Collect issues into a structured list. If any error-severity issue remains, do not execute.

Why parse-and-check beats ask-the-model-to-check: the catalog is the authority, the walk is deterministic, and the same SQL either passes or fails the same way in the CLI, in tests, and in CI. You can unit-test the validator without calling an API.

Why it beats execute-and-catch: you never pay warehouse round-trips for SQL that was doomed from identifier errors, and you avoid returning rows built on a misread column list. The user sees validation issues before a silent wrong answer.

The validator returns a small result type:

class ValidationIssue(BaseModel):
    severity: Literal["error", "warning"]
    code: Literal[
        "syntax_error",
        "unknown_table",
        "unknown_column",
        "ambiguous_column",
        "unqualified_column",
    ]
    message: str
    table: str | None = None
    column: str | None = None


class ValidationResult(BaseModel):
    valid: bool
    issues: list[ValidationIssue]
    parsed_sql: str | None

valid is false when any error-severity issue exists. Warnings exist for cases like unqualified columns in multi-table scopes where the engine might still disambiguate — errors are what block execution.

The hard parts (including a real bug)

Alias resolution is the work. A column might refer to a physical table, a subquery alias, or a CTE. The validator expands SELECT * from a single physical table when it can; derived scopes get their output columns from the inner query walk. When the inner query projects expressions without aliases, outer references fail validation until the model names them — which is intentional pressure toward explicit SQL.

CTEs and nested subqueries each get a scope. UNION branches are handled as separate selects. None of this is magic; it follows sqlglot’s scope tree and a lot of unit tests on Chinook-shaped SQL.

The bug that actually hurt was subtler. For a predicate like:

SELECT * FROM Customer
WHERE CustomerId IN (SELECT CustomerId FROM Invoice)

sqlglot attaches column nodes to scopes. Early on, filtering columns with scope.local_columns for validation was wrong: a column inside the IN (SELECT …) subquery was treated as if it belonged to the outer SELECT, so valid inner references failed and some outer references looked valid when they were not.

I caught it on a Chinook-shaped integration test, not in production — a CustomerId inside an IN (SELECT …) marked unknown while the same column outside the subquery validated fine. That kind of asymmetry is a signal the scope walk is wrong, not that the catalog is. The fix was to attribute each column to its nearest containing Select in the AST, then only validate columns against the scope that owns that select:

def _nearest_select(node: exp.Expr | None) -> exp.Select | None:
    n: exp.Expr | None = node
    while n and not isinstance(n, exp.Select):
        n = n.parent
    return n if isinstance(n, exp.Select) else None


def _own_columns(scope: Scope) -> list[exp.Column]:
    sel = scope.expression
    if not isinstance(sel, exp.Select):
        return []
    return [c for c in scope.columns if _nearest_select(c) is sel]

That heuristic matches the Chinook-style patterns in the test suite. It is not a formal proof about every dialect edge case — highly correlated subqueries can still blur ownership — but it turned a class of false positives and false negatives into something regressible. The unit tests under tests/unit/test_validator.py are the contract; sqlglot upgrades get run against them.

Physical tables are checked first within each scope: if the base name is not in the introspected schema, you get an unknown_table issue and the walk continues so you collect more than one error per statement when useful. Derived tables — subqueries in the FROM clause — are validated by walking the inner scope first, then advertising the inner select list to the outer scope. When the inner query uses SELECT * from a single table, expansion is straightforward. When it does not, the outer scope only sees columns that were named explicitly. That nudges generated SQL toward the kind of explicit projections you would ask a human analyst to write.

Dialect strings flow through the same path as generation. The validator parses with sqlglot configured for the warehouse dialect (Postgres, DuckDB-backed SQLite, and so on) so the AST matches what the generator was prompted for. Parse failures become a single syntax_error issue; there is no partial validation on a broken tree.

Corrective retry, bounded

Validation is not only a gate. When the model’s first SQL fails, Lumen can re-prompt with the rejected statement and the issue list — including, for unknown columns, the available names on that table so the correction is grounded, not guessed.

The correction system message stays minimal. The user block repeats the original question, the invalid SQL, and lines like:

- [unknown_column] Column 'BillingAddress' is not available on table 'Invoice'.
  Available columns: InvoiceId, CustomerId, InvoiceDate, ...

Default policy allows two retries after the initial generation (three generations total). Enough to fix a join mistake and a typo; not an infinite loop burning tokens. Each retry is a full generation call with the same schema context repeated — expensive, which is why the cap stays low. The CLI exits non-zero when validation still fails after the last attempt unless you pass a debug flag to inspect the broken SQL.

Read-only enforcement is separate and non-negotiable: if the AST root is not a SELECT or UNION of selects, validation fails with a clear message. INSERT, UPDATE, DELETE, and DDL never reach the warehouse through this path.

What this catches — and what it does not

Catches: hallucinated table and column identifiers relative to the introspected catalog; many syntax problems via parse failure; non-read-only statement shapes before execution.

Does not catch: semantically wrong but syntactically valid SQL. The model can choose the wrong join that still references real columns, apply an incorrect filter on a real field, or aggregate the wrong metric while every identifier resolves. Fixing that class of error needs semantic constraints, eval harnesses, and human review — not a catalog walk alone.

I would rather say that plainly than imply the validator makes NL-to-SQL “safe.” It removes a large, common failure mode — the plausible wrong name — before it costs money or trust.

Quantitative effectiveness on benchmark suites is still being finalized for the project site. Qualitatively: in development, the hallucinated-identifier class of failures is routinely caught before execution, which is the problem this layer was built for.

Where this sits in the request path

In Lumen’s CLI and API, validation is not a hidden side effect. The user sees interpretation before generation, then SQL, then either a validation failure with issues listed or execution. The HTTP API exposes the same pipeline; the Angular UI surfaces validation state on the Ask flow. That visibility matters: when validation fails, the operator knows the catalog rejected the statement — not that the database is down, not that the question was impossible, but that the proposed SQL did not match the schema snapshot in use.

Keeping the schema snapshot aligned with the warehouse is a separate operational concern. Introspection runs when you connect; the semantic YAML is validated against that catalog at startup. If someone adds a column in production and does not refresh introspection, the validator is only as current as the catalog you gave it. That is not a flaw in parse-and-check; it is a reminder that validation is about a specific, versioned view of the warehouse, which is still better than no view at all.

The model is a component

The reliability of a natural-language analytics tool does not come from the model trying harder. It comes from the ordinary software around it: a semantic layer that names metrics without guessing joins, interpretation shown before execution, a parser-backed validator tied to the live catalog, bounded retries with structured feedback, and tests that fail when scope attribution regresses.

The model proposes SQL. The system decides whether that SQL is allowed to run. That division of labor is not a branding line — it is the only way I have found to keep a demo from becoming a liability on a real schema.