Tabular Validator¶
The Tabular Validator validates tabular data — a table of typed rows — that a user submits, such as a Darwin Core occurrence export or a building energy meter CSV. Its primitive is the table, not the file format: CSV is simply the reader that ships first. TSV, Excel, and Parquet are planned as future readers in front of the same validation core, which is why the validator is called "Tabular" rather than "CSV".
The full design lives in ADR-2026-05-26 (docs/adr/2026-05-26-csv-validator.md
in the private validibot-project repo). This page documents the parts that are
implemented and how they fit together.
Status: usable end-to-end
The validator is configurable and runnable through the UI: the
reader/PREFLIGHT layer, native structured validation, row-stage CEL
assertions (the row.* compiled-once-per-run loop), the registered
TabularValidator, the full-screen Expected columns editor, and a
step-detail summary card are all in place. After applying migrations and
running manage.py sync_validators, an author can select Tabular
Validator, configure columns manually or through import/inference, add
row.* assertions, and run it. A row.<column> reference to a column not
declared in the step's schema is rejected at save time (the ADR's
column-existence obligation).
How a table becomes an in-memory model¶
Reading a submission happens in two phases that are deliberately kept apart, because conflating them hides real failures.
PREFLIGHT does the cheap checks that don't require loading the whole table. It rejects an oversized file by its byte size before anything is decoded, decodes the bytes (stripping a UTF-8 byte-order mark so it can't bleed into the first column name), resolves the delimiter, and peeks at just the first record to learn the column names and field count. PREFLIGHT never sees the body, so it cannot catch a ragged row buried on line 40,000 — that's the next phase's job. What PREFLIGHT guarantees is that reaching that failure is bounded: a 5 GB upload is turned away before it can exhaust the worker.
READ parses the body into a pandas dataframe. Every cell is read as a string with no type or "NaN" inference, so an empty cell is an empty string rather than a floating-point NaN, and parsing never depends on the host's locale. Two operators on two machines read the same file into the same values — which is what lets a downstream credential attest over a reproducible result. Parsing is strict: a ragged row or an unbalanced quote fails the read with a clear error rather than being silently repaired. This is a quality-first validator, not a best-effort cleaner.
The result is the shared in-memory model every later layer consumes: a
string-valued dataframe keyed by canonical column names, plus the row count.
The row count is len(df) — the number of parsed rows — which is not the same
as counting newlines, because a quoted field may legitimately contain a newline.
Delimiter resolution¶
The delimiter is decided, not guessed-and-hoped. If the settings declare one, that declaration is authoritative. If nothing is declared, the reader sniffs the delimiter from a bounded sample of the file. If a declaration and a sniff disagree, the read fails with a clear message rather than silently preferring one — an honest "you said comma, this looks tab-delimited" beats a wrong guess that would mis-parse every row.
Column names: one canonical name per column¶
Everything downstream — row.* keys, native column settings, the column-name
list exposed to assertions — uses a single logical name per column, resolved by
this precedence:
- Headered file → the header string, after trimming whitespace. Headers
must be safe to address: a blank name, a duplicate name, or two names that
collide ignoring case (
Latvslat) all fail by default, becauserow.*keys come from these names and an ambiguous header would make a reference silently wrong. - Headerless file with a declared name for that position → the declared name (aligned by position).
- Headerless file with no declared name → a synthesised
column_1,column_2, … (1-based).
So column_N is the headerless default, not a competing identity: declare a
name and it replaces the default. A headerless file is a first-class input —
the row namespace is always populated; the only difference is where the names
come from.
Native structured validation¶
The structured constraints a non-coder declares in the settings — required columns, per-column type, numeric range, string length, regex, enum, and uniqueness — are checked natively against the dataframe. They are never compiled to CEL; CEL is the separate lane for cross-field and conditional logic the settings can't express. The two lanes share one finding stream.
The settings are stored as a Frictionless Table
Schema descriptor. Validibot
adopts that vocabulary — fields with type and constraints, plus
primaryKey — but does not depend on the frictionless library; a descriptor
is parsed into a small internal schema model. An unrecognised type degrades to
string so an imported descriptor still loads.
Where the standard is documented. Table Schema's current canonical home is
datapackage.org (the linked specs.frictionlessdata.io
page is the v1 spec it grew from); both document the same
fields/type/constraints vocabulary we consume, so either is a valid
reference for authoring a descriptor. For why we adopt the vocabulary rather
than the frictionless processor — and why the National Archives CSV Schema
Language and W3C CSVW were considered and set aside — see ADR-2026-05-26,
section Standards alignment: Frictionless Table Schema and its Alternatives
considered subsection (docs/adr/2026-05-26-csv-validator.md in the private
validibot-project repo).
There are three ways to populate that descriptor: edit columns directly,
paste/import an existing descriptor, or infer one from a sample file (the
fastest common path — most users have a CSV, not a hand-written descriptor).
Import and inference populate the same ordered Django formset used for manual
editing. Inference (infer.py) reads a bounded sample, resolves the dialect and
column names through the normal reader, and guesses each column's type from its
values using the same coercion the validator enforces (so an inferred type
means what validation will check).
Candidate order is deliberate — integer before boolean (so 0/1 reads as
integer), and a column whose values don't all fit one type stays string.
Inference produces a starting point the author tightens: it picks types only,
never invents min/max/enum constraints.
Each cell is coerced from its string form to the declared type, deterministically
and locale-free: an empty cell is null, "1,000" is not a number, "5.0" is
not an integer, and dates are ISO 8601 only. This is the same coercion the
row-stage CEL layer will use, so the two never disagree about what a cell is.
Findings follow the reporting shape the whole validator uses: one finding per
failed check (per column), carrying the total failure count and up to
report_max_examples sample row numbers (10 by default) — never one finding per
failing row, so
a column with a million bad cells produces one readable finding, not a million.
Uniqueness semantics¶
Uniqueness is a native V1 check (single unique columns and single/composite
primaryKey), computed over the canonical typed values — so "1" and "1.0"
are the same key in a numeric column. The null rules are pinned:
uniquefollows SQL: nulls are exempt. Multiple missing values do not collide; only repeated non-null values are a violation.primaryKeyis unique and non-null. A null in any key component is its own violation (tabular.primary_key_null), independent of duplicate detection; a composite key is keyed on the tuple of its parts.
A note on blank lines¶
A wholly blank line (a trailing newline, an editor artifact) is skipped rather
than counted as an all-null data row — so it neither inflates num_rows nor
produces spurious "required value missing" findings. An empty field (,2) is
different: it is kept as a null cell and is subject to the required rule.
The validator¶
TabularValidator ties the pieces together, mirroring how the JSON Schema
validator works. Its configuration lives on the ruleset: ruleset.rules
(rules_text/rules_file) holds the Table Schema descriptor (the
structured column config), and ruleset.metadata holds the dialect
(delimiter, has_header, quotechar) and report_max_examples.
A run does six things: load the schema (a schema that won't parse is a
tabular.invalid_schema finding, not a crash); read the body (a read failure
becomes a finding carrying its tabular.* code); run native validation, mapping
each NativeFinding onto a platform ValidationIssue with the count and sample
rows preserved in meta; run the row-stage CEL loop; run the
column-stage CEL aggregate assertions; and run the standard dataset (i.*)
/ output CEL assertion lane. The i.* dataset signals
(num_rows, column_names, delimiter, …) are exposed for those assertions and
returned for downstream steps.
The validator registers through the normal auto-discovery path: its config.py
declares a ValidatorConfig for ValidationType.TABULAR, and sync_validators
creates the DB row. CSV is carried as a plain-text submission
(SubmissionDataFormat.CSV).
Row-stage CEL assertions¶
Cross-field and conditional row rules — row.minimumDepthInMeters <=
row.maximumDepthInMeters, row.eventDate <= now() — are CEL assertions
evaluated per row. They are stored as RulesetAssertion rows tagged
options["tabular_stage"] == "row"; the generic assertion lane skips those
(a row.*-only expression would otherwise misclassify and evaluate against an
unbound context), so the validator owns their evaluation.
Authoring. The assertion form accepts the row.* namespace only on a
tabular step (it's rejected elsewhere as an unknown identifier), and on save
it derives the stage from the expression: a row.* reference tags the assertion
tabular_stage="row", a col.* reference tags it "column", and anything else
(i.*/s.*) tags it "dataset".
The form also checks that every row.<column> (dot or bracket) names a column
declared in the step's stored schema, rejecting a typo'd or absent column at
save time (skipped only when no schema is configured yet).
The engine (row_eval.py) follows the ADR's performance strategy: each
assertion's program is compiled once per run, then evaluated against every
row with no per-eval thread. Cost is bounded by the row/assertion caps, the
author-time expression-shape limits (checked at compile), and a single
wall-clock budget for the whole pass. Each row's values are bound into row.*
as their typed celpy values (via the shared coercion.py), alongside s.*
(workflow signals) and i.* (dataset metadata). now() is the pinned run
clock (run.started_at), not the wall clock — so a "not in the future" check
is reproducible for the run.
The determinism rule the engine enforces: an assertion that evaluates to null
or raises is a failure with a distinct code (tabular.assertion_null /
tabular.assertion_error), never a silent pass — a null cell must not satisfy a
comparison by comparing as null. A false result is the ordinary rule violation
(tabular.row_assertion_failed). Each assertion produces at most one finding per
outcome class, with a count and capped sample rows, so a million-row failure is
one readable finding.
Column-stage CEL assertions¶
Column assertions run once after row validation against col.*. Each declared
column exposes aggregates computed from the same canonical typed values used by
native and row validation:
distinct_count,null_count,non_null_count, andnull_ratiominandmaxsumforintegerandnumbercolumns
For example, col.depth.null_ratio < 0.05 limits missing depth values, while
col.temperature.max <= 60 caps the observed range. Empty cells and coercion
errors count as null. An optional declared column that is absent from the file
still has a stable empty aggregate, while actual column presence remains
available through i.column_names.
Column references are checked against the saved schema at author and import
time. Unknown aggregate names and sum on non-numeric columns are rejected
before the assertion is persisted. Null, non-boolean, and evaluation-error
results fail explicitly, matching the row-stage determinism contract.
Configuring a step (the settings editor)¶
A tabular step is configured on the normal full-screen step settings page
(steps/<id>/settings/). TabularStepConfigForm owns the ordinary step and
dialect fields plus an ordered TabularColumnFormSet. Each column form exposes
the Table Schema field vocabulary: name/type, required, unique, primary-key
membership, numeric bounds, string length/pattern, and enum values. A narrow
Validibot extension, x-validibot-requiredWhenPresent, backs the no-CEL
Required when another column exists control. Encoding is
pinned to UTF-8 in V1 (not an editable field):
submitted content reaches the validator already decoded as UTF-8, so a per-step
encoding setting could only silently corrupt non-UTF-8 input — honoring other
encodings needs a raw-bytes read path, a future slice. On save,
build_tabular_config writes the descriptor to ruleset.rules_text and the
dialect to ruleset.metadata — the exact places the validator reads back at run
time, so the editor and the engine meet at the ruleset.
build_tabular_descriptor() converts the cleaned formset back into a
descriptor. It replaces the keys the editor owns while preserving unknown
top-level, field-level, and constraint-level metadata from an imported
descriptor. This prevents an edit to one range from stripping titles or
extension keys the UI does not expose. Primary-key checkboxes are serialized
in form order, and primary-key membership implies required=true.
Four HTMx endpoints support the authoring flow:
tabular/columns/returns a correctly prefixed formset row and updatesTOTAL_FORMSout of band.tabular/schema/import/validates pasted or uploaded JSON and returns a compatibility report plus a replacement preview.tabular/schema/infer/reads the bounded upload, returns a replacement preview, and updates resolved dialect controls out of band.tabular/schema/apply/validates the preview payload again and replaces the workspace only after explicit author confirmation.
The endpoints return server-rendered partials and never create temporary database records. Invalid import/inference responses bind the posted formset back into the replacement workspace, so unsaved column edits survive. The ordinary POST path still accepts a pasted descriptor or sample upload without HTMx, preserving progressive enhancement.
The formset enables Django's native ordering field. Up/down buttons reorder the
DOM and rewrite the hidden ORDER values; build_tabular_descriptor() uses
ordered_forms, so headerless-file position and composite-key order survive
the POST. Added rows receive focus after the HTMx swap. Request buttons use
hx-disabled-elt, hx-indicator, and hx-sync to prevent duplicate or
competing replacements.
Import compatibility is intentionally explicit. schema.py reports preserved
but unenforced features including foreignKeys, custom missingValues,
locale-specific number parsing, formats, unsupported scalar types, and unknown
standard constraints. The same notices appear in HTMx previews and as Django
messages on the no-JavaScript save path.
Saved descriptors can be downloaded from tabular/schema/export/. Encoding is
fixed to UTF-8 in V1 because the current submission API supplies decoded text;
presenting a selectable encoding before a raw-byte path exists would create a
setting the validator cannot honor.
The step-detail page shows a read-only summary card (reader, delimiter,
header, total/required columns, and dataset/row/column assertion counts) with
an "Edit settings" link. The existing assertion surface renders Tabular
assertions in stage-specific groups. Dataset and row groups each have a scoped
Add action, and the column group provides the same scoped flow for col.*.
A global Add action first asks which execution stage the rule belongs to.
The CEL editor provides stage-aware namespace hints and schema-derived
completions; row assertions also expose a per-assertion example-row limit.
Assertion cards prefer the optional human-readable description and otherwise
show the CEL expression itself. This fallback reads the authoritative
rhs["expr"] payload, so legacy or imported assertions remain identifiable
even when their compatibility target_data_path field is blank.
Canceling Tabular settings returns to the workflow editor and focuses the
originating step card.
Limits¶
"Human-scale, in-memory" is only a real contract if it's bounded, so the reader enforces caps with safe, deployment-tunable defaults. Exceeding a cap fails the run with a clear finding; it never silently truncates.
| Cap | Default | Enforced at |
|---|---|---|
| File size | 50 MB | PREFLIGHT (before the load) |
| Columns | 1,024 | PREFLIGHT (from the first record) |
| Rows | 1,000,000 | READ |
Where the code lives¶
| Module | Responsibility |
|---|---|
validations/validators/tabular/preflight.py |
PREFLIGHT: size/encoding/dialect/first-record checks, the TabularDialect and TabularLimits settings, and the tabular.* error codes |
validations/validators/tabular/readers/csv.py |
The CSV reader: column-name resolution, strict load into a dataframe, the row cap |
validations/validators/tabular/schema.py |
TabularSchema / FieldSpec model and the Table Schema descriptor parser |
validations/validators/tabular/infer.py |
Schema inference: read a sample, guess column types, return a descriptor + resolved dialect |
validations/validators/tabular/coercion.py |
Deterministic, locale-free coercion of a string cell to its declared type (shared with row-stage CEL) |
validations/validators/tabular/native.py |
Native structured validation: produces NativeFindings for required/type/range/length/pattern/enum/uniqueness checks |
validations/validators/tabular/row_eval.py |
Row-stage CEL engine: compile-once-per-run, evaluate per row, typed row.* binding, null/error-as-failure, wall-clock budget |
validations/validators/tabular/column_eval.py |
Column-stage CEL engine: deterministic typed aggregates and one-shot col.* evaluation |
validations/validators/tabular/validator.py |
TabularValidator: reads the submission, runs native + row + column validation, maps findings to ValidationIssue, runs the dataset/output CEL lane |
validations/validators/tabular/config.py |
The ValidatorConfig that makes the validator discoverable and DB-syncable |
workflows/forms.py (TabularStepConfigForm, TabularColumnFormSet) |
Step settings, column validation, and metadata-preserving descriptor serialization |
workflows/views/steps.py (Tabular*View) |
HTMx row/import/inference endpoints and settings-page context |
workflows/views_helpers.py (build_tabular_config) |
Writes the descriptor to ruleset.rules_text and dialect to ruleset.metadata on save |
workflows/step_configs.py (TabularStepConfig) |
Typed display config for the step-detail summary card |
templates/workflows/partials/tabular_* |
Full-screen settings sections, schema workspace, and column-card partials |
static/src/ts/tabularSchema.ts |
Type-aware constraint visibility, deletion, and live column count |
templates/.../components/tabular_config_card.html |
The read-only step-detail summary card |
Read failures are raised as a TabularReadError carrying a machine-readable
code (always prefixed tabular., e.g. tabular.file_too_large,
tabular.parse_error). Native validation returns NativeFindings carrying the
same kind of tabular.* code (e.g. tabular.out_of_range,
tabular.unique_violation) plus a count and sample rows, so the validator can
emit structured findings without matching on message text.
Import and export¶
The Tabular Validator is the one validator that ships a custom step serializer
for workflow import/export. Its row and column
assertions may only reference columns declared in the Table Schema — a rule the
step-editor form enforces but import bypasses — so
TabularStepSerializer.validate_imported_ruleset re-applies the check on import,
raising vaf.tabular_unknown_column for an undeclared reference. Every other
inline validator uses the generic base serializer unchanged.