Validator Containers (Advanced Validators)¶
Validator containers run advanced validations like EnergyPlus simulations and FMU execution. They support two deployment modes:
- GCP Cloud Run Jobs (production): Async execution with GCS storage and callbacks
- Docker Compose (self-hosted): Sync execution with local filesystem storage
GCP Mode: Cloud Run Jobs (web/worker split)¶
We deploy one Django image as two Cloud Run services:
- $GCP_APP_NAME-web (APP_ROLE=web): Public UI + public API.
- $GCP_APP_NAME-worker (APP_ROLE=worker): Private/IAM-only internal API (callbacks).
Validator jobs (EnergyPlus, FMU, etc.) run as Cloud Run Jobs and call back to the worker service using Google-signed ID tokens (audience = callback URL). No shared secrets.
In environments with a custom public domain (production), SITE_URL points at the public domain (for example https://validibot.com) while WORKER_URL points at the worker service *.run.app URL. Callbacks and scheduled tasks should always target WORKER_URL, never SITE_URL.
Flow overview¶
sequenceDiagram
participant Web as web (APP_ROLE=web)
participant Worker as worker (APP_ROLE=worker)
participant JobsAPI as Cloud Run Jobs API
participant Job as Validator Job (SA)
Web->>JobsAPI: jobs.run (web SA)
JobsAPI-->>Job: Start job with env VALIDIBOT_INPUT_URI
Job->>Job: Download input.json + files from GCS
Job->>Worker: Callback with result_uri (ID token from job SA)
Worker->>Worker: Verify ID token + persist results
IAM roles involved:
- Web/Worker service account ($GCP_APP_NAME-cloudrun-{stage}): Custom validibot_job_runner role on the validator job so Django can call the Jobs API with overrides (for the VALIDIBOT_INPUT_URI env var). This role includes the run.jobs.run and run.jobs.runWithOverrides permissions.
- Validator job service account ($GCP_APP_NAME-validator-{stage}): Dedicated least-privilege SA used by validator Cloud Run Jobs. Has roles/storage.objectAdmin on the stage bucket (read inputs, write outputs) and roles/run.invoker on $GCP_APP_NAME-worker (for callbacks). Does not have access to secrets, Cloud SQL, Cloud Tasks, or KMS.
- Worker: private, only allows authenticated calls; callbacks sent to the web service are rejected.
Custom IAM Role¶
The standard roles/run.invoker role only includes run.jobs.run, but triggering jobs with environment variable overrides (like VALIDIBOT_INPUT_URI) requires run.jobs.runWithOverrides. We use a project-level custom role:
# Role: projects/$GCP_PROJECT_ID/roles/validibot_job_runner
# Permissions: run.jobs.run, run.jobs.runWithOverrides
This role is automatically granted by the just gcp validator-deploy command.
Why env + GCS pointer: Cloud Run Jobs only accept per-run overrides via env/command; we keep large envelopes in GCS and pass a small VALIDIBOT_INPUT_URI env so the request stays small and the job can fetch full inputs at runtime.
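The pointer pattern can be sketched as follows (function names are hypothetical, for illustration only): the launcher uploads the envelope to GCS out of band and passes only its URI in the per-run override; inside the job, the env var is resolved back into the full payload.

```python
import json
import os


def launch_override(input_uri: str) -> dict:
    """Build the small per-run override: just a pointer, never the
    full envelope, so the Jobs API request stays small."""
    return {"env": [{"name": "VALIDIBOT_INPUT_URI", "value": input_uri}]}


def read_envelope(fetch) -> dict:
    """Inside the job: resolve the pointer into the full envelope.
    `fetch` abstracts the storage client (gs:// download, file:// read)."""
    uri = os.environ["VALIDIBOT_INPUT_URI"]
    return json.loads(fetch(uri))
```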
Status tracking: We record the Cloud Run execution name and a job_status using CloudRunJobStatus (PENDING/RUNNING/SUCCEEDED/FAILED/CANCELLED) in launch stats for observability and fallback polling; run/step lifecycle still uses ValidationRunStatus/StepStatus.
Why we use a callback_id in addition to run_id¶
Cloud Run retries callbacks if delivery fails. The run ID tells us which resource to update, but it does not distinguish one delivery attempt from another. Without a per-callback token we would reapply findings and status every time the platform retries, or we would have to drop all later callbacks for that run.
The launcher generates a unique callback_id for each job execution and puts it into the input envelope. The validator echoes it back in the callback. The worker uses that ID to fence retries: the first delivery creates a receipt; any repeat with the same callback_id returns immediately as a replay. This lets us ignore duplicate deliveries while still accepting legitimate future callbacks for the same run (for example, another step or a rerun).
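The fencing behavior can be sketched like this (an in-memory illustration; the real worker presumably persists receipts in the database so the fence survives restarts):

```python
class CallbackReceipts:
    """Idempotency fence for retried callback deliveries."""

    def __init__(self):
        self._seen: set[str] = set()

    def accept(self, callback_id: str) -> bool:
        """Return True on first delivery, False on a replay."""
        if callback_id in self._seen:
            return False
        self._seen.add(callback_id)
        return True


def handle_callback(receipts, run_id: str, callback_id: str, apply) -> str:
    # Fence on callback_id, not run_id: a later legitimate callback for
    # the same run (another step, a rerun) carries a fresh callback_id.
    if not receipts.accept(callback_id):
        return "replay"
    apply(run_id)
    return "applied"
```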
Deployment steps¶
- Build/push Django image (same for web/worker)
- Deploy web: --allow-unauthenticated --set-env-vars APP_ROLE=web
- Deploy worker: --no-allow-unauthenticated --set-env-vars APP_ROLE=worker
- Set WORKER_URL in the stage env file to the worker service URL (see below)
- Grant roles/run.invoker on $GCP_APP_NAME-worker to each validator job service account
- Validator jobs:
  - Tag with labels: validator=<name>, version=<git_sha>
  - Env: VALIDATOR_VERSION=<git_sha>
  - The callback client mints an ID token via the metadata server; the Django callback view returns 404 when not running as worker.
To populate WORKER_URL for a stage, fetch the worker service URL and add it to the stage env file:
# prod example
gcloud run services describe $GCP_APP_NAME-worker \
--region $GCP_REGION \
--project $GCP_PROJECT_ID \
--format='value(status.url)'
Then update your env file (.envs/.production/.google-cloud/.django), run just gcp secrets prod, and redeploy.
Deploying Validator Jobs¶
Validator containers can be deployed using either justfile. Both are equivalent and kept in sync:
| Location | Command | Notes |
|---|---|---|
validibot/ |
just gcp validator-deploy energyplus dev |
Uses symlink to validibot_validators_dev/ |
validibot_validators/ |
just deploy energyplus dev |
Native repo, shorter command |
Quick deploy¶
# From validibot_validators directory
just deploy energyplus dev # Deploy EnergyPlus to dev
just deploy fmu prod # Deploy FMU to prod
just deploy-all dev # Deploy all validators to dev
# From validibot directory
just gcp validator-deploy energyplus dev
just gcp validators-deploy-all dev
What the deploy command does¶
- Builds the container image for the linux/amd64 platform (required by Cloud Run)
- Pushes to Artifact Registry at $GCP_REGION-docker.pkg.dev/project-xxx/validibot/
- Deploys the Cloud Run Job with:
  - Stage-appropriate job name ($GCP_APP_NAME-validator-energyplus-dev for dev, $GCP_APP_NAME-validator-energyplus for prod)
  - Dedicated validator service account ($GCP_APP_NAME-validator-dev@... for dev)
  - Memory (4Gi), CPU (2), timeout (1 hour), no retries
  - Labels for tracking (validator=energyplus, stage=dev, version=abc123)
- Grants IAM permissions:
  - Adds the custom validibot_job_runner role to the main SA so the web/worker service can trigger the job with env overrides
  - Grants roles/run.invoker on the worker service to the validator SA so the job can POST callbacks
Viewing logs and job status¶
# From validibot_validators directory
just list-jobs # List all validator jobs
just describe-job energyplus dev # Show job details
just logs energyplus dev # View recent logs
# From validibot directory (equivalent)
gcloud run jobs list --filter "name~$GCP_APP_NAME-validator" --region $GCP_REGION
Deleting a validator job¶
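There is no dedicated just recipe documented here; a likely sketch using the standard gcloud CLI (job name follows the stage naming convention above):

```shell
# Hypothetical example: delete the dev EnergyPlus validator job.
# Prod jobs omit the -dev suffix (see naming above).
gcloud run jobs delete $GCP_APP_NAME-validator-energyplus-dev \
  --region $GCP_REGION \
  --project $GCP_PROJECT_ID \
  --quiet
```

Deleting the job does not remove the image from Artifact Registry or revoke the IAM grants made at deploy time.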
Multi-Environment Architecture¶
Validator containers are stage-agnostic: the same container image is deployed to dev, staging, and prod. All stage-specific configuration is passed at runtime, not build time.
What's baked into the container (build time)¶
Nothing stage-specific. The container includes:
- EnergyPlus binary (or FMU runtime)
- Python dependencies
- Validator code
What's passed at runtime (job execution)¶
When Django triggers a validator Cloud Run Job execution, it passes:
| Source | Data | Example |
|---|---|---|
| VALIDIBOT_INPUT_URI env var | GCS path to input envelope | gs://$GCP_APP_NAME-storage-dev/private/runs/run456/input.json |
| Input envelope | context.callback_url | https://$GCP_APP_NAME-worker-dev-xxx.run.app/api/v1/validation-callbacks/ |
| Input envelope | context.execution_bundle_uri | gs://$GCP_APP_NAME-storage-dev/private/runs/run456/ |
| Input envelope | Input file URIs (IDF, EPW, etc.) | gs://$GCP_APP_NAME-storage-dev/private/runs/run456/model.idf |
The validator reads the input envelope, downloads files from the provided GCS URIs, runs the simulation, uploads outputs to the execution bundle URI, and POSTs results to the callback URL.
Stage isolation¶
Stage isolation is enforced by:
- Envelope construction - Django creates envelopes with stage-appropriate bucket names and callback URLs
- Service accounts - each stage's validator job uses a stage-specific SA that only has access to its own buckets
- GCS bucket permissions - $GCP_APP_NAME-cloudrun-dev can only access $GCP_APP_NAME-storage-dev, not prod buckets
Deploy-time environment variables¶
The only env vars set at deploy time are for observability:
VALIDATOR_VERSION=<git_sha> # For version tracking in logs
VALIDIBOT_STAGE=dev # For log filtering (doesn't affect behavior)
These don't affect which buckets or URLs are used - that's all driven by the input envelope.
Implications¶
- One build, deploy everywhere: Build once with just validator-build energyplus, then deploy to any stage
- No secrets in containers: Validators use ADC (Application Default Credentials) from the attached service account
- Safe rollbacks: Rolling back a validator version doesn't affect stage isolation
Lost Callback Recovery¶
If a Cloud Run Job completes but its callback never reaches Django (network failure, container crash before POST, Cloud Run retry exhaustion), the run gets stuck in RUNNING status. The cleanup_stuck_runs management command handles this:
- The command runs every 10 minutes via Cloud Scheduler
- It finds runs stuck in RUNNING past the timeout threshold (default: 30 minutes)
- For GCP runs, it looks for execution_name in step_run.output (stored by the launcher at job trigger time)
- It queries the Cloud Run Jobs API to check the actual execution status
- If the job succeeded, it constructs a synthetic callback and processes it through ValidationCallbackService, recovering the full validation results
- If the job failed, it marks the run as FAILED with the Cloud Run error
- If the job is still running, it skips the run (no false timeout)
Where execution metadata is stored¶
When the GCP backend triggers a Cloud Run Job, the launcher stores execution metadata in result.stats:
stats = {
    "job_status": "PENDING",
    "job_name": "validibot-validator-energyplus",
    "execution_name": "projects/p/locations/r/jobs/j/executions/e",
    "input_uri": "gs://bucket/runs/org/run-id/input.json",
    "execution_bundle_uri": "gs://bucket/runs/org/run-id",
}
The step orchestrator persists these stats into step_run.output, making them available for later reconciliation.
The output envelope URI is derived from execution_bundle_uri + "/output.json".
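For illustration, a helper implementing this derivation (hypothetical name) might look like:

```python
def output_envelope_uri(execution_bundle_uri: str) -> str:
    """Derive the output envelope URI from the execution bundle URI,
    tolerating an optional trailing slash on the bundle URI."""
    return execution_bundle_uri.rstrip("/") + "/output.json"
```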
Local vs cloud storage¶
- Cloud: GCS URIs for envelopes/artifacts.
- Local dev/test: filesystem paths under MEDIA_ROOT (no GCS required).
Error handling¶
- Containers log all errors; fatal errors are optionally sent to Sentry if configured.
- User-facing messages stay minimal; detailed context stays in logs/Sentry.
- To inspect logs: open Cloud Logging and filter on resource.type="cloud_run_job" and resource.labels.job_name matching the validator. Fatal errors will include stack traces. If a Sentry DSN is present in the container, report_fatal will forward the exception there. (Sentry bootstrap for validator containers is planned; for now, errors always land in Cloud Logging.)
Docker Compose Mode: Docker Runner¶
For Docker Compose deployments (single-server, VPS, on-premise), validators run as Docker containers executed synchronously by the Celery worker.
How it works¶
sequenceDiagram
participant Worker as Celery Worker
participant Storage as Local Storage
participant Docker as Docker Daemon
participant Container as Validator Container
Worker->>Storage: Write input.json (file://)
Worker->>Docker: Run container (sync)
Docker->>Container: Start with VALIDIBOT_INPUT_URI
Container->>Storage: Read input.json
Container->>Container: Run validation
Container->>Storage: Write output.json
Container-->>Docker: Exit (code 0)
Docker-->>Worker: Container completed
Worker->>Storage: Read output.json
Worker->>Worker: Process results
Key differences from GCP mode:
- Synchronous execution: Worker blocks until container exits
- Local filesystem: Uses file:// URIs instead of gs://
- No callbacks: Results are read directly from storage after the container exits
- Docker socket: Worker needs access to /var/run/docker.sock
Configuration¶
Configure the Docker runner in Django settings:
# In settings or environment
VALIDATOR_RUNNER = "docker"
VALIDATOR_RUNNER_OPTIONS = {
    "memory_limit": "4g",      # Container memory limit
    "cpu_limit": "2.0",        # CPU limit (cores)
    "network": "validibot",    # Docker network for container
    "timeout_seconds": 3600,   # Max execution time (1 hour)
}
For Docker Compose deployments, also configure storage volume sharing:
# Storage volume (for Docker-in-Docker scenarios)
VALIDATOR_STORAGE_VOLUME = "validibot_local_storage"
VALIDATOR_STORAGE_MOUNT_PATH = "/app/storage"
DATA_STORAGE_ROOT = "/app/storage/private"
Environment Variables¶
All runners pass these standardized environment variables to containers:
| Variable | Description |
|---|---|
| VALIDIBOT_INPUT_URI | URI to input envelope (file:// or gs://) |
| VALIDIBOT_OUTPUT_URI | URI for output envelope (file:// or gs://) |
| VALIDIBOT_RUN_ID | Validation run ID (for logging and labeling) |
Building Containers for Docker Compose¶
# From validibot_validators directory
just build energyplus
just build fmu
just build-all
# Images are available locally as:
# $GCP_APP_NAME-validator-energyplus:latest
# $GCP_APP_NAME-validator-fmu:latest
Docker Compose Configuration¶
For local development with Docker Compose:
# docker-compose.local.yml
volumes:
  validibot_local_storage: # Shared storage for validation files

services:
  django:
    volumes:
      # Docker socket for spawning validator containers
      - /var/run/docker.sock:/var/run/docker.sock
      # Shared storage volume
      - validibot_local_storage:/app/storage
    environment:
      - VALIDATOR_RUNNER=docker
      - VALIDATOR_NETWORK=validibot_validibot
      - VALIDATOR_STORAGE_VOLUME=validibot_validibot_local_storage
      - VALIDATOR_STORAGE_MOUNT_PATH=/app/storage
      - DATA_STORAGE_ROOT=/app/storage/private
Validator Container Contract¶
Validator containers must support both storage backends:
- Accept the input URI via the VALIDIBOT_INPUT_URI environment variable
- Support file:// URIs in addition to gs:// URIs
- Write output to the URI specified by VALIDIBOT_OUTPUT_URI, or derived from the execution bundle
- Skip callbacks when skip_callback is set (sync mode doesn't need them)
See validibot_validators/validators/core/storage_client.py for the implementation that handles both URI schemes.
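A minimal sketch of such a dual-scheme client (function names hypothetical; the real storage_client.py presumably uses the GCS SDK for gs://, which is stubbed out here):

```python
from pathlib import Path
from urllib.parse import urlparse


def read_uri(uri: str) -> bytes:
    """Read bytes from a file:// URI; gs:// is left to the GCS SDK."""
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        return Path(parsed.path).read_bytes()
    if parsed.scheme == "gs":
        raise NotImplementedError("gs:// requires google-cloud-storage")
    raise ValueError(f"unsupported scheme: {parsed.scheme!r}")


def write_uri(uri: str, data: bytes) -> None:
    """Write bytes to a file:// URI, creating parent directories."""
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        path = Path(parsed.path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    else:
        raise NotImplementedError(f"write for {parsed.scheme!r} not sketched")
```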
Advanced Validator Management¶
Enable advanced validators by listing container images in settings:
# Environment variable
ADVANCED_VALIDATOR_IMAGES=ghcr.io/validibot/energyplus:24.2.0,ghcr.io/validibot/fmu:0.9.0
Then sync validators from container metadata:
# Sync from configured images (reads metadata from Docker labels)
python manage.py sync_validators
# Preview without creating (dry run)
python manage.py sync_validators --dry-run
# Sync specific image (ignores ADVANCED_VALIDATOR_IMAGES)
python manage.py sync_validators --image ghcr.io/validibot/energyplus:24.2.0
# Skip pulling images (if already present locally)
python manage.py sync_validators --no-pull
The command reads metadata from Docker labels (org.validibot.validator.metadata) and creates/updates Validator records. Validators removed from the settings are soft-deleted (set to DRAFT state).
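The label-driven sync can be sketched as follows; only the label key org.validibot.validator.metadata comes from this doc, while the metadata JSON shape (slug, name) is an assumption for illustration:

```python
import json

METADATA_LABEL = "org.validibot.validator.metadata"


def parse_validator_metadata(labels: dict[str, str]) -> dict:
    """Extract and sanity-check the validator metadata embedded in a
    container image's Docker labels."""
    raw = labels.get(METADATA_LABEL)
    if raw is None:
        raise ValueError(f"image is missing the {METADATA_LABEL} label")
    metadata = json.loads(raw)
    # Assumed minimal shape; the real command presumably validates more.
    for key in ("slug", "name"):
        if key not in metadata:
            raise ValueError(f"metadata missing required key: {key}")
    return metadata
```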
Container Cleanup (Ryuk Pattern)¶
The Docker runner labels all spawned containers for robust cleanup:
| Label | Purpose |
|---|---|
org.validibot.managed |
Identifies Validibot containers |
org.validibot.run_id |
Validation run ID |
org.validibot.validator |
Validator slug |
org.validibot.started_at |
ISO timestamp |
org.validibot.timeout_seconds |
Configured timeout |
Cleanup strategies:
- On-demand - Container removed after each run (normal path)
- Periodic sweep - Background task cleans up orphaned containers every 10 minutes
- Startup cleanup - Worker removes leftover containers on startup
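The sweep's decision logic can be sketched using the labels above (the string value "true" for org.validibot.managed and the grace period are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone


def is_orphaned(labels: dict[str, str], now: datetime,
                grace: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when a Validibot-labeled container has outlived its
    configured timeout plus a grace period and should be swept."""
    if labels.get("org.validibot.managed") != "true":
        return False  # never touch containers we didn't spawn
    started = datetime.fromisoformat(labels["org.validibot.started_at"])
    timeout = timedelta(seconds=int(labels["org.validibot.timeout_seconds"]))
    return now - started > timeout + grace
```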
Manual cleanup: