Execution Backends¶
Validibot supports multiple deployment targets through an abstracted execution backend system. This document describes how the platform orchestrates advanced validator containers across different infrastructure.
For the container interface that validators must implement, see Advanced Validator Container Interface.
Overview¶
The execution layer sits between the validator and the infrastructure:
```
Validator → ExecutionBackend → Infrastructure
                    ↓
    ┌───────────────┼───────────────┐
    ↓               ↓               ↓
DockerComposeBackend   GCPBackend       AWSBackend
(Docker socket)        (Cloud Run+GCS)  (future)
```
Each backend handles:
- Preparing input data (uploading envelopes to storage)
- Launching validator containers
- Collecting results (synchronously or via callbacks)
- Cleaning up resources
Backend Selection¶
The backend is selected via the VALIDATOR_RUNNER setting:
| Setting Value | Backend | Execution Model |
|---|---|---|
| `"docker"` | `DockerComposeExecutionBackend` | Synchronous |
| `"google_cloud_run"` | `GCPExecutionBackend` | Asynchronous |
If VALIDATOR_RUNNER is not set, the system auto-detects:
- If `GCP_PROJECT_ID` is set → uses the GCP backend
- Otherwise → uses the Docker backend (Docker Compose)
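The selection logic above can be sketched as a small resolver. This is a hedged sketch of the documented behavior; the real logic lives in `registry.py`, and the function name here is hypothetical.

```python
# Hypothetical sketch of backend selection; the real implementation lives in
# validibot/validations/services/execution/registry.py.
def select_backend_name(settings: dict) -> str:
    """Resolve the backend name from settings, with auto-detection."""
    runner = settings.get("VALIDATOR_RUNNER")
    if runner:                          # explicit setting always wins
        return runner
    if settings.get("GCP_PROJECT_ID"):  # auto-detect GCP deployments
        return "google_cloud_run"
    return "docker"                     # default: Docker Compose
```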
Execution Models¶
Synchronous (Docker Compose)¶
Used for Docker Compose deployments where validators run as local Docker containers.
1. Validator calls `backend.execute(request)`
2. Backend writes input envelope to local storage (`file://` URI)
3. Backend spawns Docker container and waits for completion
4. Backend reads output envelope from local storage
5. Returns complete `ExecutionResponse` with results
Characteristics:
- Blocking call — validation completes before returning
- Simple deployment — just Docker and shared volumes
- Resource limits enforced via Docker
- Container cleanup handled by labels (Ryuk pattern)
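The five synchronous steps can be sketched minimally as follows. The storage layout and the `run_container` callable are stand-ins for the real runner, not Validibot APIs.

```python
# Minimal sketch of the synchronous flow: write input, run the container,
# read output. run_container is a stand-in for the Docker runner.
import json
import tempfile
from pathlib import Path


def execute_sync(input_envelope: dict, run_container) -> dict:
    workdir = Path(tempfile.mkdtemp())
    in_path = workdir / "input.json"
    out_path = workdir / "output.json"
    in_path.write_text(json.dumps(input_envelope))            # step 2: write input envelope
    run_container(f"file://{in_path}", f"file://{out_path}")  # step 3: spawn and wait
    return json.loads(out_path.read_text())                   # steps 4-5: read and return output
```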
Asynchronous (GCP Cloud Run)¶
Used for GCP deployments where validators run as Cloud Run Jobs.
1. Validator calls `backend.execute(request)`
2. Backend uploads input envelope to GCS (`gs://` URI)
3. Backend triggers Cloud Run Job (non-blocking)
4. Returns `ExecutionResponse` with `is_complete=False`
5. Container POSTs callback to Django when complete
6. Callback handler loads output envelope from GCS
Characteristics:
- Non-blocking — validation runs in background
- Scalable — Cloud Run handles concurrency
- Callback-based — results arrive via authenticated HTTP POST
- IAM-secured — no shared secrets, Google-signed ID tokens
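The asynchronous hand-off (steps 1–4) can be sketched as below; the `upload_to_gcs` and `trigger_job` callables are stand-ins, and the returned dict loosely mirrors an `ExecutionResponse` with `is_complete=False`.

```python
# Minimal sketch of the asynchronous hand-off; callables are stand-ins for
# the GCS client and the Cloud Run Jobs trigger.
def execute_async(input_envelope: dict, upload_to_gcs, trigger_job) -> dict:
    gs_uri = upload_to_gcs(input_envelope)  # step 2: input envelope to GCS
    execution_id = trigger_job(gs_uri)      # step 3: non-blocking job trigger
    # Steps 5-6 (callback POST and output loading) happen later, out of band.
    return {"is_complete": False, "execution_id": execution_id}  # step 4
```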
Two-Layer Architecture¶
The execution system uses a two-layer architecture:
```
ExecutionBackend (high-level orchestration)
├── Storage management (upload/download envelopes)
├── Envelope building (input envelope construction)
├── Status checking (check_status() for reconciliation)
└── Delegates to → ValidatorRunner (low-level container execution)
    ├── Container spawn/wait/remove
    ├── Security hardening (cap_drop, read_only, etc.)
    ├── Container labeling (Ryuk pattern)
    └── Container cleanup (orphan sweep, startup cleanup)
```
Why two layers?
- ExecutionBackend handles orchestration: it knows about storage URIs, envelopes, and the callback protocol. It doesn't know how containers are spawned.
- ValidatorRunner handles container lifecycle: it knows about Docker APIs and Cloud Run Jobs. It doesn't know about envelopes or callbacks.
This separation means new deployment targets only need a new runner (for container execution) and a new backend (for storage integration), without duplicating orchestration logic.
| Layer | Docker Compose | GCP |
|---|---|---|
| Backend | `DockerComposeExecutionBackend` | `GCPExecutionBackend` |
| Runner | `DockerValidatorRunner` | `GoogleCloudRunValidatorRunner` |
| Storage | Local filesystem (`file://`) | GCS (`gs://`) |
| Execution | Sync (blocking) | Async (callback) |
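The separation can be made concrete with a toy sketch: the backend handles storage URIs and envelopes, while the runner only executes containers. Everything here except the `ExecutionBackend`/`ValidatorRunner` division of labor is an assumption for illustration.

```python
# Toy sketch of the two-layer split. FakeStorage and EchoRunner are
# stand-ins; only the division of responsibilities mirrors the document.
class FakeStorage:
    def __init__(self):
        self._blobs = {}

    def put(self, uri: str, envelope: dict) -> None:
        self._blobs[uri] = envelope

    def get(self, uri: str) -> dict:
        return self._blobs[uri]


class EchoRunner:
    """Stand-in runner: 'executes' a container that copies input to output."""

    def run(self, image: str, input_uri: str, output_uri: str, storage) -> int:
        storage.put(output_uri, storage.get(input_uri))  # pretend the container ran
        return 0


class SketchBackend:
    def __init__(self, runner, storage):
        self.runner = runner
        self.storage = storage

    def execute(self, envelope: dict) -> dict:
        self.storage.put("file:///input", envelope)         # backend: storage concern
        self.runner.run("validator:latest", "file:///input",
                        "file:///output", self.storage)     # runner: container concern
        return self.storage.get("file:///output")           # backend: collect results
```

Swapping in a different runner (or storage) changes the deployment target without touching orchestration, which is the point of the split.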
Code Location¶
```
validibot/validations/services/
├── execution/                  # Backend layer (high-level)
│   ├── __init__.py             # Exports get_execution_backend()
│   ├── base.py                 # ExecutionBackend ABC, ExecutionRequest, ExecutionResponse
│   ├── docker_compose.py       # DockerComposeExecutionBackend
│   ├── gcp.py                  # GCPExecutionBackend
│   └── registry.py             # Backend selection and caching
├── runners/                    # Runner layer (low-level)
│   ├── __init__.py             # Exports get_validator_runner()
│   ├── base.py                 # ValidatorRunner ABC, ExecutionStatus, ExecutionResult
│   ├── docker.py               # DockerValidatorRunner (labels, security, cleanup)
│   └── google_cloud_run.py     # GoogleCloudRunValidatorRunner
└── validation_callback.py      # Callback processing (for async backends)
```
Usage in Validators¶
```python
from validibot.validations.services.execution import get_execution_backend
from validibot.validations.services.execution.base import ExecutionRequest

backend = get_execution_backend()
request = ExecutionRequest(
    run=validation_run,
    validator=validator,
    submission=submission,
    step=workflow_step,
)
response = backend.execute(request)

if backend.is_async:
    # Results will arrive via callback
    return ValidationResult(passed=None, issues=[], stats={...})
else:
    # Results available immediately
    return process_output_envelope(response.output_envelope)
```
Docker Compose Backend Details¶
Architecture¶
```
┌──────────────────────────────────────────────────────────────────┐
│ Docker Host                                                      │
│                                                                  │
│  ┌──────────────────┐      ┌─────────────────────────────────┐   │
│  │ Django + Worker  │      │ Validator Container             │   │
│  │                  │      │ ($GCP_APP_NAME-validator-X)     │   │
│  │ - Web app        │───▶  │                                 │   │
│  │ - Celery         │      │ Reads:  file:///input           │   │
│  │                  │◀───  │ Writes: file:///output          │   │
│  └──────────────────┘      └─────────────────────────────────┘   │
│          │                              │                        │
│          ▼                              ▼                        │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ Shared Storage Volume                                    │    │
│  │ /app/storage (Docker volume)                             │    │
│  └──────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘
```
Configuration¶
```python
# config/settings/production.py (when DEPLOYMENT_TARGET=docker_compose)
VALIDATOR_RUNNER = "docker"
VALIDATOR_RUNNER_OPTIONS = {
    "memory_limit": "4g",
    "cpu_limit": "2.0",
    "network": None,  # None = no network (default, most secure)
    "timeout_seconds": 3600,
}

# Container images
VALIDATOR_IMAGE_TAG = "latest"
VALIDATOR_IMAGE_REGISTRY = ""  # Or your private registry
```
Network Isolation (Security)¶
By default, advanced validator containers run with no network access (network_mode='none'). This is the most secure configuration because:
- Containers cannot reach other services (web, database, redis)
- Containers cannot access the internet
- All I/O happens via the shared storage volume
This works because:
- Input files are written to the shared volume before the container starts
- The container reads inputs and writes outputs to the same volume
- The worker reads the output after the container exits
When to enable network access:
Set VALIDATOR_NETWORK only if advanced validators need to:
- Download files from external URLs during execution
- Call external APIs as part of validation logic
```yaml
# In docker-compose.*.yml, uncomment to enable network:
environment:
  - VALIDATOR_NETWORK=validibot_validibot
```
With network enabled, advanced validator containers can reach:
- Other containers on the same Docker network
- External internet (if the host has connectivity)
Compose Project Naming Requirements¶
The Docker Compose backend requires specific naming for networks and volumes. By default, Docker Compose prefixes resource names with the project name (derived from the directory name or COMPOSE_PROJECT_NAME).
The shipped compose files assume COMPOSE_PROJECT_NAME=validibot, which creates:
| Resource | Full Name |
|---|---|
| Network | validibot_validibot |
| Storage Volume | validibot_validibot_storage (production) |
| Storage Volume | validibot_validibot_local_storage (local) |
These names are configured in the compose files via environment variables:
```yaml
environment:
  - VALIDATOR_NETWORK=validibot_validibot
  - VALIDATOR_STORAGE_VOLUME=validibot_validibot_storage
```
If you change the project name (via COMPOSE_PROJECT_NAME or running from a different directory), you must update these environment variables to match. Otherwise, the worker cannot attach advanced validator containers to the correct network or volume.
To check your current project name:
```shell
# The project name is the prefix before the underscore in container names
docker compose -f docker-compose.production.yml ps --format "{{.Name}}"
# Example output: validibot_web_1 → project name is "validibot"
```
To override explicitly:
```shell
# Set project name explicitly
COMPOSE_PROJECT_NAME=validibot docker compose -f docker-compose.production.yml up -d
```
Private Registry Authentication¶
By default, validator images are pulled from Docker Hub. If you're using a private registry (GitHub Container Registry, AWS ECR, Google Artifact Registry, etc.), you need to configure Docker credentials on the host.
Option 1: Docker login on the host
```shell
# Log in to your registry on the Docker host
docker login ghcr.io -u USERNAME -p TOKEN

# Or for AWS ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
```
The Docker daemon stores credentials in ~/.docker/config.json and uses them for pulls. Since the worker spawns containers via the host's Docker socket, these credentials apply to validator image pulls automatically.
Option 2: Credential helpers
For registries that support credential helpers, configure the helper in the host's ~/.docker/config.json; the daemon then refreshes credentials automatically for pulls.
Image naming:
Configure the validator image registry in your environment:
```shell
# .envs/.production/.docker-compose/.django
VALIDATOR_IMAGE_REGISTRY=ghcr.io/your-org
VALIDATOR_IMAGE_TAG=v1.2.0
```
Images are pulled as `{VALIDATOR_IMAGE_REGISTRY}/$GCP_APP_NAME-validator-{type}:{tag}`. For example:

- `ghcr.io/your-org/$GCP_APP_NAME-validator-energyplus:v1.2.0`
- `ghcr.io/your-org/$GCP_APP_NAME-validator-fmu:v1.2.0`
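The naming pattern can be expressed as a small helper. This is a hypothetical sketch of the documented pattern; `validator_image` is not a real Validibot function.

```python
# Hypothetical helper mirroring the documented image-name pattern
# {registry}/{app_name}-validator-{type}:{tag}; not part of the codebase.
def validator_image(registry: str, app_name: str, validator_type: str, tag: str) -> str:
    name = f"{app_name}-validator-{validator_type}:{tag}"
    return f"{registry}/{name}" if registry else name  # empty registry = Docker Hub default
```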
Image availability:
The Docker backend does not automatically pull images. Ensure validator images are available before running validations:
```shell
# Pre-pull images on the host
docker pull ghcr.io/your-org/$GCP_APP_NAME-validator-energyplus:v1.2.0
```
Or configure a pull policy by extending the runner options if automatic pulls are needed.
Container Management¶
Validator containers are labeled for identification and cleanup:
```
org.validibot.managed=true
org.validibot.run_id=<run-id>
org.validibot.validator=<slug>
org.validibot.started_at=<iso>
org.validibot.timeout_seconds=N
```
Cleanup strategies:
- On-demand — Container removed after run completes
- Periodic sweep — Background task every 10 minutes
- Startup cleanup — Worker removes leftover containers on start
Management command:
```shell
# Show what would be cleaned up
python manage.py cleanup_containers --dry-run

# Remove orphaned containers
python manage.py cleanup_containers

# Remove ALL managed containers
python manage.py cleanup_containers --all
```
GCP Backend Details¶
For detailed GCP architecture including Cloud Run Jobs, IAM configuration, and callback flow, see:
- Validator Containers (Cloud Run) — Job execution and callbacks
- GCP Deployment — Service deployment
- IAM & Service Accounts — Security configuration
Key Concepts¶
Web/Worker Split:
- `$GCP_APP_NAME-web` — public UI and API
- `$GCP_APP_NAME-worker` — private; receives callbacks from validator jobs
Callback Authentication:
- Validator jobs use Google-signed ID tokens
- Worker service requires IAM authentication
- No shared secrets in envelopes
Storage:
- Input/output envelopes stored in GCS
- URIs use the `gs://` scheme
- Service accounts need appropriate storage permissions
Status Checking¶
The ExecutionBackend base class provides a check_status() method for querying the state of a running or completed execution:
```python
def check_status(self, execution_id: str) -> ExecutionResponse | None:
    """Check execution status. Returns None if not supported."""
    return None
```
| Backend | Behavior |
|---|---|
| `DockerComposeExecutionBackend` | Queries the Docker daemon for container state. Primarily for debugging (sync execution already returns results). |
| `GCPExecutionBackend` | Queries the Cloud Run Jobs API for execution state. Used by reconciliation to recover lost callbacks. |
This method is not abstract — backends that don't need status checking (sync backends) can keep the default `None` return.
Container Cleanup¶
Container lifecycle management happens at the runner layer, not the backend layer:
Docker Compose (three strategies)¶
- Immediate cleanup — `container.remove(force=True)` in the runner's `finally` block after every execution
- Periodic sweep — `cleanup_orphaned_containers()` runs via Celery Beat every 10 minutes and removes containers past timeout + grace period
- Startup cleanup — `cleanup_all_managed_containers()` runs in `AppConfig.ready()` and removes all labeled containers from the previous worker incarnation
All strategies use Docker container labels (`org.validibot.managed`, `org.validibot.run_id`, etc.) for identification.
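The periodic sweep's selection logic can be sketched from the labels above. This is an illustrative sketch: the container dicts and the five-minute grace period are assumptions, not the real Docker SDK objects or Validibot defaults.

```python
# Illustrative sketch of the orphan sweep: select managed containers that
# have run past timeout plus a grace period. Container dicts stand in for
# Docker SDK container objects; the grace period is an assumed value.
from datetime import datetime, timedelta, timezone

GRACE = timedelta(minutes=5)  # assumed grace period


def find_orphans(containers, now=None):
    now = now or datetime.now(timezone.utc)
    orphans = []
    for c in containers:
        labels = c["labels"]
        if labels.get("org.validibot.managed") != "true":
            continue  # only touch containers we own
        started = datetime.fromisoformat(labels["org.validibot.started_at"])
        timeout = timedelta(seconds=int(labels["org.validibot.timeout_seconds"]))
        if now - started > timeout + GRACE:
            orphans.append(c)  # past deadline: candidate for removal
    return orphans
```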
GCP Cloud Run¶
Cloud Run Jobs are ephemeral — there's nothing to clean up at the container level. Error recovery is handled by the reconciliation system (see below).
Error Recovery¶
Lost Callback Recovery (GCP)¶
If a Cloud Run Job completes but its callback never reaches Django (network failure, container crash before POST), the cleanup_stuck_runs management command attempts reconciliation:
1. Finds runs stuck in `RUNNING` status past the timeout threshold
2. For GCP runs, checks `step_run.output` for `execution_name` metadata
3. Queries the Cloud Run Jobs API via `GCPExecutionBackend.check_status()`
4. Based on the result:
    - Still running: skips the run (legitimately in progress)
    - Succeeded: constructs a synthetic callback and processes it through `ValidationCallbackService` (reusing existing idempotency, finding persistence, and assertion evaluation)
    - Failed: marks the run as `FAILED` with the Cloud Run error message
    - API error: falls through to simple `TIMED_OUT` marking
This reconciliation runs automatically when cleanup_stuck_runs is scheduled (typically every 10 minutes via Cloud Scheduler).
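The reconciliation decision can be sketched as a pure function. This is a hedged sketch: the real command works with `ExecutionResponse` objects and run state transitions, while here `check_status` returns a plain state string and the outcome strings are illustrative.

```python
# Hypothetical sketch of the reconciliation decision tree; state strings
# and outcome names are illustrative, not the real command's API.
def reconcile_decision(execution_name, check_status) -> str:
    if not execution_name:
        return "timed_out"            # no metadata: fall back to timeout marking
    try:
        state = check_status(execution_name)
    except Exception:
        return "timed_out"            # API error: fall through to TIMED_OUT
    if state is None or state == "running":
        return "skip"                 # still running (or unsupported): leave it
    if state == "succeeded":
        return "synthetic_callback"   # replay through ValidationCallbackService
    return "failed"                   # mark FAILED with the Cloud Run error
```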
Stuck Run Timeout (All Backends)¶
For runs where reconciliation is not possible (non-GCP, no execution metadata, API errors), the command marks them as TIMED_OUT after the configured threshold (default: 30 minutes).
```shell
# Manual invocation
python manage.py cleanup_stuck_runs
python manage.py cleanup_stuck_runs --timeout-minutes 60
python manage.py cleanup_stuck_runs --dry-run
```
Adding a New Backend¶
To support a new deployment target (e.g., AWS):
1. Create `validibot/validations/services/execution/aws.py`
2. Implement the `ExecutionBackend` abstract class
3. Register it in `registry.py`
4. Add the setting value to the backend selection logic
```python
# aws.py
from .base import ExecutionBackend, ExecutionRequest, ExecutionResponse


class AWSExecutionBackend(ExecutionBackend):
    is_async = True  # or False for synchronous

    def execute(self, request: ExecutionRequest) -> ExecutionResponse:
        # Upload envelope to S3
        # Trigger ECS task or Lambda
        # Return response
        ...
```