
Backup and recovery

Production procedure for backing up kneo-serv state, verifying restores, and rolling back a deployment. This page consolidates the operator surface; the underlying Python API and SQL commands stay in their respective references.

For the upgrade context that ends in "and keep a backup", see upgrade.md. For the Python backup API used by the SQLite maintenance helpers, see service_api.md § Backup and restore.

What needs to be preserved

| State | Where it lives | Backup mechanism |
| --- | --- | --- |
| Run state, queue, checkpoints, audit events, idempotency, locks, policies | PostgreSQL (KNEO_SERV_DATABASE_URL set) or SQLite at .kneo/kneo_runs.sqlite (default) | pg_dump / SQLite online-backup |
| Workflow continuations | PostgreSQL when set, otherwise files under .kneo/continuations/ | DB dump or filesystem backup |
| Spec bundles | Source repo + your CI artifacts (signed bundles) | Repo + artifact store |
| Artifacts (workflow outputs) | Filesystem paths declared by your specs | Filesystem backup |
| Logs | stdout via container log driver → log aggregator | Aggregator retention |

The DB is the load-bearing piece. Everything else can be reconstructed from the DB and your spec repo, with two exceptions: artifacts, which always live on the filesystem, and continuations when PostgreSQL is not configured.

PostgreSQL — production path

The Compose stack and any production deployment should set KNEO_SERV_DATABASE_URL. In that mode run state and workflow continuations live in PostgreSQL; spec bundles, artifacts, and logs stay where the table above places them.

Take a backup

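# $POSTGRES_USER and $POSTGRES_DB are expanded by the shell that runs this
# command, so they must be set there. -T disables TTY allocation so the
# piped dump is not altered on its way to gzip.
docker compose --env-file deploy/production.env exec -T db \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" \
  | gzip > "kneo_serv-$(date +%Y%m%d-%H%M).sql.gz"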

For a host-level Postgres install, run pg_dump directly as the postgres user; the dump format is the same.
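A dump that was cut short (full disk, interrupted pipe) still leaves a .sql.gz file behind. The sketch below is one way to catch that before the file is shipped off-host: it decompresses the whole stream and looks for the trailer comment that pg_dump writes at the end of a plain-format dump. The function name and the 20-line tail window are illustrative choices, not part of kneo-serv.

```python
import gzip
import zlib
from collections import deque
from pathlib import Path

# Comment that pg_dump writes near the end of a plain-format dump.
DUMP_TRAILER = "PostgreSQL database dump complete"


def dump_looks_complete(path: Path, tail_lines: int = 20) -> bool:
    """Return True if the gzip stream decompresses fully and ends with the trailer."""
    try:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            # Iterating over every line forces full decompression, so a
            # truncated archive raises instead of passing silently.
            tail = deque(fh, maxlen=tail_lines)
    except (OSError, EOFError, zlib.error):
        return False
    return any(DUMP_TRAILER in line for line in tail)
```

This only shows that the file is a complete dump, not that it restores cleanly; the restore drill described under "Verifying a restore" is still the real test.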

Restore from a backup (destructive — wipes current state)

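Stop the API container first so no in-flight write races the restore, and start it again once psql finishes. The dump above is plain SQL taken without --clean, so psql does not drop existing objects before loading. Restore into an empty database, or take the dump with pg_dump --clean --if-exists so existing tables are dropped and recreated; restoring over a populated database otherwise fails on the objects that already exist.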

docker compose --env-file deploy/production.env stop api
gunzip -c kneo_serv-YYYYmmDD-HHMM.sql.gz \
  | docker compose --env-file deploy/production.env exec -T db \
      psql -U "$POSTGRES_USER" "$POSTGRES_DB"
docker compose --env-file deploy/production.env start api

Off-site rotation

Local backups protect against operator error, not host loss. After each dump, copy the gzip off the host:

  • S3, Azure Blob, or GCS bucket with versioning + lifecycle to archive older dumps.
  • Encrypt at rest (server-side encryption is sufficient if your control plane is locked down; client-side encryption for stricter regimes).
  • Use separate IAM identities for upload-only and read access.
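Once dumps are copied off the host, local copies can be pruned. The sketch below selects prune candidates from the timestamp embedded in the kneo_serv-YYYYmmDD-HHMM.sql.gz filenames produced by the backup command. The function name and the keep_days parameter are hypothetical, and the sketch only selects names; deleting them, and confirming the off-site copy exists first, is left to the caller.

```python
import re
from datetime import datetime, timedelta

# Naming scheme from the backup command: kneo_serv-YYYYmmDD-HHMM.sql.gz
DUMP_NAME = re.compile(r"^kneo_serv-(\d{8}-\d{4})\.sql\.gz$")


def expired_dumps(names, now, keep_days):
    """Return dump filenames older than keep_days, oldest first."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in names:
        match = DUMP_NAME.match(name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        try:
            taken = datetime.strptime(match.group(1), "%Y%m%d-%H%M")
        except ValueError:
            continue  # digits that do not form a real date or time
        if taken < cutoff:
            expired.append((taken, name))
    return [name for _, name in sorted(expired)]
```

Files that do not match the pattern are never returned, so unrelated files in the backup directory are left alone.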

Data-only restore into a clean volume

For test-restore drills and disaster recovery into a fresh PostgreSQL volume, the service handles schema migrations on startup. Capture a data-only dump and exclude the schema_migrations rows so the new volume's migration state isn't overwritten:

docker compose --env-file deploy/production.env exec -T db \
  pg_dump -U "$POSTGRES_USER" -d "$POSTGRES_DB" --data-only --inserts \
  -f /tmp/kneo_serv_data.sql
docker cp <db-container-id>:/tmp/kneo_serv_data.sql /tmp/kneo_serv_data.sql
grep -v "INSERT INTO public.schema_migrations" /tmp/kneo_serv_data.sql \
  > /tmp/kneo_serv_data_restore.sql
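The grep step above can also be done in Python when a count of the removed rows is useful as a sanity check. This is a minimal sketch that assumes the dump is UTF-8 text produced with --inserts; filter_dump is a hypothetical helper, not something kneo-serv ships.

```python
# Prefix that pg_dump --inserts emits for rows of the migrations table.
MIGRATION_PREFIX = "INSERT INTO public.schema_migrations"


def filter_dump(src_path, dest_path):
    """Copy a dump, dropping schema_migrations rows. Returns (kept, dropped)."""
    kept = dropped = 0
    # newline="" preserves line endings exactly, including any that sit
    # inside multi-line string values.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dest_path, "w", encoding="utf-8", newline="") as dest:
        for line in src:
            if line.startswith(MIGRATION_PREFIX):
                dropped += 1
                continue
            dest.write(line)
            kept += 1
    return kept, dropped
```

Unlike grep -v, which drops any line containing the string, this matches only at the start of a line, so a row in another table whose text happens to mention schema_migrations is kept. A dropped count of zero suggests the dump was not taken with --inserts or the table lives in a different schema.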

Restore into a clean volume after the API has come up at least once (so migrations have run):

# down -v deletes the existing database volume.
docker compose --env-file deploy/production.env down -v
docker compose --env-file deploy/production.env up --build -d
# Wait until /readyz reports ready, so migrations have finished, before loading data.
docker cp /tmp/kneo_serv_data_restore.sql \
  <db-container-id>:/tmp/kneo_serv_data_restore.sql
docker compose --env-file deploy/production.env exec -T db \
  psql -v ON_ERROR_STOP=1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" \
  -f /tmp/kneo_serv_data_restore.sql
docker compose --env-file deploy/production.env restart api

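ON_ERROR_STOP=1 makes psql abort at the first failing statement instead of continuing past it, so a broken restore is visible immediately. Statements that ran before the failure stay committed; add --single-transaction to the psql call if the restore should be all-or-nothing.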

SQLite — single-host installs

When KNEO_SERV_DATABASE_URL is unset, run state lives in .kneo/kneo_runs.sqlite and continuations in .kneo/continuations/. The service ships an online backup helper:

from kneo_serv.maintenance import backup_sqlite_database, restore_sqlite_database

# Online — safe while the service is running
backup_sqlite_database(
    ".kneo/kneo_runs.sqlite",
    ".kneo/backups/kneo_runs-2026-05-12.sqlite",
)

# Restore into a new location, then swap into place during a window
restore_sqlite_database(
    ".kneo/backups/kneo_runs-2026-05-12.sqlite",
    ".kneo/kneo_runs.restored.sqlite",
)

backup_sqlite_database uses SQLite's backup() API and is safe to run against a live database. restore_sqlite_database is a plain file copy — stop the service before swapping the restored file into the live path, or you'll race the writer.

Also back up .kneo/continuations/ and any artifact paths your specs write to; these are not inside the SQLite file.
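To illustrate what "online" means here, the sketch below performs the same kind of copy with the standard-library sqlite3 module and then runs an integrity check on the result. It is not the kneo_serv.maintenance implementation, and sketch_online_backup is a hypothetical name; it only shows the SQLite backup API that the helper is described as using.

```python
import sqlite3
from pathlib import Path


def sketch_online_backup(src_path: str, dest_path: str) -> str:
    """Copy a SQLite database via the backup API and integrity-check the copy."""
    Path(dest_path).parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies page by page and tolerates concurrent
        # writers on the source database.
        src.backup(dest)
        # integrity_check returns a single row containing "ok" on success.
        result = dest.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dest.close()
        src.close()
    return result
```

Checking the copy rather than the live file keeps the verification off the service's database and confirms that the file you would actually restore from is readable.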

Backup frequency

There is no single recommended cadence. Tie it to your retention policy and your tolerance for re-running work:

| Workload shape | Cadence |
| --- | --- |
| Low run volume, short retention | Daily dump, 30-day retention |
| Active production, multi-day retention enabled (KNEO_SERV_RETENTION_*) | Hourly dump, 7-day retention; daily off-site copy |
| Audit-heavy compliance workloads | Hourly dump kept for the compliance window; verified test-restore monthly |

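The relevant env vars are in environment.md § Retention. Keep backup age inside the retention window: if retention prunes runs after 7 days, a backup older than 7 days holds only runs that retention removes again after the restore, so the restored set is effectively empty.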
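The relationship between backup age and the retention window can be checked mechanically before a restore. The sketch below assumes retention prunes by run age in whole days; the function name and parameters are illustrative and do not correspond to any kneo-serv setting.

```python
from datetime import datetime, timedelta


def backup_within_retention(backup_taken: datetime, now: datetime,
                            retention_days: int) -> bool:
    """True if the newest runs in the backup are still inside the retention window."""
    # Every run in the backup was created at or before backup_taken, so once
    # the backup itself is older than the window, all of its runs are too.
    return now - backup_taken < timedelta(days=retention_days)
```

For example, with a 7-day window a dump taken 3 days ago is still useful, while one taken 8 days ago is not.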

Verifying a restore

Backups are unproven until they have been restored. Verify on the schedule below, not after a real incident.

  1. Provision a scratch host or namespace and restore the backup into it.
  2. Start kneo-serv against the restored database.
  3. Verify dependencies:
    curl -sf http://127.0.0.1:8000/readyz | jq '.metadata.ready'   # → true
    
  4. Verify a known run survived:
    curl -sf "http://127.0.0.1:8000/v1/runs?limit=5" \
      -H "Authorization: Bearer $OP_TOKEN" | jq '.runs[].run_id'
    
  5. Run the deployment smoke against the restored stack (deployment_smoke.md). It exercises run create → fetch → cancel and confirms checkpoints persist.
  6. Verify audit events from before the backup are present:
    curl -sf "http://127.0.0.1:8000/v1/audit-events?limit=5" \
      -H "Authorization: Bearer $OP_TOKEN" | jq '.events[].event_type'
    

Recommended cadence: monthly restore drill into a scratch environment, plus a restore drill immediately before any major upgrade.
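The HTTP checks in steps 3, 4, and 6 above can be scripted so a drill produces a pass/fail result instead of relying on someone reading jq output. This is a minimal sketch: the endpoints and JSON fields are the ones used in the curl commands, while fetch_json, verify_restore, and the failure messages are hypothetical names chosen for illustration.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # scratch instance from the steps above


def fetch_json(path, token=None):
    """GET a path on the restored instance and decode the JSON body."""
    request = urllib.request.Request(BASE_URL + path)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    # urlopen raises HTTPError on 4xx/5xx, so a 503 from /readyz surfaces
    # as an exception, much like curl -f exiting non-zero.
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def verify_restore(token, fetch=fetch_json):
    """Return a list of failed checks; an empty list means every check passed."""
    failures = []
    if fetch("/readyz").get("metadata", {}).get("ready") is not True:
        failures.append("/readyz did not report ready")
    if not fetch("/v1/runs?limit=5", token).get("runs"):
        failures.append("no runs returned")
    if not fetch("/v1/audit-events?limit=5", token).get("events"):
        failures.append("no audit events returned")
    return failures
```

Call verify_restore with the operator token (the $OP_TOKEN value used in the curl commands) and treat a non-empty result as a failed drill. The fetch parameter exists so the checks can be exercised against a stub without a running service. The script does not replace the deployment smoke in step 5.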

Rolling back after a failed upgrade

Migrations are schema-forward and not safe to downgrade in place. If a new release misbehaves and the issue can't be patched forward:

  1. Stop the service. Quiesce writers; the proxy can keep returning 503 from /readyz until step 5.
    docker compose --env-file deploy/production.env stop api
    
  2. Restore persistence from the pre-upgrade backup using the PostgreSQL or SQLite procedure above.
  3. Re-install the previous version. Pin the image tag or reinstall the pip package at the prior version. Update Compose / Kubernetes manifests accordingly.
  4. Restart the service.
    docker compose --env-file deploy/production.env start api
    
  5. Verify with curl /readyz and the deployment smoke (deployment_smoke.md).

Keep the pre-upgrade backup until you have verified the new version through at least one business cycle.

Disaster recovery checklist

| Scenario | Recovery |
| --- | --- |
| Lost the host, database survived | Provision new host → install kneo-serv → point KNEO_SERV_DATABASE_URL at the surviving DB → start. |
| Lost the database | Provision DB → restore latest dump → start service → verify /readyz and a known run. |
| Lost host and database | Provision DB → restore latest off-site dump → provision host → start service → verify. |
| Corrupted checkpoints for one run | Use GET /v1/runs/{run_id}/checkpoints/diff to identify the bad checkpoint; cancel and re-run from the last good step. The DB itself is fine. |
| Restore brought back stale data, signs of mismatch | See troubleshooting.md § 2.5 for the recovery shape. |

What this page does not cover

  • Performance and capacity sizing. Deferred until the benchmark suite ships — see TODO-docs.md § Performance and capacity guide.
  • The Python backup API surface. Stays in service_api.md § Backup and restore.
  • Release-team verification gates for the GA cut. Those live in release_checklist.md.