Backup and recovery¶

Production procedure for backing up kneo-serv state, verifying restores, and rolling back a deployment. This page consolidates the operator surface; the underlying Python API and SQL commands stay in their respective references.

For the upgrade context that ends in "and keep a backup", see upgrade.md. For the Python backup API used by the SQLite maintenance helpers, see service_api.md § Backup and restore.

What needs to be preserved¶

| State | Where it lives | Backup mechanism |
| --- | --- | --- |
| Run state, queue, checkpoints, audit events, idempotency, locks, policies | PostgreSQL (`KNEO_SERV_DATABASE_URL` set) or SQLite at `.kneo/kneo_runs.sqlite` (default) | `pg_dump` / SQLite online-backup |
| Workflow continuations | PostgreSQL when set, otherwise files under `.kneo/continuations/` | DB dump or filesystem backup |
| Spec bundles | Source repo + your CI artifacts (signed bundles) | Repo + artifact store |
| Artifacts (workflow outputs) | Filesystem paths declared by your specs | Filesystem backup |
| Logs | stdout via container log driver → log aggregator | Aggregator retention |

The DB is the load-bearing piece. Everything else can be reconstructed from the DB and your spec repo, with two exceptions: artifacts, which always live on the filesystem, and continuations, which are plain files when PostgreSQL is not configured.

PostgreSQL — production path¶

The Compose stack and any production deployment should set KNEO_SERV_DATABASE_URL. In that mode run state and workflow continuations live in PostgreSQL; artifacts stay on the filesystem, and spec bundles and logs stay in their own stores.

Take a backup¶

docker compose --env-file deploy/production.env exec -T db \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" \
  | gzip > "kneo_serv-$(date +%Y%m%d-%H%M).sql.gz"

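For a host-level Postgres install, run pg_dump directly as the postgres user; the dump format is the same. In the Compose form, "$POSTGRES_USER" and "$POSTGRES_DB" are expanded by your host shell before the command reaches the container, so export them in that shell first.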
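
Timestamped dumps accumulate quickly on the host. A minimal sketch of local rotation, assuming the dumps sit in one directory and follow the kneo_serv-YYYYmmDD-HHMM.sql.gz naming used above. The helper name dumps_to_prune and the 7-day default are illustrative and not part of kneo-serv.

```python
import re
from datetime import datetime, timedelta

# Matches the naming produced by the pg_dump command above:
# kneo_serv-YYYYmmDD-HHMM.sql.gz
DUMP_NAME = re.compile(r"^kneo_serv-(\d{8}-\d{4})\.sql\.gz$")


def dumps_to_prune(filenames, now, keep_days=7):
    """Return dump filenames whose embedded timestamp is older than keep_days.

    Hypothetical helper: names that do not match the pattern are never
    selected, so unrelated files in the directory are left alone.
    """
    cutoff = now - timedelta(days=keep_days)
    stale = []
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match is None:
            continue
        taken_at = datetime.strptime(match.group(1), "%Y%m%d-%H%M")
        if taken_at < cutoff:
            stale.append(name)
    return sorted(stale)
```

Feed it the directory listing and delete what it returns only after the off-site copy of those dumps has been confirmed.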

Restore from a backup (destructive — wipes current state)¶

gunzip -c kneo_serv-YYYYmmDD-HHMM.sql.gz \
  | docker compose --env-file deploy/production.env exec -T db \
      psql -U "$POSTGRES_USER" "$POSTGRES_DB"

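The restore is meant to replace the current database contents. A plain pg_dump script only drops existing objects when the dump was taken with --clean (optionally with --if-exists); without it, restore into an empty database, or existing tables will produce errors instead of being overwritten. Stop the API container first so no in-flight write races the restore: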

docker compose --env-file deploy/production.env stop api
gunzip -c kneo_serv-YYYYmmDD-HHMM.sql.gz \
  | docker compose --env-file deploy/production.env exec -T db \
      psql -U "$POSTGRES_USER" "$POSTGRES_DB"
docker compose --env-file deploy/production.env start api
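
Because the restore takes the API down, it can help to confirm that the dump decompresses cleanly before stopping anything. A minimal sketch using only the Python standard library; check_dump is a hypothetical helper, and a clean decompress only shows that the archive is intact, not that the SQL inside is complete.

```python
import gzip
import zlib


def check_dump(path, chunk_size=1024 * 1024):
    """Decompress a gzip dump end to end and return its uncompressed size.

    Raises ValueError if the archive is empty, truncated, or not gzip at all.
    """
    total = 0
    try:
        with gzip.open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError, zlib.error) as exc:
        raise ValueError(f"{path} is not a readable gzip archive: {exc}") from exc
    if total == 0:
        raise ValueError(f"{path} decompressed to zero bytes")
    return total
```

Running this against the chosen dump before the stop command keeps a corrupt file from turning a planned restore into an outage.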

Off-site rotation¶

Local backups protect against operator error, not host loss. After each dump, copy the gzip off the host:

Use an S3, Azure Blob, or GCS bucket with versioning enabled and a lifecycle rule that archives older dumps.
Encrypt at rest. Server-side encryption is sufficient if your control plane is locked down; use client-side encryption for stricter regimes.
Use separate IAM identities for upload and for read, and give the uploader write-only access.
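
One way to make the off-site copy verifiable is to record a SHA-256 digest next to each dump before uploading, then compare it after any download. This is a sketch with hypothetical helper names; the upload itself is left to whatever tooling your bucket provider offers.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large dumps never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(path):
    """Write '<digest>  <filename>' to <path>.sha256 and return the sidecar path.

    The two-space layout is the one `sha256sum -c` reads.
    """
    sidecar = path + ".sha256"
    with open(sidecar, "w", encoding="utf-8") as handle:
        handle.write(f"{sha256_of(path)}  {os.path.basename(path)}\n")
    return sidecar


def verify_sidecar(path):
    """Return True if the file still matches the digest recorded in its sidecar."""
    with open(path + ".sha256", encoding="utf-8") as handle:
        recorded = handle.read().split()[0]
    return recorded == sha256_of(path)
```

Upload the sidecar together with the dump so a restore drill can check integrity without trusting the bucket's own metadata.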

Data-only restore into a clean volume¶

For test-restore drills and disaster recovery into a fresh PostgreSQL volume, the service handles schema migrations on startup. Capture a data-only dump and exclude the schema_migrations rows so the new volume's migration state isn't overwritten:

docker compose --env-file deploy/production.env exec -T db \
  pg_dump -U "$POSTGRES_USER" -d "$POSTGRES_DB" --data-only --inserts \
  -f /tmp/kneo_serv_data.sql
docker cp <db-container-id>:/tmp/kneo_serv_data.sql /tmp/kneo_serv_data.sql
grep -v "INSERT INTO public.schema_migrations" /tmp/kneo_serv_data.sql \
  > /tmp/kneo_serv_data_restore.sql

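Restore into a clean volume after the API has come up at least once (so migrations have run). Note that down -v deletes the existing database volume, so run this only where the current data is disposable or already lost, and wait for /readyz to report ready after up before loading the data: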

docker compose --env-file deploy/production.env down -v
docker compose --env-file deploy/production.env up --build -d
docker cp /tmp/kneo_serv_data_restore.sql \
  <db-container-id>:/tmp/kneo_serv_data_restore.sql
docker compose --env-file deploy/production.env exec -T db \
  psql -v ON_ERROR_STOP=1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" \
  -f /tmp/kneo_serv_data_restore.sql
docker compose --env-file deploy/production.env restart api

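ON_ERROR_STOP=1 makes psql stop at the first failing statement instead of continuing past errors. Statements that ran before the failure stay committed, so treat a failed run as partial state: recreate the volume and retry, or add --single-transaction to the psql invocation to make the load all-or-nothing.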
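
The grep step in the capture commands above can also be written as a small streaming filter, which is convenient when the drill is automated from Python. A sketch under the same assumption as the shell version: the dump was taken with --inserts, so each schema_migrations row is a statement that starts with INSERT INTO public.schema_migrations. The function name is illustrative. Unlike grep -v, it matches only at the start of a line, so a data row that merely contains that text is kept.

```python
SKIP_PREFIX = "INSERT INTO public.schema_migrations"


def strip_migration_rows(src_path, dst_path, skip_prefix=SKIP_PREFIX):
    """Copy a data-only dump line by line, dropping schema_migrations inserts.

    Returns the number of lines dropped so a drill can assert it was non-zero.
    """
    dropped = 0
    # newline="" keeps the dump's original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            if line.startswith(skip_prefix):
                dropped += 1
                continue
            dst.write(line)
    return dropped
```

A return value of zero is worth treating as a warning: it usually means the dump was not taken with --inserts or the table lives in a different schema.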

SQLite — single-host installs¶

When KNEO_SERV_DATABASE_URL is unset, run state lives in .kneo/kneo_runs.sqlite and continuations in .kneo/continuations/. The service ships an online backup helper:

from kneo_serv.maintenance import backup_sqlite_database, restore_sqlite_database

# Online — safe while the service is running
backup_sqlite_database(
    ".kneo/kneo_runs.sqlite",
    ".kneo/backups/kneo_runs-2026-05-12.sqlite",
)

# Restore into a new location, then swap into place during a window
restore_sqlite_database(
    ".kneo/backups/kneo_runs-2026-05-12.sqlite",
    ".kneo/kneo_runs.restored.sqlite",
)

backup_sqlite_database uses SQLite's backup() API and is safe to run against a live database. restore_sqlite_database is a plain file copy — stop the service before swapping the restored file into the live path, or you'll race the writer.

Also back up .kneo/continuations/ and any artifact paths your specs write to; these are not inside the SQLite file.
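
For readers who want to see what an online backup looks like at the SQLite level, here is a sketch using the standard library's sqlite3.Connection.backup. It illustrates the technique only: whether backup_sqlite_database is implemented this way is an assumption, and online_backup and integrity_ok are hypothetical names.

```python
import sqlite3


def online_backup(src_path, dst_path, pages_per_step=256):
    """Copy a live SQLite database with the online backup API.

    Copying a limited number of pages per step lets other connections keep
    using the source database between steps.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst, pages=pages_per_step)
    finally:
        dst.close()
        src.close()


def integrity_ok(path):
    """Run PRAGMA integrity_check on a backup; SQLite reports 'ok' when clean."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()
```

Running an integrity check on each backup file is a cheap first filter before the full restore drill described later on this page.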

Backup frequency¶

There is no single recommended cadence. Tie it to your retention policy and your tolerance for re-running work:

| Workload shape | Cadence |
| --- | --- |
| Low run volume, short retention | Daily dump, 30-day retention |
| Active production, multi-day retention enabled (`KNEO_SERV_RETENTION_*`) | Hourly dump, 7-day retention; daily off-site copy |
| Audit-heavy compliance workloads | Per-hour dump kept for the compliance window; verified test-restore monthly |

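The relevant env vars are in environment.md § Retention. Keep the newest backup younger than the retention window: if retention prunes runs after 7 days, a backup older than 7 days restores only runs that the pruner would delete straight away, so the usable restore set is empty.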
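
The constraint between backup age and retention can be checked mechanically, for example in a monitoring job. A sketch, assuming retention prunes by run age and that you know the prune window in days and the timestamp of the newest backup. The function and parameter names are illustrative, and nothing here reads the KNEO_SERV_RETENTION_* variables.

```python
from datetime import datetime, timedelta


def backup_is_useful(newest_backup_at, now, retention_days):
    """True if restoring the newest backup would still yield unpruned runs.

    A backup taken more than retention_days ago holds only runs that the
    retention job would prune right after the restore.
    """
    return now - newest_backup_at < timedelta(days=retention_days)
```

For example, with a 7-day window a backup from two days ago passes, while one from eleven days ago does not.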

Verifying a restore¶

Backups are unproven until they have been restored. Verify on the schedule below rather than discovering a bad backup during a real incident.

1. Provision a scratch host or namespace and restore the backup into it.
2. Start kneo-serv against the restored database.
3. Verify dependencies:

   curl -sf http://127.0.0.1:8000/readyz | jq '.metadata.ready'   # → true

4. Verify a known run survived:

   curl -sf "http://127.0.0.1:8000/v1/runs?limit=5" \
     -H "Authorization: Bearer $OP_TOKEN" | jq '.runs[].run_id'

5. Run the deployment smoke against the restored stack (deployment_smoke.md). It exercises run create → fetch → cancel and confirms checkpoints persist.
6. Verify audit events from before the backup are present:

   curl -sf "http://127.0.0.1:8000/v1/audit-events?limit=5" \
     -H "Authorization: Bearer $OP_TOKEN" | jq '.events[].event_type'

Recommended cadence: monthly restore drill into a scratch environment, plus a restore drill immediately before any major upgrade.
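
The curl checks above can be folded into one script for scheduled drills. A sketch using only the standard library: the endpoint paths and JSON fields (metadata.ready, runs[].run_id, events[].event_type) are taken from the commands above, while the function names and the pass criteria are assumptions.

```python
import json
import urllib.request


def fetch_json(url, token=None, timeout=10):
    """GET a URL and decode the JSON body; HTTP errors raise, like `curl -f`."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def evaluate_drill(ready, runs, events, expected_run_id=None):
    """Return a list of failure messages; an empty list means the drill passed."""
    failures = []
    if ready.get("metadata", {}).get("ready") is not True:
        failures.append("readyz did not report ready")
    run_ids = [run.get("run_id") for run in runs.get("runs", [])]
    if not run_ids:
        failures.append("no runs returned")
    elif expected_run_id is not None and expected_run_id not in run_ids:
        failures.append(f"known run {expected_run_id} missing")
    if not events.get("events"):
        failures.append("no audit events returned")
    return failures


def run_drill(base_url, token, expected_run_id=None):
    """Query the restored stack and evaluate the three responses."""
    return evaluate_drill(
        fetch_json(f"{base_url}/readyz"),
        fetch_json(f"{base_url}/v1/runs?limit=5", token),
        fetch_json(f"{base_url}/v1/audit-events?limit=5", token),
        expected_run_id,
    )
```

Keeping evaluate_drill free of network calls makes the pass criteria easy to test on their own. Note that with limit=5 the known-run check only sees the five runs the endpoint returns, so pick a run you expect in that page or adjust the query.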

Rolling back after a failed upgrade¶

Migrations are schema-forward and not safe to downgrade in place. If a new release misbehaves and the issue can't be patched forward:

1. Stop the service. Quiesce writers; the proxy can keep returning 503 from /readyz until step 5.
```
docker compose --env-file deploy/production.env stop api
```
2. Restore persistence from the pre-upgrade backup using the PostgreSQL or SQLite procedure above.
3. Re-install the previous version. Pin the image tag or reinstall the pip package at the prior version. Update Compose / Kubernetes manifests accordingly.

4. Restart the service. Use up -d rather than start so that Compose recreates the container from the pinned image; start only restarts the existing container.

docker compose --env-file deploy/production.env up -d api

5. Verify with curl /readyz and the deployment smoke (deployment_smoke.md).

Keep the pre-upgrade backup until you have verified the new version through at least one business cycle.

Disaster recovery checklist¶

| Scenario | Recovery |
| --- | --- |
| Lost the host, database survived | Provision new host → install `kneo-serv` → point `KNEO_SERV_DATABASE_URL` at the surviving DB → start. |
| Lost the database | Provision DB → restore latest dump → start service → verify `/readyz` and a known run. |
| Lost host and database | Provision DB → restore latest off-site dump → provision host → start service → verify. |
| Corrupted checkpoints for one run | Use `GET /v1/runs/{run_id}/checkpoints/diff` to identify the bad checkpoint; cancel and re-run from the last good step. The DB itself is fine. |
| Restore brought back stale data, signs of mismatch | See `troubleshooting.md § 2.5` for the recovery shape. |

What this page does not cover¶

Performance and capacity sizing. Covered in its own guide: performance.md — throughput, latency, store choice, and the bench harness for reproducing numbers on your own hardware.
The Python backup API surface. Stays in service_api.md § Backup and restore.
Release-team verification gates for the GA cut. Those live in release_checklist.md.