Backup and recovery¶
Production procedure for backing up kneo-serv state, verifying restores,
and rolling back a deployment. This page consolidates the operator surface;
the underlying Python API and SQL commands stay in their respective
references.
For the upgrade context that ends in "and keep a backup", see
upgrade.md. For the Python backup API used by the SQLite
maintenance helpers, see
service_api.md § Backup and restore.
What needs to be preserved¶
| State | Where it lives | Backup mechanism |
|---|---|---|
| Run state, queue, checkpoints, audit events, idempotency, locks, policies | PostgreSQL (KNEO_SERV_DATABASE_URL set) or SQLite at .kneo/kneo_runs.sqlite (default) | pg_dump / SQLite online-backup |
| Workflow continuations | PostgreSQL when set, otherwise files under .kneo/continuations/ | DB dump or filesystem backup |
| Spec bundles | Source repo + your CI artifacts (signed bundles) | Repo + artifact store |
| Artifacts (workflow outputs) | Filesystem paths declared by your specs | Filesystem backup |
| Logs | stdout via container log driver → log aggregator | Aggregator retention |
The DB is the load-bearing piece. Everything else can be reconstructed from the DB and your spec repo, with two exceptions: artifacts, which always live on the filesystem, and continuations, which are stored as files when PostgreSQL is not configured.
PostgreSQL — production path¶
The Compose stack and any production deployment should set
KNEO_SERV_DATABASE_URL. In that mode all state above (except artifacts)
lives in PostgreSQL.
Take a backup¶
docker compose --env-file deploy/production.env exec -T db \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" \
  | gzip > "kneo_serv-$(date +%Y%m%d-%H%M).sql.gz"
For a host-level Postgres install, run pg_dump directly as the
postgres user; the data shape is the same.
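A shell pipeline reports the exit status of its last command (gzip here), so a failed pg_dump can go unnoticed unless the shell runs with pipefail. The following is a minimal Python sketch of the same backup that checks pg_dump's own exit code. It assumes the Compose service is named db, as in the command above; backup_filename and take_backup are illustrative names, not part of kneo-serv.

```python
import gzip
import shutil
import subprocess
from datetime import datetime


def backup_filename(now: datetime) -> str:
    """Mirror the shell naming: kneo_serv-YYYYmmDD-HHMM.sql.gz."""
    return f"kneo_serv-{now:%Y%m%d-%H%M}.sql.gz"


def take_backup(user: str, database: str,
                env_file: str = "deploy/production.env") -> str:
    """Stream pg_dump from the db service into a gzip file; return its name."""
    target = backup_filename(datetime.now())
    cmd = [
        "docker", "compose", "--env-file", env_file,
        "exec", "-T", "db",  # -T: no pseudo-TTY, so the piped dump is not altered
        "pg_dump", "-U", user, database,
    ]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc, \
            gzip.open(target, "wb") as out:
        shutil.copyfileobj(proc.stdout, out)
    if proc.returncode != 0:
        # A failed dump can still leave a small, plausible-looking file behind.
        raise RuntimeError(
            f"pg_dump exited with status {proc.returncode}; discard {target}"
        )
    return target
```

Call take_backup with the same user and database values that deploy/production.env defines; the function raises rather than returning a file name when the dump did not complete.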
Restore from a backup (destructive — wipes current state)¶
gunzip -c kneo_serv-YYYYmmDD-HHMM.sql.gz \
| docker compose --env-file deploy/production.env exec -T db \
psql -U "$POSTGRES_USER" "$POSTGRES_DB"
Restore replaces every row in the database. Stop the API container first so no in-flight write races the restore:
docker compose --env-file deploy/production.env stop api
gunzip -c kneo_serv-YYYYmmDD-HHMM.sql.gz \
| docker compose --env-file deploy/production.env exec -T db \
psql -U "$POSTGRES_USER" "$POSTGRES_DB"
docker compose --env-file deploy/production.env start api
Off-site rotation¶
Local backups protect against operator error, not host loss. After each dump, copy the gzip off the host:
- S3, Azure Blob, or GCS bucket with versioning + lifecycle to archive older dumps.
- Encrypt at rest (server-side encryption is sufficient if your control plane is locked down; client-side encryption for stricter regimes).
- Use separate IAM identities for upload-only access and for read access, so the credential that writes backups cannot read or delete them.
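Before trusting an off-site copy, it helps to confirm that the gzip decompresses cleanly and that the remote object matches the local file. The sketch below covers the local half with the standard library only: a SHA-256 digest to compare against whatever checksum your object store reports, and a full read of the archive to catch truncation. The function names are illustrative, and how you fetch the remote checksum depends on your provider.

```python
import gzip
import hashlib
import zlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_readable(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Decompress the whole file once; truncated or corrupt archives fail."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

Recording the digest next to the dump at upload time gives the restore drill something concrete to compare against later.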
Data-only restore into a clean volume¶
For test-restore drills and disaster recovery into a fresh PostgreSQL
volume, the service handles schema migrations on startup. Capture a
data-only dump and exclude the schema_migrations rows so the new
volume's migration state isn't overwritten:
docker compose --env-file deploy/production.env exec -T db \
pg_dump -U "$POSTGRES_USER" -d "$POSTGRES_DB" --data-only --inserts \
-f /tmp/kneo_serv_data.sql
docker cp <db-container-id>:/tmp/kneo_serv_data.sql /tmp/kneo_serv_data.sql
grep -v "INSERT INTO public.schema_migrations" /tmp/kneo_serv_data.sql \
> /tmp/kneo_serv_data_restore.sql
Restore into a clean volume after the API has come up at least once (so migrations have run):
docker compose --env-file deploy/production.env down -v
docker compose --env-file deploy/production.env up --build -d
docker cp /tmp/kneo_serv_data_restore.sql \
<db-container-id>:/tmp/kneo_serv_data_restore.sql
docker compose --env-file deploy/production.env exec -T db \
psql -v ON_ERROR_STOP=1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" \
-f /tmp/kneo_serv_data_restore.sql
docker compose --env-file deploy/production.env restart api
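ON_ERROR_STOP=1 aborts the restore on the first failing INSERT instead of carrying on past the error. Statements that ran before the failure stay committed unless psql is also given --single-transaction, so treat an aborted restore as incomplete and rerun it against a clean volume.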
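The grep -v step in the dump procedure can also be done in Python, which makes it easy to confirm that the filter actually matched something. This is a sketch under the same assumption as the grep command: each schema_migrations row sits on its own line beginning with the INSERT text that pg_dump --inserts emits. filter_dump is an illustrative name.

```python
MIGRATION_INSERT = "INSERT INTO public.schema_migrations"


def filter_dump(src: str, dst: str) -> int:
    """Copy src to dst without schema_migrations rows; return rows dropped."""
    dropped = 0
    # newline="" keeps line endings exactly as pg_dump wrote them.
    with open(src, encoding="utf-8", newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            if MIGRATION_INSERT in line:  # same substring match as grep -v
                dropped += 1
            else:
                fout.write(line)
    return dropped
```

A return value of 0 is worth investigating before restoring: it suggests the pattern did not match, for example because the table lives in a different schema.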
SQLite — single-host installs¶
When KNEO_SERV_DATABASE_URL is unset, run state lives in
.kneo/kneo_runs.sqlite and continuations in .kneo/continuations/. The
service ships an online backup helper:
from kneo_serv.maintenance import backup_sqlite_database, restore_sqlite_database

# Online: safe while the service is running
backup_sqlite_database(
    ".kneo/kneo_runs.sqlite",
    ".kneo/backups/kneo_runs-2026-05-12.sqlite",
)

# Restore into a new location, then swap into place during a window
restore_sqlite_database(
    ".kneo/backups/kneo_runs-2026-05-12.sqlite",
    ".kneo/kneo_runs.restored.sqlite",
)
backup_sqlite_database uses SQLite's backup() API and is safe to run
against a live database. restore_sqlite_database is a plain file copy —
stop the service before swapping the restored file into the live path,
or you'll race the writer.
Also back up .kneo/continuations/ and any artifact paths your specs
write to; these are not inside the SQLite file.
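For readers who want to see what an online SQLite backup looks like underneath, here is a sketch using only Python's standard sqlite3 module, whose Connection.backup method wraps SQLite's backup API. It is not the kneo_serv.maintenance implementation, and the function names are illustrative. The integrity check is an extra step you may want to run on any backup file before relying on it.

```python
import sqlite3


def online_backup(src_path: str, dest_path: str) -> None:
    """Copy a SQLite database that may be in use by other connections."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies every page into dest as a consistent snapshot
    finally:
        dest.close()
        src.close()


def integrity_ok(db_path: str) -> bool:
    """Run PRAGMA integrity_check; a healthy database returns one row, 'ok'."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```

Running integrity_ok against the backup file, not the live database, keeps the check off the service's write path.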
Backup frequency¶
There is no single recommended cadence. Tie it to your retention policy and your tolerance for re-running work:
| Workload shape | Cadence |
|---|---|
| Low run volume, short retention | Daily dump, 30-day retention |
| Active production, multi-day retention enabled (KNEO_SERV_RETENTION_*) | Hourly dump, 7-day retention; daily off-site copy |
| Audit-heavy compliance workloads | Per-hour dump kept for the compliance window; verified test-restore monthly |
The relevant env vars are in
environment.md § Retention. A retention
policy that prunes runs after 7 days needs backups newer than 7 days. Restore an
older one and every run in it is already past the cutoff, so the restore set is effectively empty.
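The relationship between backup age and the retention window is simple arithmetic, sketched below. The sketch assumes dumps follow the kneo_serv-YYYYmmDD-HHMM.sql.gz naming used in the pg_dump example and that the timestamp in the name is local time. The function names and the retention_days parameter are illustrative; read the real window from your KNEO_SERV_RETENTION_* settings.

```python
from datetime import datetime, timedelta

# Matches the file naming used in the pg_dump example on this page.
DUMP_NAME_FORMAT = "kneo_serv-%Y%m%d-%H%M.sql.gz"


def dump_timestamp(filename: str) -> datetime:
    """Recover the naive timestamp embedded in a dump file name."""
    return datetime.strptime(filename, DUMP_NAME_FORMAT)


def within_retention(filename: str, now: datetime, retention_days: int) -> bool:
    """True if the dump was taken more recently than the retention cutoff."""
    return now - dump_timestamp(filename) < timedelta(days=retention_days)
```

A scheduled job could call within_retention on the newest dump and alert when it returns False, since at that point a restore would bring back only runs that are due for pruning.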
Verifying a restore¶
Backups are unproven until they have been restored. Verify on the schedule below, not after a real incident.
- Provision a scratch host or namespace and restore the backup into it.
- Start kneo-serv against the restored database.
- Verify dependencies:
- Verify a known run survived:
- Run the deployment smoke against the restored stack (deployment_smoke.md). It exercises run create → fetch → cancel and confirms checkpoints persist.
- Verify audit events from before the backup are present:
Recommended cadence: monthly restore drill into a scratch environment, plus a restore drill immediately before any major upgrade.
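The readiness part of a restore drill is easy to script. The sketch below polls /readyz until the restored service answers, using only the standard library. It assumes /readyz returns HTTP 200 once the service is ready (this page only states that it returns 503 while it is not); the base URL is a placeholder for your scratch environment, and the function names are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the service is still starting
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_ready(
    base_url: str,
    attempts: int = 30,
    delay: float = 2.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll base_url/readyz until it returns 200 or attempts run out."""
    for _ in range(attempts):
        if probe(f"{base_url}/readyz") == 200:  # assumed ready status
            return True
        sleep(delay)
    return False
```

The probe and sleep parameters exist so the loop can be exercised without a running service; in a drill, call wait_until_ready with just the scratch URL and continue to the run and audit checks only when it returns True.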
Rolling back after a failed upgrade¶
Migrations are schema-forward and not safe to downgrade in place. If a new release misbehaves and the issue can't be patched forward:
1. Stop the service. Quiesce writers; the proxy can keep returning 503 from /readyz until step 5.
2. Restore persistence from the pre-upgrade backup using the PostgreSQL or SQLite procedure above.
3. Re-install the previous version. Pin the image tag or reinstall the pip package at the prior version. Update Compose / Kubernetes manifests accordingly.
4. Restart the service.
5. Verify with curl /readyz and the deployment smoke (deployment_smoke.md).
Keep the pre-upgrade backup until you have verified the new version through at least one business cycle.
Disaster recovery checklist¶
| Scenario | Recovery |
|---|---|
| Lost the host, database survived | Provision new host → install kneo-serv → point KNEO_SERV_DATABASE_URL at the surviving DB → start. |
| Lost the database | Provision DB → restore latest dump → start service → verify /readyz and a known run. |
| Lost host and database | Provision DB → restore latest off-site dump → provision host → start service → verify. |
| Corrupted checkpoints for one run | Use GET /v1/runs/{run_id}/checkpoints/diff to identify the bad checkpoint; cancel and re-run from the last good step. The DB itself is fine. |
| Restore brought back stale data, signs of mismatch | See troubleshooting.md § 2.5 for the recovery shape. |
What this page does not cover¶
- Performance and capacity sizing. Deferred until the benchmark suite ships; see TODO-docs.md § Performance and capacity guide.
- The Python backup API surface. Stays in service_api.md § Backup and restore.
- Release-team verification gates for the GA cut. Those live in release_checklist.md.