Tutorial: deploying with PostgreSQL from zero
End-to-end deployment of kneo-serv against PostgreSQL using the bundled
Docker Compose stack: rendering env files, starting the service, verifying
readiness, and running smoke tests. Budget about 30 minutes from a fresh
checkout to a running deployment.
For the reference on deployment shapes and persistence selection, see
deployment.md. For environment-variable semantics, see
environment.md.
Prerequisites
- Docker 24+ and docker compose.
- git, curl, jq, and a shell that supports $() substitution.
- python3 ≥ 3.12 for running the deployment-smoke script.
- Network access to pull the postgres:16 and Python base images.
This tutorial uses 127.0.0.1; for a real deployment, substitute your host or load-balancer URL throughout.
1 · Clone and prepare the env file
git clone git@github.com:kneo-agent/kneo-serv.git
cd kneo-serv
cp deploy/production.env.example deploy/production.env
chmod 600 deploy/production.env
deploy/production.env is gitignored — it'll hold your real secrets.
Edit it now and replace every replace-… placeholder. The minimum
set you must change before binding to a network:
# deploy/production.env
# Auth — replace each token with a high-entropy value.
KNEO_SERV_AUTH_ENABLED=true
KNEO_SERV_API_KEYS=operator:OP_TOKEN:operator;reviewer:REV_TOKEN:reviewer;viewer:VIEW_TOKEN:viewer
KNEO_SERV_ADMIN_API_KEY=ADMIN_TOKEN
KNEO_SERV_SPEC_SIGNING_KEY=SPEC_SIGNING_HMAC_KEY
# PostgreSQL — match the password to the one you'll set below.
POSTGRES_DB=kneo_serv
POSTGRES_USER=kneo_serv
POSTGRES_PASSWORD=DB_PASSWORD
Generate strong tokens:
# 32-byte hex tokens
for name in OP_TOKEN REV_TOKEN VIEW_TOKEN ADMIN_TOKEN SPEC_SIGNING_HMAC_KEY DB_PASSWORD; do
  printf '%s=%s\n' "$name" "$(openssl rand -hex 32)"
done
Paste the generated values into deploy/production.env. Do not commit
this file.
2 · Validate the env file
The repository includes a validator that catches common mistakes: placeholder tokens, incomplete scoped roles, a missing DSN, and telemetry payload capture left on by accident. Run it against deploy/production.env and address any errors before continuing. Common findings:
- replace-… strings still present.
- Scoped role list missing operator, reviewer, or viewer.
- KNEO_SERV_OTEL_RECORD_ARGUMENTS=true without an explicit data-classification override.
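The validator invocation itself is not reproduced in this text. As a rough stand-in for the first finding only, the sketch below looks for leftover placeholders, assuming they all contain the literal prefix replace- as described in step 1. The function name check_placeholders is hypothetical and is not part of kneo-serv; it does not replace the real validator.

```shell
# Hypothetical helper, not part of kneo-serv: report non-comment lines of an
# env file that still contain a "replace-" placeholder.
check_placeholders() {
  env_file=$1
  [ -r "$env_file" ] || { echo "cannot read $env_file" >&2; return 2; }
  # ^[^#]* restricts the match to text before any '#', so commented-out
  # lines are ignored. grep exits 0 when at least one line matches.
  if grep -n '^[^#]*replace-' "$env_file"; then
    echo "placeholders remain in $env_file" >&2
    return 1
  fi
  echo "no placeholders found in $env_file"
}
```

Usage would be `check_placeholders deploy/production.env`; a non-zero exit status means at least one placeholder is still present or the file could not be read.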
3 · Start the Compose stack
The stack runs the API plus PostgreSQL with a persistent volume. Start it with docker compose up, passing --env-file deploy/production.env along with --build and -d. --build rebuilds the API image so any local edits land; -d runs detached. To watch logs, follow the api service with docker compose logs -f.
You should see the API come up after PostgreSQL passes its healthcheck (about 10–20 seconds on a cold start). The API logs include a line per migration applied at first startup.
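The start and log commands described in this step can be sketched as follows. This assumes the Compose file is discovered from the repository root (as the restart command in step 7 implies) and that the API service is named api (also taken from step 7). The function names are hypothetical conveniences, not part of the repository.

```shell
# Thin wrapper so every command uses the same env file as step 7.
compose() {
  docker compose --env-file deploy/production.env "$@"
}

# Start the stack: rebuild the API image (--build) and detach (-d).
start_stack() {
  compose up --build -d
}

# Follow the API container's logs. Ctrl-C stops following, not the container.
follow_api_logs() {
  compose logs -f api
}
```

Run `start_stack` from the repository root, then `follow_api_logs` in the same shell to watch the migrations apply.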
4 · Verify readiness
/livez returns {"ok": true, "metadata": {}} as soon as the process
accepts connections. /readyz returns 200 only after every dependency
check passes:
{
  "ok": true,
  "metadata": {
    "ready": true,
    "manager": "PlatformManager",
    "checks": {
      "run_state_store": {"name": "run_state_store", "ok": true},
      "continuation_store": {"name": "continuation_store", "ok": true},
      "queue": {"name": "queue", "ok": true},
      "runtime_registry": {"name": "runtime_registry", "ok": true, "count": 3, "names": ["adapter", "bridge", "native"]},
      "tool_registry": {"name": "tool_registry", "ok": true, "count": 4, "names": ["compress_history", "publish_report", "summarize", "web_search"]},
      "providers": {"name": "providers", "ok": true},
      "mcp": {"name": "mcp", "ok": true}
    }
  }
}
If you get a 503, the body identifies which check failed; see troubleshooting.md § 1.2.
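Because /readyz only turns 200 once every check passes, a short polling loop is often more convenient than retrying by hand. The sketch below is one way to do that with curl. The function name wait_ready and its defaults (30 attempts, 2 seconds apart) are assumptions for illustration, and the base URL must be supplied by the caller since this text does not state the API port.

```shell
# Hypothetical helper: poll <base-url>/readyz until it returns HTTP 200
# or the attempt budget runs out.
wait_ready() {
  base_url=$1
  attempts=${2:-30}   # number of polls
  delay=${3:-2}       # seconds between polls
  i=0
  status=000
  while [ "$i" -lt "$attempts" ]; do
    # -s hides progress, -o discards the body, -w prints only the status code.
    status=$(curl -s -o /dev/null -w '%{http_code}' "$base_url/readyz") || status=000
    if [ "$status" = "200" ]; then
      echo "ready after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempts (last status: $status)" >&2
  return 1
}
```

Usage would be `wait_ready "$BASE"`. The same call could stand in for the fixed `sleep 5` in step 7, since it returns as soon as the restarted API reports ready.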
5 · Run the deployment smoke
The smoke script exercises the full path: auth, spec validation, run creation, human resume, audit listing, credential inventory, and policy update.
# BASE is the API base URL used for /livez and /readyz in step 4.
export BASE=<your-api-base-url>
export OP_TOKEN=<your-operator-token>
export REV_TOKEN=<your-reviewer-token>
export VIEW_TOKEN=<your-viewer-token>
python3 scripts/deployment_smoke.py \
  --base-url "$BASE" \
  --operator-token "$OP_TOKEN" \
  --reviewer-token "$REV_TOKEN" \
  --viewer-token "$VIEW_TOKEN"
A clean run prints each step with a PASS and exits 0. If any step
fails, the script identifies the failing endpoint and HTTP status.
See deployment_smoke.md for the full step list
and what each step verifies.
6 · Submit a real run
curl -sf -X POST "$BASE/v1/runs" \
  -H "Authorization: Bearer $OP_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "smoke",
    "spec_path": "examples/smoke_human_workflow.yaml",
    "target": "workflow"
  }' | jq
This spec uses the in-process dummy provider so it runs without
provider credentials. You should see a paused response with a
continuation_id (the workflow has a human step). Resume it:
curl -sf -X POST "$BASE/v1/human-tasks/cont_…/resume" \
  -H "Authorization: Bearer $REV_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"request_id": "req_…", "decision": "approved"}' | jq
For the full HITL flow, see human_in_the_loop.md.
7 · Verify persistence survives restart
Confirm PostgreSQL volume persistence:
docker compose --env-file deploy/production.env restart api
sleep 5
curl -sf "$BASE/v1/runs?limit=5" \
  -H "Authorization: Bearer $OP_TOKEN" | jq '.runs[].run_id'
You should see the run id from step 6 in the list, even after the API
container restarts. The data lives in the postgres-data named
volume, not the container layer.
8 · Capacity tuning knobs
For a real production deployment, revisit these env vars in
deploy/production.env after you have load profile data:
| Variable | Default | Tune when |
|---|---|---|
| KNEO_SERV_PROVIDER_TIMEOUT_SECONDS | 120 | Provider tail latency exceeds default. |
| KNEO_SERV_PROVIDER_RETRIES | 2 | Provider has documented transient error rate. |
| KNEO_SERV_MAX_BODY_BYTES | 1 MiB | You receive larger inline specs or override payloads. |
| KNEO_SERV_MAX_INPUT_CHARS | 20000 | Run inputs are larger than the default. |
| KNEO_SERV_RETENTION_RUNS_DAYS | unset | Storage growth requires capping run history. |
| KNEO_SERV_CHECKPOINT_COMPRESS_BYTES | 64 KiB | Many large checkpoints; reduce to compress more. |
Full list and semantics: environment.md.
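The table gives two defaults in MiB and KiB. Assuming the byte-named variables take plain integer byte counts (the names suggest this, but environment.md is the authority), the conversions can be computed in the shell. The 4 MiB and 32 KiB targets below are illustrative only, not recommendations, and the mib and kib helpers are hypothetical.

```shell
# Convert human-readable sizes to byte counts for the *_BYTES variables.
# Assumption: these variables accept plain integers in bytes.
mib() { echo $(( $1 * 1024 * 1024 )); }
kib() { echo $(( $1 * 1024 )); }

# Illustrative values only: raise the body limit to 4 MiB and lower the
# checkpoint compression threshold to 32 KiB.
printf 'KNEO_SERV_MAX_BODY_BYTES=%s\n' "$(mib 4)"
printf 'KNEO_SERV_CHECKPOINT_COMPRESS_BYTES=%s\n' "$(kib 32)"
```

The printed lines are in env-file form, so they can be pasted into deploy/production.env once the format is confirmed against environment.md.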
9 · Back up the database
A seeded backup/restore drill is part of the release checklist. The shape for production:
# Assumes POSTGRES_USER and POSTGRES_DB are set in your shell to the values in deploy/production.env.
# Backup (-T disables TTY allocation so the dump is piped unmodified)
docker compose --env-file deploy/production.env exec -T db \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" \
  | gzip > "kneo_serv-$(date +%Y%m%d-%H%M).sql.gz"
# Restore (DESTRUCTIVE: wipes current data)
gunzip -c kneo_serv-YYYYMMDD-HHMM.sql.gz \
  | docker compose --env-file deploy/production.env exec -T db \
  psql -U "$POSTGRES_USER" "$POSTGRES_DB"
The full drill — including verifying that runs, checkpoints, audit events, and policy metadata survive a restore — is in release_checklist.md.
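Before relying on a dump file, it is worth checking that the archive is intact and not empty. The sketch below is a minimal sanity check using standard gzip tooling. The function name verify_backup is hypothetical, and the check says nothing about whether the SQL inside restores cleanly; that is what the release-checklist drill covers.

```shell
# Hypothetical helper: sanity-check a compressed dump before relying on it.
verify_backup() {
  backup=$1
  # gzip -t tests the compressed stream's integrity without writing output.
  gzip -t "$backup" 2>/dev/null || { echo "corrupt or missing: $backup" >&2; return 1; }
  # An empty dump still decompresses cleanly, so also require some content.
  if ! gunzip -c "$backup" | grep -q .; then
    echo "empty dump: $backup" >&2
    return 1
  fi
  echo "ok: $backup"
}
```

Usage would be `verify_backup kneo_serv-YYYYMMDD-HHMM.sql.gz` with the real file name substituted.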
10 · Tear down
To stop the stack but keep data, run docker compose down with the same --env-file deploy/production.env option.
To stop the stack and delete the PostgreSQL volume (destroys runs, checkpoints, audit events, continuations), add -v to that command.
Use the volume-deleting form for clean re-tests; never run it against a production deployment without a verified backup.
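The two teardown forms can be sketched as below, assuming the standard Docker Compose behavior where down removes containers and networks and -v additionally removes named volumes. The function names and the typed confirmation are hypothetical additions meant to make the destructive form harder to run by accident.

```shell
# Same wrapper as used elsewhere: always pass the production env file.
compose() {
  docker compose --env-file deploy/production.env "$@"
}

# Stop and remove containers and networks; named volumes are kept.
stop_keep_data() {
  compose down
}

# Also remove named volumes (-v). DESTRUCTIVE.
# Requires typing the word "delete" on stdin to proceed.
stop_delete_data() {
  printf 'Type "delete" to remove the PostgreSQL volume: ' >&2
  read -r answer
  [ "$answer" = "delete" ] || { echo "aborted" >&2; return 1; }
  compose down -v
}
```

For clean re-tests, `stop_delete_data` followed by a fresh start gives an empty database; `stop_keep_data` leaves the postgres-data volume in place.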
Common failure modes
| Symptom | See |
|---|---|
| API container restarts with KNEO service auth is enabled but no API keys are configured | troubleshooting.md § 1.1 |
| /readyz 503 with run_state_store not ok | troubleshooting.md § 2.2 |
| Service writes to SQLite even with DSN set | troubleshooting.md § 2.1 |
| Smoke script fails on policy write | API key probably missing policies:write; see troubleshooting.md § 4.2 |
Where to go next
- staging_release_runbook.md — promotion path beyond a single host.
- deployment.md — reference for deployment shapes, including running without Compose.
- environment.md — every env var.
- tutorial_custom_tool.md — extend this deployment with custom tools.