Incident response

On-call entry point for kneo-serv. This page is the triage tree — "the service looks wrong, where do I look first?" The symptom-by-symptom deep dive lives in troubleshooting.md; this page sends you there.

For backup and rollback procedures, see backup_and_recovery.md. For the API definition of the health endpoints, see service_api.md § Health checks.

Triage tree

   ┌── /healthz returns 200 ───── service process is alive
   │                              ─ check /readyz next
   │
1. │── /healthz times out ─────── process is down or unreachable
   │                              ─ check the container / supervisor
   │                              ─ check the reverse proxy upstream
   │
   └── /healthz 5xx ───────────── application crashed mid-request
                                  ─ check stderr / container logs
                                  ─ see troubleshooting.md § 1

   ┌── /readyz returns 200 ────── dependencies healthy
   │                              ─ problem is in a specific run/spec
   │                              ─ check the run path below
   │
2. │── /readyz returns 503 ────── read the `detail.metadata.checks` payload
   │                              ─ use the matrix below to find the
   │                                failing dependency, then jump to
   │                                the matching troubleshooting § n
   │
   └── /readyz times out ──────── same path as /healthz timeout above

/healthz and /readyz are unauthenticated by design — you can probe them from anywhere you can reach the service port.

curl -sf http://<host>:<port>/healthz | jq
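# no -f for /readyz: with -f curl discards the 503 body, which is the part you need
curl -s  http://<host>:<port>/readyz  | jq '.detail.metadata.checks // .metadata.checks'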
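The two-step tree above can be sketched as a small script for hosts where curl and jq are not available. This is an illustration, not part of kneo-serv: the function names `probe` and `triage`, the 5-second timeout, and the short labels in the return strings are all assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for url, or None on timeout / connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code; the tree branches on it.
        return exc.code
    except OSError:
        # URLError, socket timeouts and refused connections all derive from OSError.
        return None


def triage(healthz: Optional[int], readyz: Optional[int] = None) -> str:
    """Map the two probe results to the next step in the triage tree."""
    unreachable = (
        "down-or-unreachable: check the container / supervisor "
        "and the reverse proxy upstream"
    )
    if healthz is None:
        return unreachable
    if healthz >= 500:
        return "crashed: check stderr / container logs, see troubleshooting.md § 1"
    if healthz != 200:
        return f"unexpected /healthz status {healthz}: not covered by the tree"
    # /healthz is green, so the process is alive; move on to /readyz.
    if readyz is None:
        return unreachable
    if readyz == 200:
        return "deps-healthy: problem is in a specific run/spec"
    if readyz == 503:
        return "not-ready: read detail.metadata.checks and use the failure matrix"
    return f"unexpected /readyz status {readyz}: not covered by the tree"
```

A typical call would be `triage(probe(base + "/healthz"), probe(base + "/readyz"))`, where `base` is `http://<host>:<port>`.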

`/readyz` failure matrix

/readyz runs eight checks (source: kneo_serv/service/routes_health.py); when any of them fails, the response is 503. The 503 body is wrapped in FastAPI's detail envelope: {"detail": {"error": "not_ready", "message": "…", "metadata": {"ready": false, "manager": "…", "checks": {...}}}}, so extract the checks with jq '.detail.metadata.checks'. The keys you will see and what each means:

Check key	What it probes	Common failure	Where to go (troubleshooting.md § unless noted)
`api`	API wiring sentinel	Never fails on its own	If it does, the manager isn't configured — § 1.3
`run_state_store`	`manager.run_state_store.list_runs(limit=1)` succeeds	DB unreachable, schema missing, credentials wrong	§ 2.1 / § 2.2
`continuation_store`	`manager.continuation_store.list()` succeeds	File path missing, permissions wrong, DB issue	§ 2
`queue`	`list_queued_runs(status=…)` returns for `queued` / `running` / `failed`	Queue table missing or DB stall	§ 2, § 5.1
`runtime_registry`	Number of registered runtimes (declared via factories)	Empty registry — service started without runtimes	`extending.md` for runtime registration
`tool_registry`	Number of registered tools	Empty registry — service started without tools	`extending.md`
`providers`	Secrets named in `KNEO_SERV_HEALTH_PROVIDERS` resolve	Provider env var missing or empty	§ 3.2
`mcp`	Secrets named in `KNEO_SERV_HEALTH_MCP_SECRETS` resolve	MCP secret missing	§ 3.2

A failed check reports {"name": …, "ok": false, "error": "check_failed"}. The error value is a fixed literal: the probe deliberately does not leak the exception class or a detail message to an unauthenticated caller. The specific failure detail is logged server-side on the kneo_serv.service logger; use the check name to find the matching troubleshooting.md section in the matrix above.
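As a sketch of how the matrix can be applied mechanically, the snippet below pulls the failing checks out of a /readyz body and looks up where to go next. The function name `failing_checks` and the `WHERE_TO_GO` table are illustrative, with the table copied from the matrix above. The exact shape of `checks` is an assumption: the code accepts either a mapping of check key to result or a list of results that carry `name`, and it assumes a 200 body has the same payload without the `detail` wrapper.

```python
import json
from typing import Dict

# Check key -> where to go, copied from the failure matrix on this page.
WHERE_TO_GO: Dict[str, str] = {
    "api": "troubleshooting.md § 1.3",
    "run_state_store": "troubleshooting.md § 2.1 / § 2.2",
    "continuation_store": "troubleshooting.md § 2",
    "queue": "troubleshooting.md § 2, § 5.1",
    "runtime_registry": "extending.md (runtime registration)",
    "tool_registry": "extending.md",
    "providers": "troubleshooting.md § 3.2",
    "mcp": "troubleshooting.md § 3.2",
}


def failing_checks(body: str) -> Dict[str, str]:
    """Return {check key: where to go} for every check that is not ok."""
    payload = json.loads(body)
    # 503 bodies nest the payload under "detail"; assume 200 bodies do not.
    detail = payload.get("detail")
    inner = detail if isinstance(detail, dict) else payload
    checks = inner.get("metadata", {}).get("checks", {})
    # Assumption: checks is a mapping of name -> result, or a list of
    # result objects that each carry a "name" field.
    if isinstance(checks, dict):
        results = list(checks.items())
    else:
        results = [(item.get("name"), item) for item in checks]
    return {
        name: WHERE_TO_GO.get(name, "not in the matrix")
        for name, result in results
        if not result.get("ok", False)
    }
```

Feeding it the output of `curl -s http://<host>:<port>/readyz` gives a short list of sections to open, which is usually quicker than reading the raw JSON during an incident.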

Common production incidents

If /healthz and /readyz are both green but the service is "wrong":

Symptom	First check	Deep dive (troubleshooting.md § unless noted)
All requests return `401`	`Authorization` / `X-Kneo-Api-Key` header is present and valid	§ 4.1, § 4.3
Specific consumer returns `403`	The key's role/scope covers the route	§ 4.2, `security_hardening.md § 2`
Async runs stuck in `queued`	Worker process is up; queue table reachable	§ 5.1
Runs hang mid-workflow	The step's tool or provider call is timing out	§ 5.3
`409 idempotency_key_conflict`	Caller is reusing a key with a different payload	§ 5.4
Tool reports `MissingSecretError`	Provider secret env var is set on the service host	§ 3.1
Logs missing `request_id`	You are reading the right logger (`kneo_serv.service`), not raw uvicorn	§ 7.1, `observability.md`
OpenTelemetry exporter silent	`KNEO_SERV_OTEL_ENABLED=true` and `[telemetry]` extra installed	§ 7.2, `observability.md`
Human task `409 resource_locked`	Another resume is in flight for the same continuation	§ 8.1
Restored backup but state looks stale or mismatched	Stop, re-verify the dump source, follow the recovery shape	§ 2.5, `backup_and_recovery.md`

What to capture before escalating

When the runbook doesn't have an entry that fits, capture this context before paging the on-call developer. It is the same set troubleshooting.md asks for in a bug report:

Service version and commit (pip show kneo-serv or the image tag).
Environment context (uname -a, Python version, Postgres version).
The request_id and run_id of an affected request.
GET /readyz body, even when it returns 200.
For run-shaped problems: GET /v1/runs/{run_id}, GET /v1/runs/{run_id}/trace, and GET /v1/runs/{run_id}/checkpoints (redacted output is fine to share).
For spec-shaped problems: the output of kneo spec validate <path> --json.
Recent audit events that mention the affected resource: GET /v1/audit-events?run_id=<id>.
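The HTTP parts of the list above can be gathered in one pass. The sketch below only builds the URLs to fetch. The helper name `capture_urls` and the example base URL are hypothetical, and the /v1 routes will most likely need the same `Authorization` / `X-Kneo-Api-Key` header as any other API call, which is left to the caller.

```python
from typing import List, Optional
from urllib.parse import quote, urlencode


def capture_urls(base_url: str, run_id: Optional[str] = None) -> List[str]:
    """List the endpoints worth saving before escalating."""
    base = base_url.rstrip("/")
    # The /readyz body is worth capturing even when it returns 200.
    urls = [f"{base}/readyz"]
    if run_id is not None:
        rid = quote(run_id, safe="")  # escape the id for use in a path segment
        urls += [
            f"{base}/v1/runs/{rid}",
            f"{base}/v1/runs/{rid}/trace",
            f"{base}/v1/runs/{rid}/checkpoints",
            f"{base}/v1/audit-events?" + urlencode({"run_id": run_id}),
        ]
    return urls
```

For example, `capture_urls("http://localhost:8000", "run-1")` (port 8000 is only a placeholder) returns five URLs, starting with `/readyz` and ending with the audit-event query for that run.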

When to roll back

If the incident started immediately after a deploy and isn't covered by the matrix above:

Confirm the deploy is the trigger: diff the running version against the previous version, and check that the first 503/5xx lines up in time with the deploy.
If it is, follow backup_and_recovery.md § Rolling back after a failed upgrade.

Rolling back when the trigger isn't the deploy throws away forward progress. Diagnose first.
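The "is the deploy the trigger?" check can be made explicit. The following is a minimal sketch under stated assumptions: the function name `deploy_is_likely_trigger` is made up for the example, and the 15-minute window is an arbitrary default, since this page does not quantify "immediately after". It is a sanity check to run before rolling back, not a substitute for diagnosis.

```python
from datetime import datetime, timedelta


def deploy_is_likely_trigger(
    running_version: str,
    previous_version: str,
    deployed_at: datetime,
    first_error_at: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed threshold, tune per service
) -> bool:
    """True when the version changed and the first 503/5xx closely follows the deploy."""
    if running_version == previous_version:
        # Nothing was actually deployed, so a rollback cannot help.
        return False
    delta = first_error_at - deployed_at
    # Errors that predate the deploy, or start long after it, point elsewhere.
    return timedelta(0) <= delta <= window
```

If this returns False, rolling back is unlikely to fix the incident and will discard forward progress, so keep diagnosing instead.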

What this page does not cover

Severity definitions and paging policy. Owned by your on-call rota, not by kneo-serv.
Post-incident review template. Out of scope.
Failure modes during a kneo-serv release itself — those are in release_checklist.md and troubleshooting.md § 9.