Incident response
On-call entry point for kneo-serv. This page is the triage tree —
"the service looks wrong, where do I look first?" The symptom-by-symptom
deep dive lives in troubleshooting.md; this page
sends you there.
For backup and rollback procedures, see
backup_and_recovery.md. For the API
definition of the health endpoints, see
service_api.md § Health checks.
Triage tree
   ┌── /healthz returns 200 ───── service process is alive
   │                              ─ check /readyz next
   │
1. │── /healthz times out ─────── process is down or unreachable
   │                              ─ check the container / supervisor
   │                              ─ check the reverse proxy upstream
   │
   └── /healthz 5xx ───────────── application crashed mid-request
                                  ─ check stderr / container logs
                                  ─ see troubleshooting.md § 1

   ┌── /readyz returns 200 ────── dependencies healthy
   │                              ─ problem is in a specific run/spec
   │                              ─ check the run path below
   │
2. │── /readyz returns 503 ────── read the `metadata.checks` payload
   │                              ─ use the matrix below to find the
   │                                failing dependency, then jump to
   │                                the matching troubleshooting § n
   │
   └── /readyz times out ──────── same path as /healthz timeout above
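/healthz and /readyz are unauthenticated by design, so you can probe
them from anywhere you can reach the service port. Leave out -f when
probing /readyz: with -f, curl discards the body of a 503 response,
which is exactly the payload you need to read.
curl -sf http://<host>:<port>/healthz | jq
curl -s http://<host>:<port>/readyz | jq '.metadata.checks'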
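The triage tree can also be walked from a script. The following is a minimal sketch, not part of kneo-serv: the function names `probe`, `triage_healthz`, and `triage_readyz` are hypothetical, and the returned strings simply restate the branches of the tree above.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code.
        return exc.code
    except OSError:
        # Covers URLError, connection refused, DNS failure, and timeouts.
        return None


def triage_healthz(status: Optional[int]) -> str:
    """Map a /healthz probe result to the matching branch of the triage tree."""
    if status is None:
        return "process is down or unreachable: check the container / supervisor and the reverse proxy upstream"
    if status == 200:
        return "service process is alive: check /readyz next"
    if 500 <= status <= 599:
        return "application crashed mid-request: check stderr / container logs, see troubleshooting.md § 1"
    return f"unexpected status {status}: not covered by the triage tree"


def triage_readyz(status: Optional[int]) -> str:
    """Map a /readyz probe result to the matching branch of the triage tree."""
    if status is None:
        return "same path as /healthz timeout"
    if status == 200:
        return "dependencies healthy: problem is in a specific run/spec"
    if status == 503:
        return "read the metadata.checks payload and use the failure matrix"
    return f"unexpected status {status}: not covered by the triage tree"
```

A typical call would be `triage_healthz(probe("http://<host>:<port>/healthz"))`, with host and port filled in for your deployment.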
/readyz failure matrix
/readyz runs eight checks; when any fails, the response is 503 with
{"error": "not_ready", "metadata": {"checks": {...}}}. The keys you
will see and what each means (see kneo_serv/service/routes_health.py):
| Check key | What it probes | Common failure | Where to go |
|---|---|---|---|
| api | API wiring sentinel | Never fails on its own | If it does, the manager isn't configured: § 1.3 |
| run_state_store | manager.run_state_store.list_runs(limit=1) succeeds | DB unreachable, schema missing, credentials wrong | § 2.1 / § 2.2 |
| continuation_store | manager.continuation_store.list() succeeds | File path missing, permissions wrong, DB issue | § 2 |
| queue | list_queued_runs(status=…) returns for queued / running / failed | Queue table missing or DB stall | § 2, § 5.1 |
| runtime_registry | Number of registered runtimes (declared via factories) | Empty registry: service started without runtimes | extending.md for runtime registration |
| tool_registry | Number of registered tools | Empty registry: service started without tools | extending.md |
| providers | Secrets named in KNEO_SERV_HEALTH_PROVIDERS resolve | Provider env var missing or empty | § 3.2 |
| mcp | Secrets named in KNEO_SERV_HEALTH_MCP_SECRETS resolve | MCP secret missing | § 3.2 |
The payload includes per-check error (exception class) and message
for failed checks, so you usually don't have to guess which subsystem
is at fault — copy the message into the relevant troubleshooting
section.
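Pulling the failing checks out of a 503 body and pairing them with the matrix can be scripted. This is a rough sketch under stated assumptions: the helper `failing_checks` is hypothetical, and the shape of an individual check object (an `error` and `message` field on failure, anything else counted as passing) is inferred from the description above rather than taken from routes_health.py.

```python
import json
from typing import Dict, List, Tuple

# Troubleshooting pointers copied from the failure matrix above.
WHERE_TO_GO: Dict[str, str] = {
    "api": "§ 1.3",
    "run_state_store": "§ 2.1 / § 2.2",
    "continuation_store": "§ 2",
    "queue": "§ 2, § 5.1",
    "runtime_registry": "extending.md",
    "tool_registry": "extending.md",
    "providers": "§ 3.2",
    "mcp": "§ 3.2",
}


def failing_checks(body: str) -> List[Tuple[str, str, str]]:
    """Return (check key, message, pointer) for every check that carries an error.

    Assumption: a failed check is an object with "error" and "message" fields;
    anything else is treated as passing.
    """
    checks = json.loads(body).get("metadata", {}).get("checks", {})
    failed = []
    for key, detail in checks.items():
        if isinstance(detail, dict) and detail.get("error"):
            pointer = WHERE_TO_GO.get(key, "no matrix entry")
            failed.append((key, str(detail.get("message", "")), pointer))
    return failed


# Hypothetical payload for illustration only; the "ok" field and the
# message text are invented, not taken from the service.
SAMPLE_BODY = json.dumps({
    "error": "not_ready",
    "metadata": {"checks": {
        "api": {"ok": True},
        "providers": {"error": "MissingSecretError", "message": "secret not set"},
    }},
})
```

Calling `failing_checks(SAMPLE_BODY)` yields one entry for `providers`, pointing at § 3.2.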
Common production incidents
If /healthz and /readyz are both green but the service is "wrong":
| Symptom | First check | Deep dive |
|---|---|---|
| All requests return 401 | Authorization / X-Kneo-Api-Key header is present and valid | § 4.1, § 4.3 |
| Specific consumer returns 403 | The key's role/scope covers the route | § 4.2, security_hardening.md § 2 |
| Async runs stuck in queued | Worker process is up; queue table reachable | § 5.1 |
| Runs hang mid-workflow | The step's tool or provider call is timing out | § 5.3 |
| 409 idempotency_key_conflict | Caller is reusing a key with a different payload | § 5.4 |
| Tool reports MissingSecretError | Provider secret env var is set on the service host | § 3.1 |
| Logs missing request_id | You are reading the right logger (kneo_serv.service), not raw uvicorn | § 7.1, observability.md |
| OpenTelemetry exporter silent | KNEO_SERV_OTEL_ENABLED=true and [telemetry] extra installed | § 7.2, observability.md |
| Human task 409 resource_locked | Another resume is in flight for the same continuation | § 8.1 |
| Restored backup but state looks stale or mismatched | Stop, re-verify the dump source, follow the recovery shape | § 2.5, backup_and_recovery.md |
What to capture before escalating
When the runbook doesn't have an entry that fits, capture this context
before paging the on-call developer. It is the same set
troubleshooting.md
asks for in a bug report:
- Service version and commit (pip show kneo-serv or the image tag).
- Environment context (uname -a, Python version, Postgres version).
- The request_id and run_id of an affected request.
- GET /readyz body, even when it returns 200.
- For run-shaped problems: GET /v1/runs/{run_id}, GET /v1/runs/{run_id}/trace, and GET /v1/runs/{run_id}/checkpoints (redacted output is fine to share).
- For spec-shaped problems: the output of kneo spec validate <path> --json.
- Recent audit events that mention the affected resource: GET /v1/audit-events?run_id=<id>.
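For run-shaped problems, the list of requests to capture can be generated from a base URL and a run ID. This is an illustrative sketch only: `escalation_requests` is a hypothetical helper, the paths are the ones listed above, and the /v1 routes presumably need the same Authorization / X-Kneo-Api-Key header as any other API call.

```python
from typing import List
from urllib.parse import quote, urlencode


def escalation_requests(base_url: str, run_id: str) -> List[str]:
    """Build the GET URLs worth capturing for one affected run."""
    base = base_url.rstrip("/")
    # Escape the run ID so unusual characters cannot alter the path.
    rid = quote(run_id, safe="")
    return [
        f"{base}/readyz",
        f"{base}/v1/runs/{rid}",
        f"{base}/v1/runs/{rid}/trace",
        f"{base}/v1/runs/{rid}/checkpoints",
        f"{base}/v1/audit-events?{urlencode({'run_id': run_id})}",
    ]
```

Each URL can then be fetched with curl or any HTTP client and the bodies attached to the escalation.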
When to roll back
If the incident started immediately after a deploy and isn't covered by the matrix above:
- Confirm the deploy is the trigger: diff the running version against the previous version, and check the time correlation with the first 503/5xx.
- If yes, follow backup_and_recovery.md § Rolling back after a failed upgrade.
Rolling back when the trigger isn't the deploy throws away forward progress. Diagnose first.
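The time-correlation part of that check can be sketched as a small function. The name `deploy_is_plausible_trigger` and the 15-minute default window are assumptions chosen for illustration, not kneo-serv policy, and a positive result only makes the deploy a plausible trigger; the version diff is still needed to confirm it.

```python
from datetime import datetime, timedelta, timezone


def deploy_is_plausible_trigger(
    deployed_at: datetime,
    first_error_at: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed threshold
) -> bool:
    """True when the first 503/5xx falls within `window` after the deploy."""
    delta = first_error_at - deployed_at
    # Errors that predate the deploy, or appear long after it, point elsewhere.
    return timedelta(0) <= delta <= window


# Example timestamps (hypothetical).
DEPLOYED_AT = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
FIRST_ERROR_AT = DEPLOYED_AT + timedelta(minutes=3)
```

With these example values, `deploy_is_plausible_trigger(DEPLOYED_AT, FIRST_ERROR_AT)` is true, while an error logged before the deploy would return false.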
What this page does not cover
- Severity definitions and paging policy. Owned by your on-call rota, not by kneo-serv.
- Post-incident review template. Out of scope.
- Failure modes during a kneo-serv release itself: those are in release_checklist.md and troubleshooting.md § 9.