Incident response

On-call entry point for kneo-serv. This page is the triage tree — "the service looks wrong, where do I look first?" The symptom-by-symptom deep dive lives in troubleshooting.md; this page sends you there.

For backup and rollback procedures, see backup_and_recovery.md. For the API definition of the health endpoints, see service_api.md § Health checks.

Triage tree

   ┌── /healthz returns 200 ───── service process is alive
   │                              ─ check /readyz next
1. │── /healthz times out ─────── process is down or unreachable
   │                              ─ check the container / supervisor
   │                              ─ check the reverse proxy upstream
   └── /healthz returns 5xx ───── application crashed mid-request
                                  ─ check stderr / container logs
                                  ─ see troubleshooting.md § 1

   ┌── /readyz returns 200 ────── dependencies healthy
   │                              ─ problem is in a specific run/spec
   │                              ─ see Common production incidents below
2. │── /readyz returns 503 ────── read the `metadata.checks` payload
   │                              ─ use the matrix below to find the
   │                                failing dependency, then jump to
   │                                the matching troubleshooting § n
   └── /readyz times out ──────── same path as /healthz timeout above

/healthz and /readyz are unauthenticated by design — you can probe them from anywhere you can reach the service port.

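# -f: curl prints no body on 4xx/5xx, so empty output here means /healthz is failing
curl -sf http://<host>:<port>/healthz | jq
# no -f here: with -f, curl discards the 503 body that carries metadata.checks
curl -s http://<host>:<port>/readyz | jq '.metadata.checks'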
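The two-step tree above can be sketched as a small pure function, which is handy when wiring the probes into a script or alert annotation. This is an illustrative sketch, not part of kneo-serv: the function name `triage` and the convention that `None` stands for a timed-out probe are assumptions made for the example.

```python
from typing import Optional


def triage(healthz: Optional[int], readyz: Optional[int]) -> str:
    """Map probe results to the next step in the triage tree.

    Each argument is an HTTP status code, or None if the probe timed out
    (a convention assumed for this sketch).
    """
    timeout_path = (
        "process is down or unreachable: check the container / supervisor "
        "and the reverse proxy upstream"
    )
    # Step 1: /healthz
    if healthz is None:
        return timeout_path
    if 500 <= healthz <= 599:
        return (
            "application crashed mid-request: check stderr / container logs, "
            "see troubleshooting.md § 1"
        )
    if healthz != 200:
        return f"/healthz returned {healthz}: not covered by the triage tree"

    # Step 2: /readyz (only reached when /healthz is 200)
    if readyz is None:
        return timeout_path  # same path as a /healthz timeout
    if readyz == 503:
        return "read metadata.checks and use the /readyz failure matrix"
    if readyz == 200:
        return "dependencies healthy: problem is in a specific run/spec"
    return f"/readyz returned {readyz}: not covered by the triage tree"
```

For example, `triage(200, 503)` points at the failure matrix, while `triage(None, None)` points at the container and the proxy.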

/readyz failure matrix

/readyz runs eight checks (source: kneo_serv/service/routes_health.py). When any check fails, the response is 503 with {"error": "not_ready", "metadata": {"checks": {...}}}. The keys you will see and what each means:

| Check key | What it probes | Common failure | Where to go |
|---|---|---|---|
| api | API wiring sentinel | Never fails on its own | If it does, the manager isn't configured: § 1.3 |
| run_state_store | manager.run_state_store.list_runs(limit=1) succeeds | DB unreachable, schema missing, credentials wrong | § 2.1 / § 2.2 |
| continuation_store | manager.continuation_store.list() succeeds | File path missing, permissions wrong, DB issue | § 2 |
| queue | list_queued_runs(status=…) returns for queued / running / failed | Queue table missing or DB stall | § 2, § 5.1 |
| runtime_registry | Number of registered runtimes (declared via factories) | Empty registry: service started without runtimes | extending.md for runtime registration |
| tool_registry | Number of registered tools | Empty registry: service started without tools | extending.md |
| providers | Secrets named in KNEO_SERV_HEALTH_PROVIDERS resolve | Provider env var missing or empty | § 3.2 |
| mcp | Secrets named in KNEO_SERV_HEALTH_MCP_SECRETS resolve | MCP secret missing | § 3.2 |

The payload includes a per-check error (the exception class) and message for each failed check, so you usually don't have to guess which subsystem is at fault: look the message up in the relevant troubleshooting section.
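To pull the failing checks out of a 503 body and pair each with its troubleshooting pointer, something like the following may help. It is a sketch under stated assumptions: the source only says that failed checks carry `error` and `message`, so the code assumes each entry under `metadata.checks` is an object and treats the presence of a non-empty `error` key as the failure signal. The helper name `failing_checks` and the sample payload are hypothetical; the pointers are copied from the matrix above.

```python
import json
from typing import List, Tuple

# Troubleshooting pointers, copied from the /readyz failure matrix.
WHERE_TO_GO = {
    "api": "§ 1.3",
    "run_state_store": "§ 2.1 / § 2.2",
    "continuation_store": "§ 2",
    "queue": "§ 2, § 5.1",
    "runtime_registry": "extending.md",
    "tool_registry": "extending.md",
    "providers": "§ 3.2",
    "mcp": "§ 3.2",
}


def failing_checks(body: str) -> List[Tuple[str, str, str, str]]:
    """Return (check, error, message, where_to_go) for each failed check."""
    payload = json.loads(body)
    checks = payload.get("metadata", {}).get("checks", {})
    failed = []
    for name, result in checks.items():
        # Assumption: only failed checks carry a non-empty "error" field.
        if isinstance(result, dict) and result.get("error"):
            failed.append(
                (
                    name,
                    str(result["error"]),
                    str(result.get("message", "")),
                    WHERE_TO_GO.get(name, "troubleshooting.md"),
                )
            )
    return failed


# Hypothetical payload, shaped only after the description in this section.
sample = json.dumps(
    {
        "error": "not_ready",
        "metadata": {
            "checks": {
                "api": {},
                "run_state_store": {
                    "error": "OperationalError",
                    "message": "connection refused",
                },
            }
        },
    }
)
for check, error, message, where in failing_checks(sample):
    print(f"{check}: {error}: {message} -> {where}")
```

If the real payload marks passing checks differently, only the failure test inside the loop needs to change.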

Common production incidents

If /healthz and /readyz are both green but the service is "wrong":

| Symptom | First check | Deep dive |
|---|---|---|
| All requests return 401 | Authorization / X-Kneo-Api-Key header is present and valid | § 4.1, § 4.3 |
| Specific consumer returns 403 | The key's role/scope covers the route | § 4.2, security_hardening.md § 2 |
| Async runs stuck in queued | Worker process is up; queue table reachable | § 5.1 |
| Runs hang mid-workflow | The step's tool or provider call is timing out | § 5.3 |
| 409 idempotency_key_conflict | Caller is reusing a key with a different payload | § 5.4 |
| Tool reports MissingSecretError | Provider secret env var is set on the service host | § 3.1 |
| Logs missing request_id | You are reading the right logger (kneo_serv.service), not raw uvicorn | § 7.1, observability.md |
| OpenTelemetry exporter silent | KNEO_SERV_OTEL_ENABLED=true and [telemetry] extra installed | § 7.2, observability.md |
| Human task 409 resource_locked | Another resume is in flight for the same continuation | § 8.1 |
| Restored backup but state looks stale or mismatched | Stop, re-verify the dump source, follow the recovery shape | § 2.5, backup_and_recovery.md |

What to capture before escalating

When the runbook doesn't have an entry that fits, capture this context before paging the on-call developer. It is the same set troubleshooting.md asks for in a bug report:

  • Service version and commit (pip show kneo-serv or the image tag).
  • Environment context (uname -a, Python version, Postgres version).
  • The request_id and run_id of an affected request.
  • GET /readyz body, even when it returns 200.
  • For run-shaped problems: GET /v1/runs/{run_id}, GET /v1/runs/{run_id}/trace, and GET /v1/runs/{run_id}/checkpoints (redacted output is fine to share).
  • For spec-shaped problems: the output of kneo spec validate <path> --json.
  • Recent audit events that mention the affected resource: GET /v1/audit-events?run_id=<id>.
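The HTTP items in this list can be gathered in one pass. The sketch below only builds the URLs to fetch, using the routes named in the bullets above; the helper name `capture_urls` and the `base_url` parameter are assumptions for illustration. Fetching the /v1 routes will presumably need the same Authorization / X-Kneo-Api-Key header as any other authenticated call.

```python
from typing import List, Optional
from urllib.parse import quote, urlencode


def capture_urls(base_url: str, run_id: Optional[str] = None) -> List[str]:
    """List the endpoints worth capturing before escalating.

    /readyz is always included; the run-scoped routes are added only
    when a run_id is known.
    """
    base = base_url.rstrip("/")
    urls = [f"{base}/readyz"]
    if run_id:
        rid = quote(run_id, safe="")  # escape the id for use in a path segment
        query = urlencode({"run_id": run_id})
        urls += [
            f"{base}/v1/runs/{rid}",
            f"{base}/v1/runs/{rid}/trace",
            f"{base}/v1/runs/{rid}/checkpoints",
            f"{base}/v1/audit-events?{query}",
        ]
    return urls


# Hypothetical host, port, and run id.
for url in capture_urls("http://localhost:8000", "r-123"):
    print(url)
```

Save each response body alongside the request_id so the developer you page can line the pieces up.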

When to roll back

If the incident started immediately after a deploy and isn't covered by the matrix above:

  1. Confirm the deploy is the trigger: diff the running version against the previous version, and check that the first 503/5xx lines up in time with the deploy.
  2. If yes, follow backup_and_recovery.md § Rolling back after a failed upgrade.

Rolling back when the trigger isn't the deploy throws away forward progress. Diagnose first.
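The time-correlation part of step 1 can be made mechanical. The following is a rough sketch, not a kneo-serv feature: it assumes you can export (timestamp, status) pairs from your access logs, and the 15-minute window is an arbitrary illustrative default to tune to your deploy cadence.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Event = Tuple[datetime, int]  # (timestamp, HTTP status)


def first_server_error(events: Iterable[Event]) -> Optional[datetime]:
    """Timestamp of the earliest 5xx response, or None if there is none."""
    times = [ts for ts, status in events if 500 <= status <= 599]
    return min(times) if times else None


def deploy_is_likely_trigger(
    deployed_at: datetime,
    events: Iterable[Event],
    window: timedelta = timedelta(minutes=15),  # assumed default
) -> bool:
    """True if the first 5xx falls within `window` after the deploy."""
    first = first_server_error(events)
    if first is None:
        return False
    return deployed_at <= first <= deployed_at + window
```

A `False` result means the errors either predate the deploy or started well after it, which is the case where rolling back is least likely to help.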

What this page does not cover

  • Severity definitions and paging policy. Owned by your on-call rota, not by kneo-serv.
  • Post-incident review template. Out of scope.
  • Failure modes during a kneo-serv release itself — those are in release_checklist.md and troubleshooting.md § 9.