Troubleshooting

An operator-facing runbook indexed by symptom. Each entry lists the symptom, how to confirm the root cause, and the fix; cross-references point at the authoritative configuration doc when one exists.

If you are responding to a live incident and don't yet have a symptom, start at incident_response.md: it walks you from the /healthz and /readyz results to the right section here. This page is the symptom-indexed deep dive.

When you're not sure where to start, check GET /readyz (§ 1.2) — it exposes the per-dependency status the service uses internally, and most "service is unhealthy" tickets resolve to one of its check entries.

1. Service won't start or won't accept traffic

1.1 RuntimeError: KNEO service auth is enabled but no API keys are configured

The service refuses to start when auth is enabled without keys. service/auth.py

  • Confirm: check the startup log for the message above.
  • Fix: set KNEO_SERV_API_KEYS (entries are name:key:role_or_scope[,role_or_scope], semicolon-separated) and/or KNEO_SERV_ADMIN_API_KEY. To run without auth, set KNEO_SERV_AUTH_ENABLED=false (only for local dev).
  • Reference: environment.md § Service Auth, production_readiness_review.md § Role Boundary Review.

1.2 /readyz returns 503

GET /readyz returns {"error": "not_ready", "metadata": {"checks": {...}}} when any dependency check fails. service/routes_health.py

  • Confirm: curl -s http://<host>:<port>/readyz | jq. Leave out curl's -f flag here: it suppresses the response body on a 503, which is the body you need to read. Each failed per-dependency entry has ok: false plus error and message.
  • Fix: the per-check failure matrix (which check maps to which recovery action) lives in incident_response.md § /readyz failure matrix. In summary: store failures are covered by §2; provider/MCP secret failures by §3.
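
To triage a 503 body without reading the JSON by eye, a helper along these lines pulls out only the failed checks. It is a minimal sketch that assumes the shape described above (metadata.checks keyed by check name, each entry carrying ok, error, and message); the check names and error codes in the usage note are hypothetical.

```python
import json


def failed_checks(body: str) -> dict[str, str]:
    """Map each failed /readyz check name to an 'error: message' summary."""
    payload = json.loads(body)
    checks = payload.get("metadata", {}).get("checks", {})
    failures = {}
    for name, entry in checks.items():
        # A check without an explicit ok: true is treated as failed (assumption).
        if not entry.get("ok", False):
            error = entry.get("error", "unknown")
            message = entry.get("message", "")
            failures[name] = f"{error}: {message}"
    return failures
```

Feeding it a saved response body (for example the output of the curl command above) returns a dict with one entry per failing dependency, which you can then look up in the failure matrix.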

1.3 RuntimeError: PlatformManager has not been configured

The default app factory configures the platform manager automatically. This error appears when you pass configure_default_manager=False to create_app() and never call set_platform_manager() before serving. service/dependencies.py

  • Fix: either drop the override, or call kneo_serv.service.dependencies.set_platform_manager(...) before the first request.

1.4 RuntimeError: Invalid KNEO_SERV_API_KEYS entry

The format is name:key:role_or_scope[,role_or_scope] per entry, separated by semicolons. Whitespace inside entries is trimmed; missing colons trigger this error. service/auth.py

  • Fix: re-render KNEO_SERV_API_KEYS. Examples:
  • operator:OP_TOKEN:operator;reviewer:REV_TOKEN:reviewer
  • svc:SVC_TOKEN:runs:write,human:read,human:write
  • Reference: environment.md § Service Auth, production_readiness_review.md § Route Scope Matrix.
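
To sanity-check a rendered value before deploying it, the format can be parsed roughly as follows. This is an illustrative sketch, not the parser in service/auth.py: parse_api_keys is a hypothetical helper, and tolerating a trailing semicolon is an assumption. The one detail worth noting is that scopes such as runs:write contain colons themselves, so only the first two colons of an entry act as field separators.

```python
def parse_api_keys(raw: str) -> list[dict]:
    """Parse 'name:key:role_or_scope[,role_or_scope]' entries separated by ';'."""
    entries = []
    for index, chunk in enumerate(raw.split(";")):
        chunk = chunk.strip()
        if not chunk:
            continue  # assumption: ignore empty chunks such as a trailing ';'
        # Split on the first two colons only; the remainder is the grant list,
        # which may itself contain colons (e.g. 'runs:write').
        parts = [part.strip() for part in chunk.split(":", 2)]
        if len(parts) != 3 or not all(parts):
            # Report the position, not the entry text, so the key is never logged.
            raise ValueError(f"Invalid entry at position {index}: expected name:key:role_or_scope")
        name, key, grants = parts
        entries.append({
            "name": name,
            "key": key,
            "grants": [g.strip() for g in grants.split(",") if g.strip()],
        })
    return entries
```

Under this reading, a key that itself contains a colon would be split in the wrong place, so keep keys colon-free.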

2. Persistence and store failures

2.1 PostgreSQL DSN configured but service falls back to SQLite

The service uses PostgreSQL only when KNEO_SERV_DATABASE_URL is set and the [postgres] or [deploy] extra is installed. See service/factory.py.

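  • Confirm: in a shell that uses the service's Python environment, run python -c "import psycopg; print(psycopg.__version__)". A ModuleNotFoundError means neither extra is installed there.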
  • Fix: install kneo-serv[deploy] (Docker image already does this), or kneo-serv[postgres] if you don't need telemetry.
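
Both preconditions can be checked in one step. The sketch below is an assumption about what to look for, not a copy of the selection logic in service/factory.py: it reports whether KNEO_SERV_DATABASE_URL is set and whether the psycopg driver is importable in the current environment.

```python
import importlib.util
import os


def postgres_preconditions(env=None, driver: str = "psycopg") -> dict[str, bool]:
    """Report the two conditions that must both hold for PostgreSQL to be used."""
    env = os.environ if env is None else env
    return {
        "dsn_set": bool(env.get("KNEO_SERV_DATABASE_URL", "").strip()),
        # find_spec returns None when the top-level module is not installed.
        "driver_importable": importlib.util.find_spec(driver) is not None,
    }
```

If either value is False in the environment where the service runs, the SQLite fallback is the expected outcome.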

2.2 psycopg.OperationalError on startup or first request

The DSN can't connect. Common causes: wrong host, missing TLS, wrong credentials, database not yet created.

  • Confirm: psql "$KNEO_SERV_DATABASE_URL" -c '\dt' from the same network context as the service.
  • Fix: correct the DSN, ensure the database exists, and confirm the user has privileges. KNEO_SERV_DATABASE_URL must be a libpq-style URL.
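
Before reaching for psql, it can help to see which components the URL actually carries. The following sketch uses only the standard library and never returns the password; describe_dsn is a hypothetical helper, and its warnings are hints rather than errors, since libpq falls back to defaults for omitted components.

```python
from urllib.parse import urlsplit


def describe_dsn(dsn: str) -> dict:
    """Break a libpq-style URL into components, omitting the password."""
    parts = urlsplit(dsn)
    warnings = []
    if parts.scheme not in ("postgresql", "postgres"):
        warnings.append(f"unexpected scheme {parts.scheme!r}")
    if not parts.hostname:
        warnings.append("no host given; libpq will use its default")
    database = parts.path.lstrip("/")
    if not database:
        warnings.append("no database name; libpq will use its default")
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "database": database,
        "warnings": warnings,
    }
```

A typo in the host or a missing database name shows up immediately in the returned dict, which narrows down which of the common causes above applies.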

2.3 SQLite database is locked errors

Concurrent writes to a single SQLite file can collide. The default service worker is single-threaded per process, so this error typically appears when multiple service processes run against the same SQLite file.

  • Fix: switch to PostgreSQL (set KNEO_SERV_DATABASE_URL). Multi-process SQLite is not a supported deployment topology.

2.4 Schema migration appears to have run but old data is missing

Migrations are idempotent and version-tracked per store; they don't drop data. If rows look missing after an upgrade, check whether you actually upgraded the same database the service is reading.

  • Confirm: compare KNEO_SERV_DATABASE_URL (or SQLite path) between the upgrade context and the running service. A stale state file at .kneo/kneo_runs.sqlite is a common cause.

2.5 Backup/restore mismatch

kneo_serv.maintenance.backup_sqlite_database() produces a file copy, and restore expects to open that copy with a compatible SQLite version. Restoring across incompatible SQLite versions can fail.

  • Fix: align sqlite3 versions, or migrate to PostgreSQL where backup goes through pg_dump / pg_restore. See staging_release_runbook.md and release_checklist.md for the seeded recovery drill.

3. Secrets, credentials, and provider integration

3.1 MissingSecretError on agent run

Provider keys, MCP credentials, and runtime settings are resolved through env-var references in project config; raw values are never stored. security/secrets.py

  • Confirm: kneo config secrets --json lists which references exist and whether each resolves. The endpoint GET /security/credentials exposes the same view (requires credentials:read).
  • Fix: export the env var named in the error. Set KNEO_SERV_REQUIRE_PROVIDER_SECRETS=true to fail fast at startup instead of at first run.
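
A pre-flight check along these lines can flag unresolved references before a run hits MissingSecretError. It is a minimal sketch: unresolved_references is a hypothetical helper, the variable names in the usage note are placeholders, and treating an empty value as missing is an assumption rather than documented service behavior.

```python
import os


def unresolved_references(names, env=None) -> list[str]:
    """Return the env-var names that are unset or empty, in input order."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

For example, unresolved_references(["EXAMPLE_PROVIDER_API_KEY", "EXAMPLE_MCP_TOKEN"]) returns the subset you still need to export in the service's environment.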

3.2 GET /readyz reports missing provider/MCP secrets

Readiness reports the secrets named in KNEO_SERV_HEALTH_PROVIDERS and KNEO_SERV_HEALTH_MCP_SECRETS. These are operator-curated allowlists, so expect 503 if you list a secret that isn't actually exported.

  • Fix: trim the list to secrets you actually use, or export the missing one.

3.3 kneo spec bundle verify fails

Bundle verification requires KNEO_SERV_SPEC_SIGNING_KEY to match the key used to sign. Bundles signed with a different key (or unsigned) fail verification.

  • Fix: use the same signing key on the signing and verifying hosts, and rotate it on both at the same time. The key is an HMAC secret, so treat it like any other credential and do not commit it.
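
The reason a mismatched key always fails is a property of HMAC itself, which the sketch below demonstrates with the standard library. It assumes HMAC-SHA256 over the raw bundle bytes purely for illustration; the actual digest algorithm and bundle layout used by kneo spec bundle are not stated here.

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the payload (algorithm is an assumption)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, key: bytes) -> bool:
    """Recompute the tag with the verifier's key and compare in constant time."""
    return hmac.compare_digest(sign(payload, key), signature)
```

A tag produced with one key only verifies against the same key and the same bytes, so a host holding a stale key rejects every bundle signed after rotation, and an unsigned bundle has no tag that could match.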

4. Authentication and authorization

4.1 401 Unauthorized — A valid Kneo service API key is required

The route requires auth and the request didn't carry a valid token. service/auth.py

  • Confirm: send Authorization: Bearer <key> or X-Kneo-Api-Key: <key>.
  • Fix: use one of the configured keys. The CLI service client reads KNEO_SERV_API_KEY. For multi-environment workflows use CLI profiles (kneo config profile use ...).
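
When scripting against the service without the CLI, either header form can be attached as follows. This is a sketch using urllib from the standard library; authed_request is a hypothetical helper, and the URL in the usage note is a placeholder for your own host, port, and route.

```python
import urllib.request


def authed_request(url: str, api_key: str, use_bearer: bool = True) -> urllib.request.Request:
    """Build a GET request carrying the key in one of the two accepted headers."""
    if use_bearer:
        headers = {"Authorization": f"Bearer {api_key}"}
    else:
        # urllib normalizes the stored header name's casing; HTTP header
        # names are case-insensitive, so the server sees the same header.
        headers = {"X-Kneo-Api-Key": api_key}
    return urllib.request.Request(url, headers=headers, method="GET")
```

Pass the result to urllib.request.urlopen(), for instance with authed_request("http://localhost:8000/runs", os.environ["KNEO_SERV_API_KEY"]) after importing os. A 401 at that point means the key itself is not among the configured ones.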

4.2 403 Forbidden — Missing required scope: <scope>

The token authenticated but the principal does not hold the scope the route requires. service/auth.py

  • Confirm: the scope in the error body tells you exactly what is missing.
  • Fix: assign the principal a role that includes the scope, or add the scope explicitly in KNEO_SERV_API_KEYS. See the route ↔ scope matrix in production_readiness_review.md.
  • Common gotchas:
  • POST /specs/run requires runs:write, not specs:read.
  • Reviewer cannot create runs or change policies.
  • Service role cannot mutate environment policies (only operator/admin).

4.3 Health endpoints work, all other routes 401

/healthz, /livez, and /readyz are intentionally unauthenticated for load-balancer probes. Everything else is gated by the auth dependency. This is by design; see production_readiness_review.md § Route Scope Matrix.

5. Run lifecycle problems

5.1 Async runs sit in queued and never progress

The platform worker is started by create_default_platform_manager() in service/factory.py. If a custom embedding skips manager.start_worker(), queued runs never drain.

  • Confirm: GET /runs?status=queued shows queued items; GET /readyz shows the queue dependency as ok; the service log has no "worker" lines.
  • Fix: ensure start_worker() is called in the host process after constructing PlatformManager directly, or use the default factory.

5.2 Cancelled run still finishes as succeeded

Cancellation is cooperative through CancellationToken and propagates only at unit-of-work boundaries. A step that completes between the cancel request and the next checkpoint will record its result, but RunState remains cancelled — the platform does not overwrite cancelled status with completed results.

  • Confirm: GET /runs/{run_id} should still report status: cancelled even if the last checkpoint shows completion of a step.
  • If status shows succeeded after a cancel, file an issue with the run id, the checkpoint timeline (/runs/{run_id}/checkpoints), and the trace (/runs/{run_id}/trace).
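
The boundary behavior described above can be reproduced in a few lines. The sketch below is a stand-in built on threading.Event, not the platform's CancellationToken: CancelFlag and run_steps are hypothetical names, and the only point is that the flag is consulted between steps, never during one.

```python
import threading


class CancelFlag:
    """Illustrative stand-in for a cooperative cancellation token."""

    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        self._event.set()

    @property
    def cancelled(self) -> bool:
        return self._event.is_set()


def run_steps(steps, flag: CancelFlag):
    """Run callables in order, checking the flag only at step boundaries."""
    results = []
    for step in steps:
        if flag.cancelled:
            return "cancelled", results
        # A step that is already in flight runs to completion and records its result.
        results.append(step())
    # A cancel that arrived during the last step still wins over completion.
    return ("cancelled" if flag.cancelled else "succeeded"), results
```

If the second of three steps is running when cancel() is called, its result is still recorded, the third step never starts, and the final status is cancelled. That is the same pattern you should see in the checkpoint timeline.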

5.3 Run hangs at a workflow step

Workflow steps support on_error: retry, max_retries, and timeout_seconds. If a step has no timeout and the underlying provider/MCP call blocks, the step blocks too.

  • Fix: set step-level timeouts, or set the global defaults KNEO_SERV_PROVIDER_TIMEOUT_SECONDS / KNEO_SERV_MCP_TIMEOUT_SECONDS.
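
To see why a missing timeout turns a blocked call into a hung step, compare it with a bounded call. The sketch below uses asyncio.wait_for to mimic the effect of timeout_seconds and max_retries; run_step is a hypothetical helper, not the workflow engine's step runner, and it only interrupts awaitable calls (a blocking synchronous call would not be cancelled this way).

```python
import asyncio


async def run_step(call, timeout_seconds: float, max_retries: int = 0):
    """Await call() with a per-attempt timeout, retrying on timeout."""
    attempts = max_retries + 1
    for attempt in range(1, attempts + 1):
        try:
            # wait_for cancels the pending call once the timeout elapses.
            return await asyncio.wait_for(call(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            if attempt == attempts:
                raise  # retries exhausted; surface the timeout to the caller
```

With a timeout, a stalled provider call fails after a bounded time and the error path takes over; without one, the await never returns and the run stays at that step.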

5.4 409 Conflict — idempotency_key_conflict

Idempotency-Key was reused with a different request body for the same scope. service/idempotency.py

  • Fix: pick a new key for the new request body, or reuse the same body for the original key. Idempotency records hash the canonical JSON of the request payload and replay the original response on match.
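
The match-or-conflict rule can be illustrated with an in-memory model. This is a sketch under assumptions, not the code in service/idempotency.py: canonical JSON is approximated with sorted keys and fixed separators, SHA-256 is an assumed hash, and InMemoryIdempotencyStore is a hypothetical class.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""


def payload_fingerprint(payload) -> str:
    # Sorted keys and fixed separators make the hash independent of key order
    # and whitespace in the incoming JSON.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryIdempotencyStore:
    def __init__(self):
        self._records = {}  # (scope, key) -> (fingerprint, response)

    def execute(self, scope, key, payload, handler):
        fingerprint = payload_fingerprint(payload)
        record = self._records.get((scope, key))
        if record is None:
            response = handler(payload)
            self._records[(scope, key)] = (fingerprint, response)
            return response
        stored_fingerprint, stored_response = record
        if stored_fingerprint != fingerprint:
            raise IdempotencyConflict(f"key {key!r} was already used with a different body")
        return stored_response  # replay: the handler is not called again
```

Under this model, resending the same body with its fields in a different order replays the original response, while changing any value under the same key raises the conflict.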

5.5 400 Bad Request — invalid_idempotency_key

Idempotency-Key headers must be 1–256 characters after trimming. service/idempotency.py

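  • Fix: send a non-empty key of at most 256 characters. A key that is empty or whitespace-only after trimming is rejected just like one that is too long. UUIDs are sufficient.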
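
Clients can apply the same rule before sending. The check below is a minimal sketch of the 1–256 character constraint as stated above; normalize_idempotency_key is a hypothetical client-side helper, not the service's validator.

```python
import uuid

MAX_KEY_LENGTH = 256


def normalize_idempotency_key(raw: str) -> str:
    """Trim the key and enforce the 1-256 character bound."""
    key = raw.strip()
    if not 1 <= len(key) <= MAX_KEY_LENGTH:
        raise ValueError(f"Idempotency-Key must be 1-{MAX_KEY_LENGTH} characters after trimming")
    return key


def new_idempotency_key() -> str:
    # A random UUID is 36 characters, well inside the limit.
    return str(uuid.uuid4())
```

Generate the key once per logical request and reuse it across retries of that request; generating a fresh key on every retry defeats the replay protection.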

6. Spec validation and compilation

6.1 SpecCompilationError with diagnostics

SpecCompiler raises this on either schema or semantic validation failure. spec/compiler.py

  • Confirm: kneo spec validate <path> prints the same diagnostics with location info.
  • Fix: address each diagnostic. Common causes:
  • Missing/extra fields in version: v1 shape.
  • References to undefined components/tools.
  • Memory/guardrail policy mis-shape.
  • Migrate older specs with kneo spec migrate <path> --output <new>.

6.2 ValueError: Tool '<name>' has no implementation

The spec references a tool name that is not registered with the ToolRegistry. spec/builder.py

  • Fix: register the tool (programmatically or via MCP import), or remove the reference. The default service registers example tools; toggle with include_example_tools when constructing PlatformManager directly.

6.3 Inline spec rejected with size error

Inline specs and overrides are bounded. service/limits.py

  • Fix: tune the relevant limit (KNEO_SERV_MAX_INLINE_SPEC_BYTES, KNEO_SERV_MAX_OVERRIDES_BYTES, KNEO_SERV_MAX_METADATA_BYTES, KNEO_SERV_MAX_BODY_BYTES) or move the spec to a path on disk. Limits exist to keep the service from deserializing arbitrarily large payloads.
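
To estimate whether a spec will fit before sending it, measure its encoded size on the client. This is a rough sketch: the service may count the raw request bytes rather than a re-serialized form, so treat the number as an approximation, and note that encoded_size and exceeds_limit are hypothetical helpers.

```python
import json


def encoded_size(obj) -> int:
    """Size in bytes of the compact UTF-8 JSON encoding of obj."""
    text = json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def exceeds_limit(obj, limit_bytes: int) -> bool:
    return encoded_size(obj) > limit_bytes
```

Compare the result with the value you have configured for the relevant limit variable; if the spec is close to the bound, moving it to a path on disk is usually the simpler fix.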

7. Observability

7.1 Structured logs missing request_id

The structured logging middleware always populates request_id; missing fields usually mean the log was emitted before RequestLoggingMiddleware attached, or you're reading raw uvicorn access logs instead of the service logger.

  • Fix: filter by logger name kneo_serv or by JSON log format. Clients can supply X-Request-ID to override the generated id; the service echoes it on the response.
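
Filtering a mixed log stream down to one request can be sketched as follows. The field name logger is an assumption about the JSON log layout (request_id is the field named above), and lines_for_request is a hypothetical helper; non-JSON lines such as raw uvicorn access logs are skipped.

```python
import json


def lines_for_request(lines, request_id: str, logger_prefix: str = "kneo_serv") -> list[dict]:
    """Return parsed JSON log records for one request from the service logger."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(record, dict):
            continue
        logger_name = str(record.get("logger", ""))
        if record.get("request_id") == request_id and logger_name.startswith(logger_prefix):
            matches.append(record)
    return matches
```

If this returns nothing for a request you know was served, check whether the lines you are reading come from the service logger at all.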

7.2 OpenTelemetry not exporting

The SDK OpenTelemetryMiddleware only attaches when both KNEO_SERV_OTEL_ENABLED=true and the [telemetry] extra is installed.

  • Fix: install kneo-serv[deploy] (Docker image) or kneo-serv[telemetry], set KNEO_SERV_OTEL_ENABLED=true, and ensure standard OTel exporter env vars (OTEL_EXPORTER_OTLP_ENDPOINT, etc.) are set.
  • Tool arguments and results are not captured by default. Enable with KNEO_SERV_OTEL_RECORD_ARGUMENTS=true and/or KNEO_SERV_OTEL_RECORD_RESULTS=true only after you've confirmed the data classification allows payload capture.

7.3 Trace events missing for a run

Service-side trace events live in run metadata and at /runs/{run_id}/trace. They are emitted by TracingMiddleware and the in-process Tracer, independent of OTel. Missing events most often mean the run was never executed (e.g. queued and abandoned) or the spec disabled tracing.

  • Fix: confirm the run reached running/succeeded. If the workflow middleware list omits TracingMiddleware, restore it (the default chain includes it).

8. Human-in-the-loop

8.1 LockAcquisitionError on resume

A POST /human-tasks/{continuation_id}/resume failed because another caller currently holds the resume lock for the same continuation. platform/manager.py

  • Confirm: the error body identifies the lock name. The first caller is still in flight.
  • Fix: wait for the in-flight resume to complete; do not retry blindly. Use idempotency keys on resume to make retries safe.

8.2 Continuation expired or missing

If the continuation store was rotated (e.g. .kneo/continuations recreated, or PostgreSQL row deleted), /human-tasks/{continuation_id} returns 404.

  • Fix: the run cannot be resumed. Start a new run.

9. Release and supply chain

For release-flow issues (mypy, pip-audit, build, tag, publish), follow release_checklist.md and supply_chain_review.md. The release workflow at .github/workflows/release.yml emits the gate that failed in its job summary.

What to capture before opening a bug

When a problem isn't covered above:

  • Service version and commit (pip show kneo-serv, plus the git commit if installed from source).
  • Environment context (uname -a, Python version, Postgres version if used).
  • The request_id and run_id from logs.
  • GET /readyz body.
  • For run problems: GET /runs/{run_id}, GET /runs/{run_id}/trace, and GET /runs/{run_id}/checkpoints (redacted output is fine to share).
  • For spec problems: the output of kneo spec validate <path> --json.

Audit events are accessible at GET /audit-events and frequently contain the operator action that preceded a fault.