Troubleshooting¶
An operator-facing runbook indexed by symptom. Each entry lists the symptom, how to confirm the root cause, and the fix; cross-references point at the authoritative configuration doc when one exists.
If you are responding to a live incident and don't yet have a symptom,
start at incident_response.md — it walks
/healthz → /readyz → the right section here. This page is the
symptom-indexed deep dive.
When you're not sure where to start, check GET /readyz
(§ 1.2) — it exposes the per-dependency status the
service uses internally, and most "service is unhealthy" tickets resolve to
one of its check entries.
1. Service won't start or won't accept traffic¶
1.1 RuntimeError: KNEO service auth is enabled but no API keys are configured¶
The service refuses to start when auth is enabled without keys.
service/auth.py
- Confirm: check the startup log for the message above.
- Fix: set KNEO_SERV_API_KEYS (entries are name:key:role_or_scope[,role_or_scope], semicolon-separated) and/or KNEO_SERV_ADMIN_API_KEY. To run without auth, set KNEO_SERV_AUTH_ENABLED=false (only for local dev).
- Reference: environment.md § Service Auth, production_readiness_review.md § Role Boundary Review.
1.2 /readyz returns 503¶
GET /readyz returns
{"error": "not_ready", "metadata": {"checks": {...}}} when any dependency
check fails. service/routes_health.py
- Confirm: curl -s http://<host>:<port>/readyz | jq. Do not pass -f here: with it, curl discards the body of a 503 response. Each failed per-dependency entry has ok: false plus error and message.
- Fix: the per-check failure matrix (which check maps to which recovery action) lives in incident_response.md § /readyz failure matrix. In summary: store failures are covered by §2; provider/MCP secret failures by §3.
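When triaging a 503 from a script, it can help to pull only the failing entries out of the body. The following is a minimal sketch, assuming metadata.checks is a mapping from check name to an entry with ok, error, and message; the helper name failed_checks and the sample check names are hypothetical.

```python
import json
from typing import Any


def failed_checks(body: dict[str, Any]) -> dict[str, str]:
    """Return {check_name: "error: message"} for every check whose ok is not true.

    Assumes the body shape described above:
    {"error": "not_ready", "metadata": {"checks": {name: {"ok": ..., ...}}}}.
    """
    checks = body.get("metadata", {}).get("checks", {})
    return {
        name: f"{entry.get('error', 'unknown')}: {entry.get('message', '')}"
        for name, entry in checks.items()
        if not entry.get("ok", False)
    }


# Sample body with hypothetical check names; a real body comes from GET /readyz.
sample = json.loads(
    '{"error": "not_ready", "metadata": {"checks": {'
    '"run_store": {"ok": true},'
    '"provider_secrets": {"ok": false, "error": "missing_secret",'
    ' "message": "EXAMPLE_KEY is not set"}}}}'
)
print(failed_checks(sample))
```

Each name in the result can then be looked up in the failure matrix referenced above.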
1.3 RuntimeError: PlatformManager has not been configured¶
The default app factory configures the platform manager automatically. This
error appears when you pass configure_default_manager=False to
create_app() and never call set_platform_manager() before serving.
service/dependencies.py
- Fix: either drop the override, or call kneo_serv.service.dependencies.set_platform_manager(...) before the first request.
1.4 RuntimeError: Invalid KNEO_SERV_API_KEYS entry¶
The format is name:key:role_or_scope[,role_or_scope] per entry, separated
by semicolons. Whitespace inside entries is trimmed; missing colons trigger
this error. service/auth.py
- Fix: re-render KNEO_SERV_API_KEYS. Examples:
  - operator:OP_TOKEN:operator;reviewer:REV_TOKEN:reviewer
  - svc:SVC_TOKEN:runs:write,human:read,human:write
- Reference: environment.md § Service Auth, production_readiness_review.md § Route Scope Matrix.
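A value can be checked before deployment with a small parser. This is a sketch of the documented format only, not the parser in service/auth.py; the function name parse_api_keys is hypothetical, and skipping empty entries (for example after a trailing semicolon) is an assumption.

```python
def parse_api_keys(raw: str) -> list[tuple[str, str, list[str]]]:
    """Parse 'name:key:role_or_scope[,role_or_scope]' entries separated by ';'.

    Hypothetical pre-deployment check; the service's own parser may differ.
    """
    parsed = []
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # assumption: an empty entry is skipped, not an error
        # Split on the first two colons only: scopes such as runs:write
        # contain colons themselves.
        parts = [part.strip() for part in entry.split(":", 2)]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"Invalid KNEO_SERV_API_KEYS entry: {entry!r}")
        name, key, grants = parts
        scopes = [grant.strip() for grant in grants.split(",") if grant.strip()]
        parsed.append((name, key, scopes))
    return parsed


print(parse_api_keys("svc:SVC_TOKEN:runs:write,human:read,human:write"))
```

Limiting the split to two colons is what keeps a scope like runs:write intact; splitting on every colon would break it apart.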
2. Persistence and store failures¶
2.1 PostgreSQL DSN configured but service falls back to SQLite¶
The service uses PostgreSQL only when KNEO_SERV_DATABASE_URL is set and
the [postgres] or [deploy] extra is installed. See
service/factory.py.
- Confirm: in a dev shell, python -c "import psycopg; print(psycopg.__version__)".
- Fix: install kneo-serv[deploy] (the Docker image already does this), or kneo-serv[postgres] if you don't need telemetry.
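The two conditions can be checked together from the service's own Python environment. A rough sketch, assuming the extras install the psycopg package (as the confirm step implies); the helper expected_store_backend is hypothetical and mirrors only the conditions stated above, not the logic in service/factory.py.

```python
import importlib.util
import os


def expected_store_backend(env=None) -> str:
    """Guess which store the two documented conditions would select."""
    env = os.environ if env is None else env
    dsn_set = bool(env.get("KNEO_SERV_DATABASE_URL", "").strip())
    # find_spec returns None when the top-level package is not importable.
    driver_installed = importlib.util.find_spec("psycopg") is not None
    if dsn_set and driver_installed:
        return "postgresql"
    if dsn_set:
        return "sqlite (DSN set, but psycopg is not installed)"
    return "sqlite (KNEO_SERV_DATABASE_URL is not set)"


print(expected_store_backend())
```

Run it with the same interpreter and environment variables as the service process; a different virtualenv can give a misleading answer.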
2.2 psycopg.OperationalError on startup or first request¶
The DSN can't connect. Common causes: wrong host, missing TLS, wrong credentials, database not yet created.
- Confirm: psql "$KNEO_SERV_DATABASE_URL" -c '\dt' from the same network context as the service.
- Fix: correct the DSN, ensure the database exists, and confirm the user has privileges. KNEO_SERV_DATABASE_URL must be a libpq-style URL.
2.3 SQLite database is locked errors¶
Concurrent writes to a single SQLite file can collide. The default service worker is single-threaded per process; this typically appears when running multiple service processes against the same SQLite file.
- Fix: switch to PostgreSQL (set KNEO_SERV_DATABASE_URL). Multi-process SQLite is not a supported deployment topology.
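The collision can be reproduced with the standard library alone. The sketch below uses two connections in one process to stand in for two service processes; the table name and file path are hypothetical and unrelated to the service's schema.

```python
import os
import sqlite3
import tempfile


def second_writer_error(path: str):
    """Open two connections to one file and return the second writer's error."""
    # timeout=0 makes the second writer fail at once instead of waiting;
    # isolation_level=None lets us issue BEGIN explicitly.
    first = sqlite3.connect(path, timeout=0, isolation_level=None)
    second = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER)")
        first.execute("BEGIN IMMEDIATE")  # first writer takes the write lock
        first.execute("INSERT INTO demo VALUES (1)")
        try:
            second.execute("BEGIN IMMEDIATE")  # second writer collides
        except sqlite3.OperationalError as exc:
            return str(exc)
        return None
    finally:
        # Closing rolls back any transaction that is still open.
        first.close()
        second.close()


with tempfile.TemporaryDirectory() as tmp:
    print(second_writer_error(os.path.join(tmp, "demo.sqlite")))
```

A longer busy timeout only postpones the error under sustained write contention, which is why the fix above is a different store rather than a tuning knob.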
2.4 Schema migration appears to have run but old data is missing¶
Migrations are idempotent and version-tracked per store; they don't drop data. If rows look missing after an upgrade, check whether you actually upgraded the same database the service is reading.
- Confirm: compare KNEO_SERV_DATABASE_URL (or the SQLite path) between the upgrade context and the running service. A stale state file at .kneo/kneo_runs.sqlite is a common cause.
2.5 Backup/restore mismatch¶
kneo_serv.maintenance.backup_sqlite_database() produces a file copy, and
restore expects to run on the same SQLite version line as the backup.
Restoring across incompatible SQLite versions can fail.
- Fix: align sqlite3 versions, or migrate to PostgreSQL where backup goes through pg_dump/pg_restore. See staging_release_runbook.md and release_checklist.md for the seeded recovery drill.
3. Secrets, credentials, and provider integration¶
3.1 MissingSecretError on agent run¶
Provider keys, MCP credentials, and runtime settings are resolved through
env-var references in project config; raw values are never stored.
security/secrets.py
- Confirm: kneo config secrets --json lists which references exist and whether each resolves. The endpoint GET /security/credentials exposes the same view (requires credentials:read).
- Fix: export the env var named in the error. Set KNEO_SERV_REQUIRE_PROVIDER_SECRETS=true to fail fast at startup instead of at first run.
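The reference-only model can be illustrated in a few lines. This is a sketch under assumptions, not the code in security/secrets.py: the MissingSecret class stands in for MissingSecretError, and the mapping shape and the variable name EXAMPLE_API_KEY are hypothetical.

```python
import os


class MissingSecret(LookupError):
    """Stand-in for the service's MissingSecretError (hypothetical)."""


def resolve_references(references, env=None):
    """Map each logical name to the value of the env var it references.

    `references` is {logical_name: ENV_VAR_NAME}. Only variable names are
    stored; values are read from the environment at resolution time.
    """
    env = os.environ if env is None else env
    missing = sorted(var for var in references.values() if not env.get(var))
    if missing:
        raise MissingSecret("unset env vars: " + ", ".join(missing))
    return {name: env[var] for name, var in references.items()}


try:
    resolve_references({"provider": "EXAMPLE_API_KEY"}, env={})
except MissingSecret as exc:
    print(exc)
```

Running the same resolution once at startup is the fail-fast behaviour the KNEO_SERV_REQUIRE_PROVIDER_SECRETS setting describes.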
3.2 GET /readyz reports missing provider/MCP secrets¶
Readiness reports the secrets named in KNEO_SERV_HEALTH_PROVIDERS and
KNEO_SERV_HEALTH_MCP_SECRETS. These are operator-curated allowlists, so
expect 503 if you list a secret that isn't actually exported.
- Fix: trim the list to secrets you actually use, or export the missing one.
3.3 kneo spec bundle verify fails¶
Bundle verification requires KNEO_SERV_SPEC_SIGNING_KEY to match the key
used to sign. Bundles signed with a different key (or unsigned) fail
verification.
- Fix: rotate the signing key consistently across signing and verifying hosts. The key is HMAC-only; do not commit it.
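Why a key mismatch always fails can be seen from how HMAC verification works in general. The sketch below assumes HMAC-SHA256 with a hex signature; the actual bundle format and digest used by kneo spec bundle verify are not specified here, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of payload (digest is an assumption)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(payload, key), signature)


bundle = b'{"version": "v1"}'
signature = sign(bundle, b"signing-host-key")
print(verify(bundle, signature, b"signing-host-key"))
print(verify(bundle, signature, b"verifying-host-key"))
```

Because the verifier recomputes the signature from its own key, the signing and verifying hosts must hold byte-identical key material; an unsigned bundle has nothing to compare against and fails the same way.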
4. Authentication and authorization¶
4.1 401 Unauthorized — A valid Kneo service API key is required¶
The route requires auth and the request didn't carry a valid token.
service/auth.py
- Confirm: send Authorization: Bearer <key> or X-Kneo-Api-Key: <key>.
- Fix: use one of the configured keys. The CLI service client reads KNEO_SERV_API_KEY. For multi-environment workflows use CLI profiles (kneo config profile use ...).
4.2 403 Forbidden — Missing required scope: <scope>¶
The token authenticated but the principal does not hold the scope the route
requires. service/auth.py
- Confirm: the scope in the error body tells you exactly what is missing.
- Fix: assign the principal a role that includes the scope, or add the scope explicitly in KNEO_SERV_API_KEYS. See the route ↔ scope matrix in production_readiness_review.md.
- Common gotchas:
  - POST /specs/run requires runs:write, not specs:read.
  - Reviewer cannot create runs or change policies.
  - Service role cannot mutate environment policies (only operator/admin).
4.3 Health endpoints work, all other routes 401¶
/healthz, /livez, and /readyz are intentionally unauthenticated for
load-balancer probes. Everything else is gated by the auth dependency. This
is by design; see
production_readiness_review.md § Route Scope Matrix.
5. Run lifecycle problems¶
5.1 Async runs sit in queued and never progress¶
The platform worker is started by create_default_platform_manager() in
service/factory.py. If a custom
embedding skips manager.start_worker(), queued runs never drain.
- Confirm: GET /runs?status=queued shows queued items; GET /readyz shows the queue dependency as ok; the service log has no "worker" lines.
- Fix: ensure start_worker() is called in the host process after constructing PlatformManager directly, or use the default factory.
5.2 Cancelled run still finishes as succeeded¶
Cancellation is cooperative through CancellationToken and propagates only
at unit-of-work boundaries. A step that completes between the cancel
request and the next checkpoint will record its result, but RunState
remains cancelled — the platform does not overwrite cancelled status with
completed results.
- Confirm: GET /runs/{run_id} should still report status: cancelled even if the last checkpoint shows completion of a step.
- If status shows succeeded after a cancel, file an issue with the run id, the checkpoint timeline (/runs/{run_id}/checkpoints), and the trace (/runs/{run_id}/trace).
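The boundary behaviour described above can be modelled in a few lines. This is an illustrative sketch only: SketchToken and run_steps are hypothetical stand-ins, not the platform's CancellationToken or run loop.

```python
import threading


class SketchToken:
    """Hypothetical stand-in for a cooperative cancellation token."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def cancel(self) -> None:
        self._event.set()

    @property
    def cancelled(self) -> bool:
        return self._event.is_set()


def run_steps(steps, token):
    """Run steps in order, checking the token only between steps."""
    results = []
    for step in steps:
        if token.cancelled:  # unit-of-work boundary
            return "cancelled", results
        results.append(step(token))  # an in-flight step runs to completion
    # A cancel that arrived during the last step still wins over the result.
    return ("cancelled" if token.cancelled else "succeeded"), results


def step_one(token):
    token.cancel()  # simulate a cancel request arriving mid-step
    return "one done"


def step_two(token):
    return "two done"


print(run_steps([step_one, step_two], SketchToken()))
```

The first step's result is recorded even though the cancel arrived while it ran, the second step never starts, and the final status is cancelled. That matches the checkpoint-versus-status difference the confirm step looks for.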
5.3 Run hangs at a workflow step¶
Workflow steps support on_error: retry, max_retries, and
timeout_seconds. If a step has no timeout and the underlying provider/MCP
call blocks, the step blocks too.
- Fix: set step-level timeouts, or set the global defaults
KNEO_SERV_PROVIDER_TIMEOUT_SECONDS/KNEO_SERV_MCP_TIMEOUT_SECONDS.
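What a timeout buys can be shown with plain asyncio. This sketch illustrates the general technique and is not how the service applies timeout_seconds; call_with_timeout and slow_provider_call are hypothetical names.

```python
import asyncio


async def call_with_timeout(make_call, timeout_seconds: float):
    """Await make_call() but give up after timeout_seconds."""
    try:
        result = await asyncio.wait_for(make_call(), timeout=timeout_seconds)
        return "ok", result
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending call at this point.
        return "timeout", None


async def slow_provider_call():
    await asyncio.sleep(5)  # stands in for a provider/MCP call that blocks
    return "response"


print(asyncio.run(call_with_timeout(slow_provider_call, 0.05)))
```

Without the bound, the caller waits for as long as the underlying call does. Note that this only interrupts awaitable calls; a synchronous call that blocks the thread is not cancelled by wait_for.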
5.4 409 Conflict — idempotency_key_conflict¶
Idempotency-Key was reused with a different request body for the same
scope. service/idempotency.py
- Fix: pick a new key for the new request body, or reuse the same body for the original key. Idempotency records hash the canonical JSON of the request payload and replay the original response on match.
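The replay-or-conflict decision can be sketched with an in-memory store. Everything here is an assumption made for illustration: the canonicalization (sorted keys, compact separators), the SHA-256 digest, the status codes, and the response bodies are not taken from service/idempotency.py.

```python
import hashlib
import json


def body_fingerprint(payload) -> str:
    """Hash a canonical JSON rendering so key order does not matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    """In-memory stand-in keyed by (scope, key); hypothetical."""

    def __init__(self):
        self._records = {}

    def handle(self, scope, key, payload, compute):
        fingerprint = body_fingerprint(payload)
        record = self._records.get((scope, key))
        if record is None:
            response = compute(payload)
            self._records[(scope, key)] = (fingerprint, response)
            return 200, response
        stored_fingerprint, stored_response = record
        if stored_fingerprint != fingerprint:
            return 409, {"error": "idempotency_key_conflict"}
        return 200, stored_response  # replay without recomputing


store = IdempotencyStore()
print(store.handle("runs", "k1", {"a": 1, "b": 2}, lambda p: {"run": "first"}))
print(store.handle("runs", "k1", {"b": 2, "a": 1}, lambda p: {"run": "second"}))
print(store.handle("runs", "k1", {"a": 2}, lambda p: {"run": "third"}))
```

The second call replays the first response because the bodies are equal once canonicalized; the third call conflicts because the body differs under the same key and scope.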
5.5 400 Bad Request — invalid_idempotency_key¶
Idempotency-Key headers must be 1–256 characters after trimming.
service/idempotency.py
- Fix: send a non-empty key of at most 256 characters after trimming. UUIDs are sufficient.
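A client can apply the same bound before sending. A minimal sketch, assuming "characters" means Unicode code points; the helper name normalize_idempotency_key is hypothetical.

```python
import uuid


def normalize_idempotency_key(raw: str) -> str:
    """Trim the key and enforce the documented 1 to 256 character bound."""
    key = raw.strip()
    if not 1 <= len(key) <= 256:
        raise ValueError("invalid_idempotency_key")
    return key


# A UUID string is 36 characters, well inside the bound.
print(len(normalize_idempotency_key(str(uuid.uuid4()))))
```

A whitespace-only header fails the check just as an over-long one does, since the length is measured after trimming.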
6. Spec validation and compilation¶
6.1 SpecCompilationError with diagnostics¶
SpecCompiler raises this on either schema or semantic validation failure.
spec/compiler.py
- Confirm: kneo spec validate <path> prints the same diagnostics with location info.
- Fix: address each diagnostic. Common causes:
  - Missing/extra fields in version: v1 shape.
  - References to undefined components/tools.
  - Memory/guardrail policy mis-shape.
- Migrate older specs with kneo spec migrate <path> --output <new>.
6.2 ValueError: Tool '<name>' has no implementation¶
The spec references a tool name that is not registered with the
ToolRegistry. spec/builder.py
- Fix: register the tool (programmatically or via MCP import), or remove the reference. The default service registers example tools; toggle with include_example_tools when constructing PlatformManager directly.
6.3 Inline spec rejected with size error¶
Inline specs and overrides are bounded.
service/limits.py
- Fix: tune the relevant limit (KNEO_SERV_MAX_INLINE_SPEC_BYTES, KNEO_SERV_MAX_OVERRIDES_BYTES, KNEO_SERV_MAX_METADATA_BYTES, KNEO_SERV_MAX_BODY_BYTES) or move the spec to a path on disk. Limits exist to keep the service from deserializing arbitrarily large payloads.
7. Observability¶
7.1 Structured logs missing request_id¶
The structured logging middleware always populates request_id; missing
fields usually mean the log was emitted before RequestLoggingMiddleware
attached, or you're reading raw uvicorn access logs instead of the service
logger.
- Fix: filter by logger name kneo_serv or by JSON log format. Clients can supply X-Request-ID to override the generated id; the service echoes it on the response.
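Mixed log streams can be filtered offline with a short script. This sketch assumes each service log line is a JSON object with a logger field holding the logger name; that field name is an assumption, while request_id is the field discussed above. The function name and sample lines are hypothetical.

```python
import json


def filter_service_logs(lines, logger_prefix="kneo_serv", request_id=None):
    """Yield parsed JSON log records emitted by the service logger."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text lines such as raw uvicorn access logs
        if not isinstance(record, dict):
            continue
        if not str(record.get("logger", "")).startswith(logger_prefix):
            continue
        if request_id is not None and record.get("request_id") != request_id:
            continue
        yield record


sample_lines = [
    'INFO: 127.0.0.1:5000 - "GET /runs HTTP/1.1" 200 OK',
    '{"logger": "kneo_serv.service", "request_id": "req-1", "message": "ok"}',
    '{"logger": "other.lib", "message": "noise"}',
]
for record in filter_service_logs(sample_lines, request_id="req-1"):
    print(record["message"])
```

Supplying your own X-Request-ID on the request makes the request_id filter usable without first looking up the generated id.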
7.2 OpenTelemetry not exporting¶
The SDK OpenTelemetryMiddleware only attaches when both
KNEO_SERV_OTEL_ENABLED=true and the [telemetry] extra is installed.
- Fix: install kneo-serv[deploy] (Docker image) or kneo-serv[telemetry], set KNEO_SERV_OTEL_ENABLED=true, and ensure standard OTel exporter env vars (OTEL_EXPORTER_OTLP_ENDPOINT, etc.) are set.
- Tool arguments and results are not captured by default. Enable with KNEO_SERV_OTEL_RECORD_ARGUMENTS=true and/or KNEO_SERV_OTEL_RECORD_RESULTS=true only after you've confirmed the data classification allows payload capture.
7.3 Trace events missing for a run¶
Service-side trace events live in run metadata and at
/runs/{run_id}/trace. They are emitted by TracingMiddleware and the
in-process Tracer, independent of OTel. Missing events most often mean
the run was never executed (e.g. queued and abandoned) or the spec
disabled tracing.
- Fix: confirm the run reached running/succeeded. If the workflow middleware list omits TracingMiddleware, restore it (the default chain includes it).
8. Human-in-the-loop¶
8.1 LockAcquisitionError on resume¶
A POST /human-tasks/{continuation_id}/resume failed because another
caller currently holds the resume lock for the same continuation.
platform/manager.py
- Confirm: the error body identifies the lock name. The first caller is still in flight.
- Fix: wait for the in-flight resume to complete; do not retry blindly. Use idempotency keys on resume to make retries safe.
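The failure mode is a non-blocking lock per continuation: the second caller is rejected rather than queued. A minimal in-process sketch follows; ResumeLocked stands in for LockAcquisitionError, and the class names and the resume:<id> lock name format are hypothetical, not taken from platform/manager.py.

```python
import threading


class ResumeLocked(RuntimeError):
    """Stand-in for LockAcquisitionError (hypothetical)."""


class ResumeLocks:
    """Tracks which continuations currently have a resume in flight."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._held = set()

    def acquire(self, continuation_id: str) -> None:
        with self._guard:
            if continuation_id in self._held:
                # Fail immediately instead of waiting for the holder.
                raise ResumeLocked(f"resume:{continuation_id}")
            self._held.add(continuation_id)

    def release(self, continuation_id: str) -> None:
        with self._guard:
            self._held.discard(continuation_id)


locks = ResumeLocks()
locks.acquire("c-1")  # first caller is now in flight
try:
    locks.acquire("c-1")  # second caller for the same continuation
except ResumeLocked as exc:
    print(exc)
locks.release("c-1")
```

Rejecting immediately keeps two callers from resuming the same continuation at once, which is why the guidance above is to wait for the first caller rather than retry in a tight loop.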
8.2 Continuation expired or missing¶
If the continuation store was rotated (e.g. .kneo/continuations
recreated, or PostgreSQL row deleted), /human-tasks/{continuation_id}
returns 404.
- Fix: the run cannot be resumed. Start a new run.
9. Release and supply chain¶
For release-flow issues (mypy, pip-audit, build, tag, publish), follow
release_checklist.md and
supply_chain_review.md. The release workflow at
.github/workflows/release.yml emits the
gate that failed in its job summary.
What to capture before opening a bug¶
When a problem isn't covered above:
- Service version and commit (pip show kneo-serv, plus the git commit if installed from source).
- Environment context (uname -a, Python version, Postgres version if used).
- The request_id and run_id from logs.
- GET /readyz body.
- For run problems: GET /runs/{run_id}, GET /runs/{run_id}/trace, and GET /runs/{run_id}/checkpoints (redacted output is fine to share).
- For spec problems: the output of kneo spec validate <path> --json.
Audit events are accessible at GET /audit-events and frequently contain
the operator action that preceded a fault.