Observability
Operator guide for wiring kneo-serv's structured logs, request tracing,
and OpenTelemetry exports into a production observability stack.
This page is the setup view. For symptoms and recovery when observability
itself misbehaves, see
troubleshooting.md § 7. For the
full env-var list, see
environment.md § Observability.
Three signals, three surfaces
| Signal | What it is | Where it comes from |
|---|---|---|
| Structured request logs | One JSON record per HTTP request, redacted | RequestLoggingMiddleware (on by default, controlled by KNEO_SERV_REQUEST_LOGS) |
| Service-side trace events | Per-run trace and checkpoint records, queryable via the API | TracingMiddleware, exposed at /v1/runs/{run_id}/trace |
| OpenTelemetry spans | Distributed-tracing spans across SDK-driven agent / tool calls and platform-side operations (queue dispatch, worker lease, continuation lock) | kneo_agent.observability.OpenTelemetryMiddleware + kneo_serv.observability.platform_tracer, opt-in via KNEO_SERV_OTEL_ENABLED |
There is no metrics endpoint in the 0.4.x line. Derive latency and error rate from your reverse-proxy access logs or from OTel span attributes.
Structured request logs
Shape
Each request emits a single JSON record on the kneo_serv.service logger:
{
  "client_ip": "10.0.0.7",
  "duration_ms": 18.214,
  "event": "http_request",
  "method": "POST",
  "path": "/v1/runs",
  "request_id": "f3b3…",
  "run_id": "run_…",
  "status_code": 201,
  "user_agent": "kneo-serv-client/0.2.2"
}
Fields the middleware always emits: event, request_id, method,
path, status_code, duration_ms. Optional fields when available:
client_ip, user_agent. Route-derived fields when the path includes
them: run_id, continuation_id. When the request raises:
error (exception class name) and message (exception message).
Redaction is applied to every payload before it reaches the log line.
(kneo_serv/observability/structured_logging.py)
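As a minimal sketch of consuming these records downstream, the following parser keeps only lines that decode as JSON, carry event=http_request, and include every always-emitted field listed above. The function name parse_request_log and the sample line are illustrative assumptions, not part of kneo-serv.

```python
import json

# Fields this page says the middleware always emits.
REQUIRED_FIELDS = ("event", "request_id", "method", "path", "status_code", "duration_ms")


def parse_request_log(line):
    """Return the record as a dict if `line` is a complete http_request record, else None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # not JSON, e.g. a plain-text message on the same stream
    if not isinstance(record, dict) or record.get("event") != "http_request":
        return None
    if any(field not in record for field in REQUIRED_FIELDS):
        return None
    return record


# Hypothetical sample line, shaped like the record shown above.
sample = (
    '{"event": "http_request", "request_id": "abc", "method": "POST", '
    '"path": "/v1/runs", "status_code": 201, "duration_ms": 18.214}'
)
record = parse_request_log(sample)
print(record["status_code"], record["duration_ms"])
```

Optional fields such as client_ip, run_id, or error are left untouched in the returned dict, so callers can read them with record.get(...).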
Configuration
| Variable | Default | Purpose |
|---|---|---|
| KNEO_SERV_REQUEST_LOGS | true | Enable the JSON request log middleware. |
| KNEO_SERV_LOG_LEVEL | INFO | Service logger level. |
request_id is generated server-side as a UUID unless the client sends
X-Request-ID; either way the service echoes it back on the response
header.
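A client that wants to choose its own correlation ID can attach the header itself. The sketch below only builds the request and does not send it; the helper name build_request, the base URL, and the empty JSON body are placeholders, since the /v1/runs payload is not described on this page.

```python
import json
import urllib.request
import uuid


def build_request(base_url, path, body, request_id=None):
    """Build (but do not send) a POST that carries an X-Request-ID header."""
    request_id = request_id or str(uuid.uuid4())
    req = urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Request-ID": request_id},
        method="POST",
    )
    return req, request_id


# Placeholder host and body; substitute the deployment's real values.
req, rid = build_request("http://localhost:8000", "/v1/runs", {}, request_id="demo-123")
print(req.full_url, rid)
```

Keeping the returned request_id on the client side lets the caller search the aggregated logs for the same value later.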
Production tuning
- Keep KNEO_SERV_LOG_LEVEL=INFO in production. DEBUG doubles log volume and can leak diagnostic payloads from middleware that wraps the request logger.
- Configure your container runtime's log rotation (Docker json-file driver with rotation, Kubernetes kubelet container-log rotation, journald). The service writes to stdout and relies on the runtime to rotate.
Log aggregation wiring
- ELK / OpenSearch. Ship stdout via Filebeat or Vector. The records are already JSON; map request_id and run_id as indexed fields. Pin service.name=kneo-serv from the shipper for cross-deployment search.
- Loki. A Promtail pipeline with a json stage can extract request_id, run_id, status_code, and duration_ms from each record. Keep label cardinality bounded: promote only low-cardinality fields to Loki labels, and query request_id as log content rather than as a label.
- Cloud-managed (CloudWatch Logs, GCP Logging). Forward stdout; the managed pipeline parses JSON automatically.
The reverse proxy in front of kneo-serv (tls_and_proxy.md)
has the true client IP. The service logs the immediate TCP peer; correlate
to the proxy's access logs by request_id (forward X-Request-ID upstream).
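The correlation described above amounts to a join on request_id. The sketch below assumes the proxy's access logs have already been parsed into dicts that carry request_id and client_ip keys; that shape depends entirely on the proxy's log format and is an assumption here, as is the function name.

```python
def client_ips_by_request_id(proxy_records, service_records):
    """Map request_id -> (IP seen by the proxy, TCP peer seen by the service)."""
    proxy_ip = {
        rec["request_id"]: rec.get("client_ip")
        for rec in proxy_records
        if "request_id" in rec
    }
    joined = {}
    for rec in service_records:
        rid = rec.get("request_id")
        if rid in proxy_ip:
            joined[rid] = (proxy_ip[rid], rec.get("client_ip"))
    return joined


# Hypothetical records: the proxy sees the real client, the service sees the proxy.
proxy = [{"request_id": "r1", "client_ip": "203.0.113.9"}]
service = [
    {"request_id": "r1", "client_ip": "10.0.0.7"},
    {"request_id": "r2", "client_ip": "10.0.0.7"},
]
print(client_ips_by_request_id(proxy, service))
```

Service records without a matching proxy entry (r2 in the sample) are dropped, which makes gaps in header forwarding easy to spot by comparing counts.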
Service-side trace events
Service-side trace events are persisted as part of run state and
returned at GET /v1/runs/{run_id}/trace. They cover workflow step
transitions, tool calls, checkpoints, and audit boundaries. These events
are emitted by TracingMiddleware independent of any OTel exporter, so
they are always available even without OpenTelemetry.
See
service_api.md § Audit events and
service_api.md § Replay and checkpoint diff
for the contract.
OpenTelemetry spans
When the deployment includes the SDK telemetry support (the [telemetry]
or [deploy] extras), set KNEO_SERV_OTEL_ENABLED=true to attach
kneo_agent.observability.OpenTelemetryMiddleware. Argument and result
capture (KNEO_SERV_OTEL_RECORD_ARGUMENTS,
KNEO_SERV_OTEL_RECORD_RESULTS) are off by default because tool inputs
and outputs frequently contain user payloads — enable them only after
the deployment's data classification has approved payload capture.
See environment.md § Observability for
the full env-var reference.
Platform-side spans
The SDK's OpenTelemetryMiddleware covers the agent boundary — runs,
tool calls, model calls. The platform also instruments operations that
happen outside the agent's execution:
| Span name | Where | Attributes |
|---|---|---|
| kneo.queue.dispatch | PlatformManager.dispatch_run, when a run is enqueued for an async worker | kneo.run.id |
| kneo.worker.lease | Async worker loop, one span per lease attempt against the queue | kneo.worker.id, kneo.worker.lease_seconds, kneo.worker.claimed (bool), kneo.run.id (if claimed) |
| kneo.continuation.lock | PlatformManager.resume_human_task, when the per-continuation lock is acquired before resume | kneo.continuation.id, kneo.lock.name, kneo.lock.ttl_seconds, kneo.lock.acquired (bool) |
These spans share the same KNEO_SERV_OTEL_ENABLED flag — they're a
clean no-op when telemetry is off (no overhead beyond a single env-var
check). Span names use the kneo.<area>.<operation> convention so they
sort cleanly alongside SDK-owned spans in tracing UIs.
Lease spans with kneo.worker.claimed=false indicate an empty queue —
useful for measuring how often workers idle. Continuation lock spans
with kneo.lock.acquired=false correlate with the LockAcquisitionError
shown in troubleshooting.md § 8.1.
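One way to turn the lease spans into an idle measure is to compute the share of lease attempts that claimed nothing. This sketch assumes spans have been exported from the tracing backend as dicts with name and attributes keys; that export shape and the function name are hypothetical and will differ by backend.

```python
def lease_idle_ratio(spans):
    """Fraction of kneo.worker.lease spans where no run was claimed."""
    leases = [s for s in spans if s.get("name") == "kneo.worker.lease"]
    if not leases:
        return None  # no lease attempts observed in this window
    idle = sum(
        1
        for s in leases
        if s.get("attributes", {}).get("kneo.worker.claimed") is False
    )
    return idle / len(leases)


# Hypothetical exported spans for one worker over a short window.
spans = [
    {"name": "kneo.worker.lease", "attributes": {"kneo.worker.claimed": False}},
    {"name": "kneo.worker.lease", "attributes": {"kneo.worker.claimed": False}},
    {"name": "kneo.worker.lease", "attributes": {"kneo.worker.claimed": True, "kneo.run.id": "run_1"}},
    {"name": "kneo.queue.dispatch", "attributes": {"kneo.run.id": "run_1"}},
]
print(lease_idle_ratio(spans))
```

A ratio near 1.0 suggests workers are mostly polling an empty queue; a ratio near 0.0 alongside growing queue depth suggests the worker pool is saturated.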
Exporter configuration
The service uses the OpenTelemetry global tracer provider; exporters are
configured with standard OTEL_* environment variables that the OTel
SDK reads. Example for OTLP/HTTP to any compatible backend (Honeycomb,
Grafana Tempo, Tempo Cloud, Datadog, an OTel Collector):
export KNEO_SERV_OTEL_ENABLED=true
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=$HONEYCOMB_API_KEY"
export OTEL_SERVICE_NAME=kneo-serv
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod"
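For a self-hosted OTel Collector running as a sidecar, point OTEL_EXPORTER_OTLP_ENDPOINT at the collector's local OTLP/HTTP receiver (typically http://localhost:4318, the collector's default OTLP/HTTP port) and omit the vendor-specific OTEL_EXPORTER_OTLP_HEADERS value.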
If OTel does not appear to be exporting, see
troubleshooting.md § 7.2.
What to watch in production
A minimal alerting baseline covers the failure modes that page on-call:
| Signal | What it means | Where to read it |
|---|---|---|
| /readyz returns 503 for more than 1 probe interval | A dependency check is failing | Reverse proxy / load balancer health checks |
| Sustained 5xx rate above baseline | Service-side errors | Proxy access logs; status_code from JSON logs |
| duration_ms p95 climbing over baseline | Latency regression (provider, queue, or DB pressure) | JSON log records |
| Queue depth (status=queued) growing unbounded | Workers stuck or backpressured | /readyz queue check; curl "$BASE/v1/runs?status=queued&limit=20" |
| Spike in event=http_request records with error | Application-level exceptions | JSON log records |
Wire your alerting against these signals from the proxy and the aggregated logs; the service does not push its own alerts.
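Two of the signals above, the 5xx rate and the duration_ms p95, can be derived directly from the JSON request logs. The following is a rough sketch using a nearest-rank percentile over a batch of log lines; in practice the log aggregator's own query language would usually do this, and the function name and sample data are illustrative assumptions.

```python
import json
import math


def summarize(lines):
    """Return (5xx rate, p95 duration_ms) over http_request records, or None if there are none."""
    durations, errors = [], 0
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if not isinstance(rec, dict) or rec.get("event") != "http_request":
            continue
        durations.append(float(rec["duration_ms"]))
        if int(rec["status_code"]) >= 500:
            errors += 1
    if not durations:
        return None
    durations.sort()
    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(durations))
    return errors / len(durations), durations[rank - 1]


# Hypothetical batch: ten requests, one of which failed with a 503.
batch = [
    json.dumps({
        "event": "http_request",
        "status_code": 503 if i == 3 else 200,
        "duration_ms": 10.0 * (i + 1),
    })
    for i in range(10)
]
print(summarize(batch))
```

Comparing these two numbers against a rolling baseline, rather than a fixed threshold, matches the "above baseline" wording in the table.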
What this page does not cover
- Per-IP rate limiting and traffic shaping. The reverse proxy's job (tls_and_proxy.md).
- A Prometheus /metrics endpoint. Not provided in the 0.4.x line. Use OTel spans to derive request rate, error rate, and latency, or scrape the reverse proxy.
- Tracing internals. For the design of the in-process tracer and checkpoint events, see docs/dev/design.md and docs/dev/implementation_map.md.