Observability

Operator guide for wiring kneo-serv's structured logs, request tracing, and OpenTelemetry exports into a production observability stack.

This page is the setup view. For symptoms and recovery when observability itself misbehaves, see troubleshooting.md § 7. For the full env-var list, see environment.md § Observability.

Three signals, three surfaces

| Signal | What it is | Where it comes from |
| --- | --- | --- |
| Structured request logs | One redacted JSON record per HTTP request | RequestLoggingMiddleware (on by default) |
| Service-side trace events | Per-run trace and checkpoint records, queryable via the API | TracingMiddleware, exposed at /v1/runs/{run_id}/trace |
| OpenTelemetry spans | Distributed-tracing spans across SDK-driven agent / tool calls and platform-side operations (queue dispatch, worker lease, continuation lock) | kneo_agent.observability.OpenTelemetryMiddleware + kneo_serv.observability.platform_tracer, opt-in via KNEO_SERV_OTEL_ENABLED |

There is no metrics endpoint in the 0.4.x line. Derive latency and error rate from your reverse-proxy access logs or from OTel span attributes.

Structured request logs

Shape

Each request emits a single JSON record on the kneo_serv.service logger:

{
  "client_ip": "10.0.0.7",
  "duration_ms": 18.214,
  "event": "http_request",
  "method": "POST",
  "path": "/v1/runs",
  "request_id": "f3b3…",
  "run_id": "run_…",
  "status_code": 201,
  "user_agent": "kneo-serv-client/0.2.2"
}

Fields by category:

  • Always emitted: event, request_id, method, path, status_code, duration_ms.
  • Emitted when available: client_ip, user_agent.
  • Emitted when the path includes them: run_id, continuation_id.
  • Emitted when the request raises: error (exception class name) and message (exception message).

Redaction is applied to every payload before it reaches the log line. (Source: kneo_serv/observability/structured_logging.py)
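A log consumer can check incoming lines against this shape before indexing them. The following is a minimal sketch rather than part of kneo-serv: the function name parse_request_log is hypothetical, and the only facts it relies on are the field names listed above.

```python
import json
from typing import Optional

# Fields the documentation above lists as always emitted.
REQUIRED_FIELDS = {
    "event", "request_id", "method", "path", "status_code", "duration_ms",
}


def parse_request_log(line: str) -> Optional[dict]:
    """Return the record for an http_request log line, or None for other lines.

    Hypothetical helper for a log consumer; not part of kneo-serv.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # not JSON: some other output sharing stdout
    if not isinstance(record, dict) or record.get("event") != "http_request":
        return None
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"http_request record missing fields: {sorted(missing)}")
    return record
```

Optional fields (client_ip, user_agent, run_id, continuation_id, error, message) are deliberately not validated, since they appear only in some records.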

Configuration

| Variable | Default | Purpose |
| --- | --- | --- |
| KNEO_SERV_REQUEST_LOGS | true | Enable the JSON request log middleware. |
| KNEO_SERV_LOG_LEVEL | INFO | Service logger level. |

request_id is generated server-side as a UUID unless the client sends X-Request-ID; either way, the service echoes it back in the response headers.

Production tuning

  • Keep KNEO_SERV_LOG_LEVEL=INFO in production. DEBUG doubles log volume and can leak diagnostic payloads from middleware that wraps the request logger.
  • Configure log rotation in your container runtime (Docker json-file driver with rotation, Kubernetes kubelet log rotation, journald). The service writes to stdout and relies on the runtime to rotate.

Log aggregation wiring

  • ELK / OpenSearch. Ship stdout via Filebeat or Vector. The records are already JSON; map request_id and run_id as indexed fields. Pin service.name=kneo-serv from the shipper for cross-deployment search.
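  • Loki. A Promtail pipeline with a json stage extracts request_id, run_id, status_code, and duration_ms from each record. Promote only low-cardinality fields such as status_code to labels. Keep label cardinality bounded: don't promote request_id, run_id, or duration_ms to Loki labels; query them as log content instead.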
  • Cloud-managed (CloudWatch Logs, GCP Logging). Forward stdout; the managed pipeline parses JSON automatically.

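The reverse proxy in front of kneo-serv (tls_and_proxy.md) sees the true client IP; the service logs only the immediate TCP peer, which is normally the proxy itself. To correlate the two, have the proxy forward X-Request-ID to kneo-serv and record it in its own access log, then join the proxy and service logs on request_id.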
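The join on request_id can be sketched as follows. This is an illustration under stated assumptions: the proxy access log is assumed to be parsed into dictionaries with hypothetical request_id and remote_addr keys, and attach_true_client_ip is not a kneo-serv function.

```python
def attach_true_client_ip(service_records, proxy_records):
    """Join service log records to proxy access-log records on request_id.

    Assumes (hypothetically) that each proxy record is a dict carrying
    "request_id" and "remote_addr"; adjust to your proxy's log format.
    """
    ip_by_request = {
        p["request_id"]: p["remote_addr"]
        for p in proxy_records
        if "request_id" in p and "remote_addr" in p
    }
    enriched = []
    for record in service_records:
        merged = dict(record)  # copy so the caller's records are not mutated
        # client_ip (the TCP peer seen by the service) is left untouched.
        merged["true_client_ip"] = ip_by_request.get(record.get("request_id"))
        enriched.append(merged)
    return enriched
```

Records with no matching proxy entry get true_client_ip set to None, which makes gaps in X-Request-ID forwarding easy to spot.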

Service-side trace events

Service-side trace events are persisted as part of run state and returned at GET /v1/runs/{run_id}/trace. They cover workflow step transitions, tool calls, checkpoints, and audit boundaries. TracingMiddleware emits these events independently of any OTel exporter, so they remain available when OpenTelemetry is disabled.

See service_api.md § Audit events and service_api.md § Replay and checkpoint diff for the contract.

OpenTelemetry spans

When the deployment includes the SDK telemetry support (the [telemetry] or [deploy] extras), set KNEO_SERV_OTEL_ENABLED=true to attach kneo_agent.observability.OpenTelemetryMiddleware. Argument and result capture (KNEO_SERV_OTEL_RECORD_ARGUMENTS, KNEO_SERV_OTEL_RECORD_RESULTS) are off by default because tool inputs and outputs frequently contain user payloads — enable them only after the deployment's data classification has approved payload capture.

See environment.md § Observability for the full env-var reference.

Platform-side spans

The SDK's OpenTelemetryMiddleware covers the agent boundary — runs, tool calls, model calls. The platform also instruments operations that happen outside the agent's execution:

| Span name | Where | Attributes |
| --- | --- | --- |
| kneo.queue.dispatch | PlatformManager.dispatch_run, when a run is enqueued for an async worker | kneo.run.id |
| kneo.worker.lease | Async worker loop, one span per lease attempt against the queue | kneo.worker.id, kneo.worker.lease_seconds, kneo.worker.claimed (bool), kneo.run.id (if claimed) |
| kneo.continuation.lock | PlatformManager.resume_human_task, when the per-continuation lock is acquired before resume | kneo.continuation.id, kneo.lock.name, kneo.lock.ttl_seconds, kneo.lock.acquired (bool) |

These spans share the same KNEO_SERV_OTEL_ENABLED flag — they're a clean no-op when telemetry is off (no overhead beyond a single env-var check). Span names use the kneo.<area>.<operation> convention so they sort cleanly alongside SDK-owned spans in tracing UIs.

Lease spans with kneo.worker.claimed=false indicate an empty queue — useful for measuring how often workers idle. Continuation lock spans with kneo.lock.acquired=false correlate with the LockAcquisitionError shown in troubleshooting.md § 8.1.
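As an illustration of the idle measurement, the sketch below computes the share of lease attempts that claimed nothing. It assumes spans have already been exported and loaded as dictionaries with name and attributes keys; that shape and the lease_idle_ratio helper are hypothetical, while the span and attribute names come from the table above.

```python
def lease_idle_ratio(spans):
    """Fraction of kneo.worker.lease spans that did not claim a run.

    `spans` is an iterable of dicts shaped like
    {"name": ..., "attributes": {...}} (an assumed export format).
    Returns None when there are no lease spans to measure.
    """
    leases = [s for s in spans if s.get("name") == "kneo.worker.lease"]
    if not leases:
        return None
    idle = sum(
        1
        for s in leases
        if s.get("attributes", {}).get("kneo.worker.claimed") is False
    )
    return idle / len(leases)
```

A ratio close to 1.0 over a long window suggests workers are mostly idle; a ratio close to 0.0 suggests the queue rarely drains.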

Exporter configuration

The service uses the OpenTelemetry global tracer provider; exporters are configured through the standard OTEL_* environment variables that the OTel SDK reads. Example for OTLP/HTTP to any compatible backend (Honeycomb, Grafana Tempo, Grafana Cloud, Datadog, an OTel Collector):

export KNEO_SERV_OTEL_ENABLED=true

export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=$HONEYCOMB_API_KEY"
export OTEL_SERVICE_NAME=kneo-serv
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod"

For a self-hosted OTel Collector running as a sidecar:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc

If OTel does not appear to be exporting, see troubleshooting.md § 7.2.
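A common cause of silent export failure is a protocol and port mismatch: by convention, OTLP over gRPC listens on 4317 and OTLP over HTTP on 4318. The sketch below is a hypothetical pre-flight check, not part of kneo-serv. It assumes the flag is spelled true as in the examples above; other accepted spellings are not documented on this page.

```python
from urllib.parse import urlparse

# Conventional OTLP ports: 4317 for gRPC, 4318 for HTTP.
CONVENTIONAL_PORT = {"grpc": 4317, "http/protobuf": 4318}


def check_otlp_env(env):
    """Return a list of warnings for a mapping of environment variables.

    Hypothetical helper; pass os.environ or a plain dict.
    """
    warnings = []
    # Assumption: "true" is the enabling value, as in the examples above.
    if env.get("KNEO_SERV_OTEL_ENABLED", "").lower() != "true":
        warnings.append("KNEO_SERV_OTEL_ENABLED is not 'true'")
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL")
    if not endpoint:
        warnings.append("OTEL_EXPORTER_OTLP_ENDPOINT is unset")
        return warnings
    port = urlparse(endpoint).port  # None when the URL has no explicit port
    for proto, conventional in CONVENTIONAL_PORT.items():
        if port == conventional and protocol != proto:
            warnings.append(
                f"port {port} is conventionally {proto}, "
                f"but OTEL_EXPORTER_OTLP_PROTOCOL is {protocol!r}"
            )
    return warnings
```

Calling check_otlp_env(os.environ) at deploy time and printing the result gives a quick sanity check; an empty list only means no conventional mismatch was found, not that export is working.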

What to watch in production

A minimal alerting baseline covers the failure modes that page on-call:

| Signal | What it means | Where to read it |
| --- | --- | --- |
| /readyz returns 503 for more than one probe interval | A dependency check is failing | Reverse proxy / load balancer health checks |
| Sustained 5xx rate above baseline | Service-side errors | Proxy access logs; status_code from JSON logs |
| duration_ms p95 climbing above baseline | Latency regression: provider, queue, or DB pressure | JSON log records |
| Queue depth (status=queued) growing without bound | Workers stuck or backpressured | /readyz queue check; curl "$BASE/v1/runs?status=queued&limit=20" |
| Spike in event=http_request records with an error field | Application-level exceptions | JSON log records |

Wire your alerting against these signals from the proxy and the aggregated logs; the service does not push its own alerts.
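For deployments without a log aggregator that computes these figures, the log-derived signals can be approximated from a window of parsed records. The sketch below is an illustration only: summarize_window is a hypothetical helper, and the nearest-rank percentile is one of several reasonable p95 definitions.

```python
def summarize_window(records):
    """Compute the 5xx rate, nearest-rank p95 latency, and exception count.

    `records` is an iterable of parsed JSON log records (dicts).
    Returns None when the window holds no http_request records.
    """
    requests = [r for r in records if r.get("event") == "http_request"]
    if not requests:
        return None
    durations = sorted(r["duration_ms"] for r in requests)
    n = len(durations)
    # Nearest-rank p95: rank = ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * n + 99) // 100
    errors_5xx = sum(1 for r in requests if 500 <= r["status_code"] <= 599)
    with_exception = sum(1 for r in requests if "error" in r)
    return {
        "count": n,
        "p95_duration_ms": durations[rank - 1],
        "rate_5xx": errors_5xx / n,
        "exception_count": with_exception,
    }
```

Comparing each window's output against a stored baseline, rather than a fixed threshold, matches the "above baseline" wording in the table.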

What this page does not cover

  • Per-IP rate limiting and traffic shaping. The reverse proxy's job (tls_and_proxy.md).
  • A Prometheus /metrics endpoint. Not provided in the 0.4.x line. Use OTel spans to derive request rate, error rate, and latency, or scrape the reverse proxy.
  • Tracing internals. For the design of the in-process tracer and checkpoint events, see docs/dev/design.md and docs/dev/implementation_map.md.