Security hardening

Pre-launch checklist for taking a kneo-serv deployment to production. Each item references the authoritative configuration doc; this page is the single sheet to walk before going live.

For the auth model itself (roles, scopes, route mapping), see service_api.md § Authentication. For audit-event details, see service_api.md § Audit events.

Pre-launch checklist

1. Enable authentication

  • [ ] KNEO_SERV_AUTH_ENABLED=true is set (or KNEO_SERV_API_KEYS / KNEO_SERV_ADMIN_API_KEY are set, which enables auth implicitly).
  • [ ] No API keys are committed to the repo, the Compose .env file, or example configs.
  • [ ] Each consumer has its own key, named so audit events identify the caller (KNEO_SERV_API_KEYS='ci:…:service;analyst:…:viewer').
  • [ ] The admin key (KNEO_SERV_ADMIN_API_KEY) is issued separately and used only for break-glass operations.
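
A malformed KNEO_SERV_API_KEYS value is easier to catch before startup than after. The sketch below is a structural check only, and it assumes the entry shape suggested by the example above (name:key:role-or-scopes, entries separated by semicolons); the function name validate_api_keys is hypothetical and not part of kneo-serv.

```shell
# Sketch: structural check for KNEO_SERV_API_KEYS.
# Assumed entry shape (inferred from the example above): name:key:role-or-scopes,
# with entries separated by ';'. The third field may itself contain ':'.
validate_api_keys() (
  IFS=';'
  set -f                      # no glob expansion while splitting
  set -- $1                   # one positional parameter per entry
  idx=0
  bad=0
  for entry in "$@"; do
    idx=$((idx + 1))
    name=${entry%%:*}         # text before the first ':'
    rest=${entry#*:}          # text after the first ':'
    key=${rest%%:*}           # second field
    role=${rest#*:}           # everything after the second ':'
    case $entry in
      *:*:*) ;;
      *) name= ;;             # fewer than three fields
    esac
    if [ -z "$name" ] || [ -z "$key" ] || [ -z "$role" ]; then
      # Report the position only, never the entry, so keys stay out of logs.
      echo "entry $idx: expected name:key:role" >&2
      bad=1
    fi
  done
  exit "$bad"
)
```

A deploy step could call validate_api_keys "$KNEO_SERV_API_KEYS" and stop on a non-zero status. The check says nothing about whether a key is strong or whether a role name is one the service recognizes.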

2. Assign the narrowest role

Pick the narrowest built-in role that covers each caller's needs. The canonical role-to-scope mapping lives in service_api.md § Authentication; below is the operational guidance for choosing between them.

Role      Use for
admin     Break-glass operator key only
operator  Day-to-day operator console / CI deploy
service   Server-to-server callers that drive runs
reviewer  Human-in-the-loop approvers
viewer    Dashboards, read-only analytics

Custom scopes are allowed in the third field of KNEO_SERV_API_KEYS when no built-in role fits.

  • [ ] No consumer is using admin for routine traffic.
  • [ ] Read-only consumers are on viewer, not operator.
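
To review role assignments without printing any key material, the configured entries can be reduced to name and role pairs. This is a minimal sketch under the same assumed name:key:role entry shape; list_key_roles and admin_entries are hypothetical helper names, and whether admin may appear as a role inside KNEO_SERV_API_KEYS is an assumption to confirm against service_api.md § Authentication.

```shell
# Sketch: print "name role" per entry, never the key itself.
# Assumed entry shape: name:key:role-or-scopes, entries separated by ';'.
list_key_roles() (
  IFS=';'
  set -f
  set -- $1
  for entry in "$@"; do
    name=${entry%%:*}
    rest=${entry#*:}
    role=${rest#*:}
    printf '%s %s\n' "$name" "$role"
  done
)

# Names of entries whose role is admin; the expected output is empty
# for anything that carries routine traffic.
admin_entries() {
  list_key_roles "$1" | awk '$2 == "admin" { print $1 }'
}
```

Running list_key_roles "$KNEO_SERV_API_KEYS" gives a short table to compare against the role guidance above.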

3. Rotate keys without downtime

kneo-serv has no in-place key rotation API in the 0.4.x line. Rotation is a config swap:

  1. Add the new key to KNEO_SERV_API_KEYS alongside the old key (entries are semicolon-separated; reusing the same name for both entries is fine).
  2. Restart the service. Both keys are now valid.
  3. Roll callers over to the new key.
  4. Remove the old entry from KNEO_SERV_API_KEYS.
  5. Restart again.

  • [ ] Rotation procedure is rehearsed in staging before production keys are issued.
  • [ ] Old keys are revoked, not left "in case."
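
The two config edits in the rotation steps (add the new entry, later remove the old one) can be scripted so a rehearsal in staging is repeatable. The sketch below only manipulates the string value; add_key_entry, remove_key_entry, and restart_service are hypothetical names, and the entry shape is the assumed name:key:role form.

```shell
# Step 1: append a new entry next to the existing ones.
add_key_entry() {   # add_key_entry <current-value> <new-entry>
  if [ -z "$1" ]; then
    printf '%s\n' "$2"
  else
    printf '%s;%s\n' "$1" "$2"
  fi
}

# Step 4: drop one exact entry and keep the rest in order.
remove_key_entry() (  # remove_key_entry <current-value> <entry-to-remove>
  target=$2
  out=
  IFS=';'
  set -f
  set -- $1
  for entry in "$@"; do
    if [ "$entry" != "$target" ]; then
      out="${out:+$out;}$entry"
    fi
  done
  printf '%s\n' "$out"
)

# Walkthrough (restart_service stands in for your own deploy step):
#   new="ci:$(openssl rand -hex 32):service"
#   KNEO_SERV_API_KEYS=$(add_key_entry "$KNEO_SERV_API_KEYS" "$new"); restart_service
#   ...roll callers over to the new key...
#   KNEO_SERV_API_KEYS=$(remove_key_entry "$KNEO_SERV_API_KEYS" "$old"); restart_service
```

Matching on the exact entry rather than on the name matters here, because the old and new entries may share a name during the overlap window.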

4. Sign spec bundles for production

For environments that block ad-hoc spec edits:

  • [ ] KNEO_SERV_SPEC_SIGNING_KEY is set in CI and on the service hosts (different value than any API key).
  • [ ] Production deploys use signed bundles only: kneo spec bundle sign … --approved-by <name> --env prod and kneo spec bundle verify <bundle> in the deploy pipeline.
  • [ ] Project config declares environments.prod.policy_enforcement so CLI and API spec flows enforce policy after overlays (project_config.md).

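5. Terminate TLS upstream

kneo-serv has no built-in TLS; terminate it at a reverse proxy in front of the service (tls_and_proxy.md).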

6. Lock down container and host

The bundled Dockerfile already enforces a non-root user (kneo); you do not need to override it. (Dockerfile)

  • [ ] Container runs as the non-root kneo user (default).
  • [ ] Image is pulled from a trusted registry; tags are pinned by digest in production manifests.
  • [ ] CI scans the image for known CVEs; releases are blocked on HIGH/CRITICAL findings (§ Image vulnerability scanning).
  • [ ] Host or Kubernetes drops Linux capabilities the service doesn't need (no NET_ADMIN, no SYS_ADMIN).
  • [ ] Egress is restricted at the network layer to the provider, MCP, and observability endpoints the deployment actually uses.
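
Digest pinning is easy to verify mechanically. As a rough sketch, the hypothetical helper below accepts an image reference only if it ends in @sha256: followed by 64 lowercase hex characters; it checks the form of the reference, not that the digest exists in the registry.

```shell
# Sketch: succeed only for references pinned by a sha256 digest.
is_digest_pinned() {
  case $1 in
    *@sha256:*) digest=${1##*@sha256:} ;;
    *) return 1 ;;                      # tag-only or bare reference
  esac
  [ "${#digest}" -eq 64 ] || return 1   # sha256 is 64 hex characters
  case $digest in
    *[!0-9a-f]*) return 1 ;;            # anything outside lowercase hex
  esac
  return 0
}
```

A manifest lint step could run this over every image reference and fail on the first one that is tag-only.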

The container's filesystem is not read-only — the service writes checkpoints, queue state, and optionally SQLite files. If you need read-only-root, mount writable volumes for .kneo/ (SQLite + continuations) and the artifact paths declared in your spec.
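
For a read-only root, the writable paths have to be mounted explicitly. The sketch below prints a docker run line for review instead of executing it. The volume name kneo-state and the in-container path passed as the second argument are assumptions (the real location of .kneo/ depends on the image's working directory), and --cap-drop ALL is stricter than the checklist requires, so add back only what proves necessary.

```shell
# Sketch: print (do not execute) a hardened `docker run` line for review.
# The volume name kneo-state and the state path are assumptions; substitute
# the real location of .kneo/ and add one mount per artifact path in your spec.
hardened_run_cmd() {  # hardened_run_cmd <image-ref> <state-dir-in-container>
  printf 'docker run %s %s %s %s %s\n' \
    '--read-only' \
    '--cap-drop ALL' \
    '--security-opt no-new-privileges' \
    "--mount type=volume,source=kneo-state,target=$2" \
    "$1"
}
```

The same flags map onto Compose (read_only, cap_drop, security_opt) and onto a Kubernetes securityContext; the printed command is just the shortest way to show them together.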

7. Protect the audit trail

  • [ ] Audit events are persisted in the same backend as run state (PostgreSQL in production); the DB is backed up per backup_and_recovery.md.
  • [ ] audit:read is scoped to a small set of principals (compliance, on-call, incident-response).
  • [ ] Audit-event retention is set deliberately. If KNEO_SERV_RETENTION_RUNS_DAYS is set, runs and their audit events age out together — confirm that aligns with your compliance window before enabling it (environment.md § Retention).
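
The retention check comes down to comparing two numbers. A minimal sketch, assuming that an unset KNEO_SERV_RETENTION_RUNS_DAYS means nothing ages out (confirm this in environment.md § Retention); the helper name and the compliance-window argument are hypothetical.

```shell
# Sketch: succeed if the configured retention covers the compliance window.
# Assumption: an empty/unset retention value means no age-out is configured.
retention_covers_window() {  # <retention-days-or-empty> <compliance-days>
  if [ -z "$1" ]; then
    return 0
  fi
  [ "$1" -ge "$2" ]
}

# Example: require at least 365 days of audit history.
#   retention_covers_window "${KNEO_SERV_RETENTION_RUNS_DAYS:-}" 365 || exit 1
```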

8. Keep redaction in place

kneo-serv redacts secrets, tokens, authorization headers, emails, and SSNs from responses, traces, checkpoints, and CLI JSON output by default (service_api.md § Redaction). The two escape hatches both default to off:

  • [ ] KNEO_SERV_OTEL_RECORD_ARGUMENTS=false (unless tool arguments are classified safe to emit to your trace backend).
  • [ ] KNEO_SERV_OTEL_RECORD_RESULTS=false (same rationale).
  • [ ] Custom tools and middleware do not log user inputs or provider responses without redaction.
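
The two OTEL flags can be asserted at startup or in CI. The sketch below treats unset as off (the documented default) and flags any value other than the literal false for review, since the exact set of values kneo-serv parses as true is not stated here; check_otel_recording_off is a hypothetical name.

```shell
# Sketch: fail if either trace-recording escape hatch is not clearly off.
# Unset counts as off (documented default); anything other than the literal
# "false" is reported, because the accepted truthy spellings are not assumed.
check_otel_recording_off() {
  status=0
  for var in KNEO_SERV_OTEL_RECORD_ARGUMENTS KNEO_SERV_OTEL_RECORD_RESULTS; do
    eval "value=\${$var:-false}"   # var is one of the two fixed names above
    case $value in
      false) ;;
      *)
        echo "$var=$value: recording enabled or value unrecognized" >&2
        status=1
        ;;
    esac
  done
  return "$status"
}
```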

Image vulnerability scanning

The release pipeline scans every published GHCR image for known CVEs using Trivy. The scan runs against the pushed image digest (the same bytes that cosign signed and that the SBOM attestation describes), so the four supply-chain artifacts (image, cosign signature, SBOM attestation, and scan report) all describe the same content.

Locked policy (0.4.0; recorded in plan/TODO-0.4.0.md):

  • Severity threshold: CVSS≥7 (HIGH and CRITICAL findings).
  • Release-tag scans (v<version> / v<version>rcN): blocking. The Trivy step in release.yml runs with --exit-code 1; HIGH/CRITICAL findings fail the step, preventing the Publish build artifact and Publish GitHub release steps from running. The publish is the irreversible step, so the gate fires there.
  • PR-time scans: report-only via .github/workflows/image-scan.yml. The same scanner version and severity threshold run against a locally built image, but with --exit-code 0, so findings surface in the PR's check summary without blocking merges. Dev velocity isn't gated on unfixable, transient base-image CVEs; the release gate catches anything that matters before publish.
  • Scan report retention: the JSON report is attached as a GitHub Actions artifact (trivy-report-<version>) on every release-tag build, retained for 90 days. The deployer can download it for audit.
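
The blocking versus report-only split reduces to choosing the --exit-code value from the ref being built. The sketch below illustrates that policy and is not the contents of release.yml; it assumes <version> is MAJOR.MINOR.PATCH, and scan_exit_code is a hypothetical helper.

```shell
# Sketch: pick the Trivy --exit-code for a given git ref name.
# Release tags (v<version> or v<version>rcN) block; everything else reports.
# Assumption: <version> has the form MAJOR.MINOR.PATCH.
scan_exit_code() {   # scan_exit_code <git-ref-name>
  if printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+(rc[0-9]+)?$'; then
    echo 1
  else
    echo 0
  fi
}

# Usage inside a pipeline step (REF and IMAGE supplied by the CI system):
#   trivy image --severity HIGH,CRITICAL \
#     --exit-code "$(scan_exit_code "$REF")" "$IMAGE"
```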

Operator-side verification

Re-run the scan locally against any published tag:

trivy image \
  --severity HIGH,CRITICAL \
  --ignore-unfixed=false \
  ghcr.io/kneo-agent/kneo-serv:<tag>

Cross-check against the release-time scan output by downloading the trivy-report-<version> artifact from the release workflow run in GitHub Actions (retained for 90 days).

Accepted findings

If an upstream CVE has no fix available, or the deployer's risk tolerance accepts a specific finding (e.g. low exploitability in your network posture), record the acceptance in supply_chain_review.md § Current workspace result using the same shape as the existing pip-audit remediation blocks. The release pipeline does not implement an inline-ignore mechanism — acceptances are deployer policy, not platform policy.

What kneo-serv deliberately does not provide

Operators sometimes go looking for these; document the gap rather than inventing it:

  • No built-in TLS. Terminate at a reverse proxy (tls_and_proxy.md).
  • No per-IP rate limiting. Use the reverse proxy's rate-limit zone.
  • No mTLS to upstream providers. Provider HTTPS calls leave the service host; control with egress firewall rules.
  • No live key rotation API. Keys are configured via env vars; rotate with a config swap and restart (§ 3).
  • No external secret-manager integration. Secrets are injected via environment variables. Use your platform's secret store (Kubernetes Secrets, AWS Secrets Manager, Vault) to populate the env at startup.
  • No SCIM or directory integration. API keys are a flat list in KNEO_SERV_API_KEYS; map them to identities in your audit log aggregator.

These gaps are tracked on the roadmap and are not bugs. See docs/plan/roadmap.md.