Security hardening¶
Pre-launch checklist for taking a kneo-serv deployment to production.
Each item references the authoritative configuration doc; this page is
the single sheet to walk before going live.
For the auth model itself (roles, scopes, route mapping), see
service_api.md § Authentication. For
audit-event details, see
service_api.md § Audit events.
Pre-launch checklist¶
1. Enable authentication¶
- [ ] `KNEO_SERV_AUTH_ENABLED=true` is set (or `KNEO_SERV_API_KEYS` / `KNEO_SERV_ADMIN_API_KEY` are set, which enables auth implicitly).
- [ ] No API keys are committed to the repo, the Compose `.env` file, or example configs.
- [ ] Each consumer has its own key, named so audit events identify the caller (`KNEO_SERV_API_KEYS='ci:…:service;analyst:…:viewer'`).
- [ ] The admin key (`KNEO_SERV_ADMIN_API_KEY`) is issued separately and used only for break-glass operations.
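The "own key per consumer" item can be checked mechanically before launch. The following is a minimal sketch, not kneo-serv's own parser: it assumes the semicolon-separated `name:key:role` entry shape shown in the example above and assumes key values contain no `:` or `;`. The helper names `parse_api_keys` and `consumers_sharing_a_key` are hypothetical.

```python
from collections import defaultdict


def parse_api_keys(raw: str) -> list[tuple[str, str, str]]:
    """Split a KNEO_SERV_API_KEYS-style string into (name, key, role) tuples.

    Assumption: entries look like `name:key:role`. The third field is split
    off last so that it may itself contain colons (e.g. a custom scope).
    """
    entries = []
    for chunk in raw.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        name, key, role = chunk.split(":", 2)
        entries.append((name, key, role))
    return entries


def consumers_sharing_a_key(raw: str) -> list[list[str]]:
    """Return groups of consumer names that reuse one key value.

    Only names are returned, so the result is safe to print in CI logs.
    """
    names_by_key = defaultdict(set)
    for name, key, _role in parse_api_keys(raw):
        names_by_key[key].add(name)
    groups = [sorted(names) for names in names_by_key.values() if len(names) > 1]
    return sorted(groups)
```

An empty list means every consumer name maps to a distinct key. Two entries with the same name and different keys are not flagged.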
2. Assign the narrowest role¶
Pick the narrowest built-in role that covers each caller's needs. The
canonical role-to-scope mapping lives in
service_api.md § Authentication; below
is the operational guidance for choosing between them.
| Role | Use for |
|---|---|
| `admin` | Break-glass operator key only |
| `operator` | Day-to-day operator console / CI deploy |
| `service` | Server-to-server callers that drive runs |
| `reviewer` | Human-in-the-loop approvers |
| `viewer` | Dashboards, read-only analytics |
Custom scopes are allowed in the third field of a `KNEO_SERV_API_KEYS` entry when
no built-in role fits.
- [ ] No consumer is using `admin` for routine traffic.
- [ ] Read-only consumers are on `viewer`, not `operator`.
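The two role checks above can be run against the configured key string. This is a sketch under stated assumptions: entries follow the `name:key:role` shape from section 1, any `admin` entry in `KNEO_SERV_API_KEYS` counts as routine use (the break-glass key lives in `KNEO_SERV_ADMIN_API_KEY`), and the caller supplies the set of consumer names it considers read-only. `role_findings` is a hypothetical helper, not a kneo-serv API.

```python
def role_findings(raw: str, read_only: set[str]) -> list[str]:
    """List role assignments that violate the narrowest-role checklist."""
    findings = []
    for chunk in raw.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The third field may contain colons, so split at most twice.
        name, _key, role = chunk.split(":", 2)
        if role == "admin":
            findings.append(f"{name}: admin role on a routine key")
        elif name in read_only and role != "viewer":
            # Custom scopes in the third field are flagged too; review them by hand.
            findings.append(f"{name}: read-only consumer on '{role}', expected 'viewer'")
    return findings
```

The function reports names and roles only, never key values.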
3. Rotate keys without downtime¶
kneo-serv has no in-place key rotation API in the 0.4.x line.
Rotation is a config swap:
1. Add the new key to `KNEO_SERV_API_KEYS` alongside the old key (semicolon-separated entries; same `name:` is fine).
2. Restart the service. Both keys are now valid.
3. Roll callers over to the new key.
4. Remove the old entry from `KNEO_SERV_API_KEYS`.
5. Restart again.

- [ ] Rotation procedure is rehearsed in staging before production keys are issued.
- [ ] Old keys are revoked, not left "in case."
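The two config edits in the rotation procedure (add the new entry, later remove the old one) can be expressed as pure string operations, which makes them easy to rehearse in staging. This is a minimal sketch assuming the `name:key:role` entry shape from section 1; `add_entry` and `remove_entry` are hypothetical helpers, and the restarts between the edits remain manual.

```python
def add_entry(raw: str, name: str, new_key: str, role: str) -> str:
    """Append a new entry; existing entries (including the old key) are kept."""
    entries = [e for e in raw.split(";") if e.strip()]
    entries.append(f"{name}:{new_key}:{role}")
    return ";".join(entries)


def remove_entry(raw: str, name: str, old_key: str) -> str:
    """Drop the entry for (name, old_key) once callers use the new key."""
    prefix = f"{name}:{old_key}:"
    kept = [
        e for e in raw.split(";")
        if e.strip() and not e.strip().startswith(prefix)
    ]
    return ";".join(kept)
```

For example, starting from `ci:old:service;analyst:k2:viewer`, `add_entry(raw, "ci", "new", "service")` yields a string in which both `ci` keys are present, and `remove_entry(..., "ci", "old")` then leaves only the new one.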
4. Sign spec bundles for production¶
For environments that block ad-hoc spec edits:
- [ ] `KNEO_SERV_SPEC_SIGNING_KEY` is set in CI and on the service hosts (different value than any API key).
- [ ] Production deploys use signed bundles only: `kneo spec bundle sign … --approved-by <name> --env prod` and `kneo spec bundle verify <bundle>` in the deploy pipeline.
- [ ] Project config declares `environments.prod.policy_enforcement` so CLI and API spec flows enforce policy after overlays (project_config.md).
5. Terminate TLS upstream¶
- [ ] A reverse proxy in front of the service terminates TLS; the service bind address is `127.0.0.1` or restricted to a private network. See `tls_and_proxy.md`.
- [ ] `/readyz` is exposed only to the load balancer or probe subnet (see `tls_and_proxy.md` § Health-check endpoints behind the proxy).
- [ ] The proxy enforces a request body size limit ≥ `KNEO_SERV_MAX_BODY_BYTES`.
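The bind-address and body-limit items lend themselves to a small pre-flight check. The sketch below uses Python's standard `ipaddress` module, so "private" means whatever that module classifies as private. Both function names are hypothetical, and how you obtain the bind address and the proxy limit is left to your deployment tooling.

```python
import ipaddress


def bind_address_ok(addr: str) -> bool:
    """True for loopback or private addresses; False for wildcards and public IPs.

    Raises ValueError if `addr` is a hostname rather than an IP literal.
    """
    ip = ipaddress.ip_address(addr)
    if ip.is_unspecified:  # 0.0.0.0 or :: would listen on every interface
        return False
    return ip.is_loopback or ip.is_private


def proxy_limit_ok(proxy_limit_bytes: int, service_max_body_bytes: int) -> bool:
    """Mirror the checklist item: proxy body limit >= KNEO_SERV_MAX_BODY_BYTES."""
    return proxy_limit_bytes >= service_max_body_bytes
```

The explicit wildcard check matters because `ipaddress` treats `0.0.0.0` as part of a reserved private block, which would otherwise pass.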
6. Lock down container and host¶
The bundled Dockerfile already enforces a non-root user (kneo); you
do not need to override it.
(Dockerfile)
- [ ] Container runs as the non-root `kneo` user (default).
- [ ] Image is pulled from a trusted registry; tags are pinned by digest in production manifests.
- [ ] CI scans the image for known CVEs; releases blocked on HIGH/CRITICAL findings (§ Image vulnerability scanning).
- [ ] Host or Kubernetes drops Linux capabilities the service doesn't need (no `NET_ADMIN`, no `SYS_ADMIN`).
- [ ] Egress is restricted at the network layer to the provider, MCP, and observability endpoints the deployment actually uses.
The container's filesystem is not read-only — the service writes
checkpoints, queue state, and optionally SQLite files. If you need
read-only-root, mount writable volumes for .kneo/ (SQLite + continuations)
and the artifact paths declared in your spec.
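The digest-pinning item in the checklist above can be enforced with a lint over the image references in your production manifests. This is a minimal sketch that assumes SHA-256 digests only (OCI permits other algorithms); `is_digest_pinned` is a hypothetical helper, and extracting the references from your manifests is not shown.

```python
import re

# repository[:tag]@sha256:<64 lowercase hex characters>
_DIGEST_REF = re.compile(r"[^\s@]+@sha256:[0-9a-f]{64}")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference carries a sha256 digest, with or without a tag."""
    return _DIGEST_REF.fullmatch(image_ref) is not None
```

A reference such as `ghcr.io/kneo-agent/kneo-serv:<tag>` fails the check because a tag alone can be moved to different bytes after review.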
7. Protect the audit trail¶
- [ ] Audit events are persisted in the same backend as run state (PostgreSQL in production); the DB is backed up per backup_and_recovery.md.
- [ ] `audit:read` is scoped to a small set of principals (compliance, on-call, incident-response).
- [ ] Audit-event retention is set deliberately. If `KNEO_SERV_RETENTION_RUNS_DAYS` is set, runs and their audit events age out together; confirm that aligns with your compliance window before enabling it (environment.md § Retention).
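The retention item reduces to one comparison, which is worth encoding so it cannot be skipped. The sketch below assumes `KNEO_SERV_RETENTION_RUNS_DAYS` holds a whole number of days and that leaving it unset means nothing ages out. Both are assumptions to confirm against environment.md, and the function names are hypothetical.

```python
from typing import Mapping, Optional


def retention_from_env(env: Mapping[str, str]) -> Optional[int]:
    """Read the retention setting; None when it is unset or empty."""
    raw = env.get("KNEO_SERV_RETENTION_RUNS_DAYS", "").strip()
    return int(raw) if raw else None


def retention_covers_compliance(retention_days: Optional[int], compliance_days: int) -> bool:
    """True if audit events outlive the compliance window."""
    if retention_days is None:  # assumed: unset means no age-out
        return True
    return retention_days >= compliance_days
```

In practice you would pass `os.environ` as `env` and fail the deploy when the second function returns `False`.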
8. Keep redaction in place¶
kneo-serv redacts secrets, tokens, authorization headers, emails, and
SSNs from responses, traces, checkpoints, and CLI JSON output by default
(service_api.md § Redaction). The two
escape hatches both default to off:
- [ ] `KNEO_SERV_OTEL_RECORD_ARGUMENTS=false` (unless tool arguments are classified safe to emit to your trace backend).
- [ ] `KNEO_SERV_OTEL_RECORD_RESULTS=false` (same rationale).
- [ ] Custom tools and middleware do not log user inputs or provider responses without redaction.
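For the last item, custom tools that log through Python's standard `logging` module can attach a redacting filter. The sketch below is an illustration only: it is not kneo-serv's built-in redaction, the three regular expressions are deliberately simple assumptions (bearer tokens, email addresses, US SSN-shaped numbers), and `redact` and `RedactingFilter` are hypothetical names.

```python
import logging
import re

# Illustrative patterns only; tune them to the data your tools actually see.
_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
]


def redact(text: str) -> str:
    """Apply every pattern in order and return the scrubbed text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class RedactingFilter(logging.Filter):
    """Scrub the fully formatted message before any handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # arguments are already merged into msg
        return True
```

Attach it with `logging.getLogger("my_tool").addFilter(RedactingFilter())`, where `my_tool` stands for your own logger name.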
Image vulnerability scanning¶
The release pipeline scans every published GHCR image for known CVEs using Trivy. The scan runs against the pushed image digest (the same bytes that cosign signed and that the SBOM attestation describes), so the four supply-chain artifacts (image, cosign signature, SBOM attestation, and scan report) all agree on what they describe.
Locked policy (0.4.0; recorded in plan/TODO-0.4.0.md):
- Severity threshold: CVSS≥7 (HIGH and CRITICAL findings).
- Release-tag scans (`v<version>` / `v<version>rcN`): blocking. The Trivy step in `release.yml` runs with `--exit-code 1`; HIGH/CRITICAL findings fail the step, preventing the `Publish build artifact` and `Publish GitHub release` steps from running. The publish is the irreversible step, so the gate fires there.
- PR-time scans: report-only via `.github/workflows/image-scan.yml`. The same scanner version + severity threshold runs against a locally-built image but with `--exit-code 0`, so findings surface in the PR's check summary without blocking merges. Dev velocity isn't gated on un-fixable transient base-image CVEs; the release gate catches anything that matters before publish.
- Scan report retention: the JSON report is attached as a GitHub Actions artifact (`trivy-report-<version>`) on every release-tag build, retained for 90 days. The deployer can download it for audit.
Operator-side verification¶
Re-run the scan locally against any published tag:
```shell
trivy image \
  --severity HIGH,CRITICAL \
  --ignore-unfixed=false \
  ghcr.io/kneo-agent/kneo-serv:<tag>
```
Cross-check against the release-time scan output by downloading the
`trivy-report-<version>` artifact from the release-tag workflow run in GitHub Actions.
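The cross-check can be scripted by comparing the downloaded report with a local one. The sketch below assumes both files use Trivy's JSON report layout (`Results[].Vulnerabilities[]` with `VulnerabilityID`, `PkgName`, and `Severity`) and that the local scan was re-run with Trivy's `--format json --output <file>` options. The function names are hypothetical.

```python
import json

GATED = {"HIGH", "CRITICAL"}  # the severities the release gate blocks on


def load_report(path: str) -> dict:
    """Read a Trivy JSON report from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def gated_findings(report: dict) -> set[tuple[str, str]]:
    """Collect (vulnerability id, package) pairs at HIGH or CRITICAL severity."""
    found = set()
    for result in report.get("Results") or []:
        # Targets without findings may omit the list or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in GATED:
                found.add((vuln["VulnerabilityID"], vuln.get("PkgName", "")))
    return found


def compare_reports(release: dict, local: dict) -> dict[str, set]:
    """Show what changed between the release-time scan and a local re-scan."""
    rel, loc = gated_findings(release), gated_findings(local)
    return {"new_since_release": loc - rel, "no_longer_reported": rel - loc}
```

A non-empty `new_since_release` set usually means CVEs were published after the release was cut, which the release-time gate could not have seen.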
Accepted findings¶
If an upstream CVE has no fix available, or the deployer's risk
tolerance accepts a specific finding (e.g. low exploitability in
your network posture), record the acceptance in
supply_chain_review.md § Current workspace result
using the same shape as the existing pip-audit remediation
blocks. The release pipeline does not implement an inline-ignore
mechanism — acceptances are deployer policy, not platform policy.
What kneo-serv deliberately does not provide¶
Operators sometimes go looking for these; document the gap rather than inventing it:
- No built-in TLS. Terminate at a reverse proxy (`tls_and_proxy.md`).
- No per-IP rate limiting. Use the reverse proxy's rate-limit zone.
- No mTLS to upstream providers. Provider HTTPS calls leave the service host; control with egress firewall rules.
- No live key rotation API. Keys are configured via env vars; rotate with a config swap and restart (§ 3).
- No external secret-manager integration. Secrets are injected via environment variables. Use your platform's secret store (Kubernetes Secrets, AWS Secrets Manager, Vault) to populate the env at startup.
- No SCIM or directory integration. API keys are flat-file in `KNEO_SERV_API_KEYS`; map them to identities in your audit log aggregator.
These gaps are tracked in the roadmap; they are not bugs. See
docs/plan/roadmap.md.