Upgrade guide

Conventions for upgrading Kneo Agent Platform (kneo-serv) between releases, plus version-specific notes when a release has breaking changes.

For the release process itself (gates, tagging, artifacts), see release_checklist.md. For the supported kneo_agent SDK range, see sdk_alignment.md.

Versioning

Kneo Agent Platform follows semantic versioning:

  • Patch (0.1.0 → 0.1.1): bug fixes; persistence schemas, route contracts, CLI commands, and env-var names do not change.
  • Minor (0.1.x → 0.2.0): additive changes. Persistence schemas may add new tables or columns with migrations; routes and CLI may add new surfaces. Existing surfaces remain available with the same shape unless the release notes call out an exception.
  • Major (0.x → 1.0): may remove or change surfaces. Read the release notes before upgrading; expect to update calling code.

The HTTP API is also versioned at the URL prefix (/v1); legacy unversioned routes remain available alongside /v1. See design.md § 13.
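
As a rough illustration of the versioning rules above, the following sketch classifies an upgrade so an operator script can decide which release notes to read. The helper names parse_version and classify_upgrade are hypothetical and not part of kneo-serv; the sketch assumes plain X.Y.Z version strings with no pre-release suffix.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse a plain 'X.Y.Z' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def classify_upgrade(current: str, target: str) -> str:
    """Return 'major', 'minor', 'patch', 'none', or 'downgrade'."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt < cur:
        return "downgrade"  # not supported; restore from backup instead
    if tgt == cur:
        return "none"
    if tgt[0] != cur[0]:
        return "major"  # surfaces may be removed or changed
    if tgt[1] != cur[1]:
        return "minor"  # additive changes, possibly with migrations
    return "patch"  # bug fixes only


if __name__ == "__main__":
    for pair in [("0.1.0", "0.1.1"), ("0.1.1", "0.2.0"), ("0.4.0", "1.0.0")]:
        print(pair, classify_upgrade(*pair))
```

A "minor" or "major" result means reading the notes for every intermediate minor/major release; a "patch" result needs only the latest patch's notes.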

Standard upgrade procedure

  1. Read the release notes for every minor/major version between your current and target version. Patch upgrades only need the latest patch's notes. Notes for the current release are at release_notes_0.1.0.md.
  2. Pin the target version in your dependency manifest:
    kneo-serv[deploy]==X.Y.Z
    
  3. Stop traffic to the service (or drain via a load balancer). Background runs that are queued will be reclaimed by the worker after restart; in-flight runs that complete during the drain will record normally.
  4. Back up persistence. Follow backup_and_recovery.md (pg_dump for PostgreSQL, backup_sqlite_database() for SQLite). Keep the backup until you have verified the new version through at least one business cycle.
  5. Install the new version in your deployment image or environment.
  6. Restart the service. Migrations apply automatically at startup. Watch the structured log for migration events and any migration_failed errors.
  7. Verify with GET /readyz and the deployment smoke script:
    python scripts/deployment_smoke.py --base-url http://<host>:<port>
    
  8. Resume traffic.

If GET /readyz does not return 200 within a few seconds of restart, see troubleshooting.md § 1.2.
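
The readiness check in step 7 can be scripted. The sketch below polls GET /readyz until it returns 200 or a deadline passes. It is an illustration only: the function names, the 30-second default deadline, and the one-second interval are assumptions, and it does not replace the deployment smoke script.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, or socket timeout


def wait_until_ready(
    base_url: str,
    deadline_seconds: float = 30.0,
    interval_seconds: float = 1.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/readyz until it returns 200 or the deadline passes."""
    url = base_url.rstrip("/") + "/readyz"
    deadline = time.monotonic() + deadline_seconds
    while True:
        if probe(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_seconds)
```

A deployment script might call wait_until_ready("http://<host>:<port>") after the restart and run the smoke script only when it returns True.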

Persistence migrations

Every store that has a schema (SQLiteRunStateStore, PostgresRunStateStore) tracks its schema version and applies forward-only migrations on first connection. Migrations are idempotent and never drop columns or rows on their own. The file-based stores have no schema; they tolerate older record shapes through the row decoder.

If a migration fails, the service refuses to serve requests rather than running on a partially-migrated schema. Fix the underlying cause (usually a permissions or disk-space problem), then restart.

Downgrades are not supported. Restore from backup if you need to revert.
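
To make "forward-only and idempotent" concrete, here is a minimal sketch of the general pattern using SQLite. It is not the code inside SQLiteRunStateStore or PostgresRunStateStore; the table names, the schema_version table, and the MIGRATIONS list are all hypothetical.

```python
import sqlite3

# Hypothetical migrations, ordered by the schema version each one produces.
MIGRATIONS = [
    (1, "CREATE TABLE IF NOT EXISTS runs (id TEXT PRIMARY KEY, payload TEXT NOT NULL)"),
    (2, "CREATE TABLE IF NOT EXISTS audit_events (id INTEGER PRIMARY KEY, run_id TEXT, kind TEXT)"),
]


def current_version(conn: sqlite3.Connection) -> int:
    """Return the highest applied schema version, or 0 for a fresh database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn: sqlite3.Connection) -> int:
    """Apply every migration newer than the stored version; never go backwards."""
    applied = current_version(conn)
    for version, statement in MIGRATIONS:
        if version <= applied:
            continue  # already applied, so a second call is a no-op
        # `with conn` commits on success; IF NOT EXISTS keeps a retry
        # after a failed attempt harmless.
        with conn:
            conn.execute(statement)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        applied = version
    return applied


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(migrate(connection))  # applies both migrations
    print(migrate(connection))  # nothing left to apply
```

Because the loop only moves forward and there is no inverse for any step, reverting means restoring a backup taken before the migration ran.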

For contributors authoring new migrations (conventions, the dialect portability rules, the test patterns), see docs/dev/migrations.md.

Spec migrations

The YAML spec format is versioned at version: v1. The compiler accepts older shapes through automatic normalization, but for clarity the CLI can write upgraded specs to disk:

kneo spec migrate legacy_agent.yaml --output migrated_agent.yaml
kneo spec migrate migrated_agent.yaml --check --json

Specs that pass kneo spec validate on the source version will continue to compile after upgrading; specs that hit deprecation warnings should be migrated proactively before a future release removes the fallback.

Signed bundles created with kneo spec bundle sign are tied to the signing key, not the kneo-serv version, so bundles signed before an upgrade continue to verify after as long as the signing key is unchanged.

SDK compatibility

kneo-serv declares a kneo-agent range in pyproject.toml. When upgrading kneo-serv, let pip resolve the matching SDK; do not pin SDK versions outside that range. The compatibility tests (tests/test_sdk_compatibility.py) assert the SDK surface used by the service, so a version mismatch surfaces as a test failure.

If you maintain custom runtimes or middlewares that import directly from kneo_agent, run those compatibility tests after upgrading and update imports in lockstep.

Configuration changes

Environment-variable names and defaults are part of the public surface. Changes are recorded in environment.md and called out in release notes:

  • New variables default to behavior consistent with the previous release.
  • Renamed variables retain a deprecation alias for at least one minor release; a startup warning is emitted when the alias is used.
  • Removed variables are removed only at major versions.

After upgrading, diff your env file against the latest deploy/production.env.example (or staging.env.example) to spot any new optional variables.
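
One way to do that diff is to compare variable names only, so that secrets never need to be printed. The sketch below assumes simple KEY=value env files; the helper names are hypothetical, and commented-out variables in the example file are deliberately ignored.

```python
def env_keys(text: str) -> set[str]:
    """Collect variable names from KEY=value lines, skipping comments and blanks."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        keys.add(line.split("=", 1)[0].strip())
    return keys


def compare_env(current: str, example: str) -> tuple[list[str], list[str]]:
    """Return (names only in the example, names only in the current file)."""
    mine, theirs = env_keys(current), env_keys(example)
    return sorted(theirs - mine), sorted(mine - theirs)


if __name__ == "__main__":
    from pathlib import Path

    # File locations follow the paths named in this guide; adjust as needed.
    missing, extra = compare_env(
        Path("deploy/production.env").read_text(),
        Path("deploy/production.env.example").read_text(),
    )
    print("In example but not in your file:", missing)
    print("In your file but not in example:", extra)
```

Names that appear only in your file are worth checking against environment.md, since they may be renamed variables still working through a deprecation alias.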

CLI changes

The kneo CLI is regenerated each release; see cli_reference.md for the current shape. New subcommands are additive within minor releases. Subcommand behavior may change at major releases — check the release notes.

CLI profiles stored at ~/.kneo_serv/profiles.json carry forward across releases. The profile schema is itself versioned and migrated in place.

Version-specific notes

This section grows as releases ship. Each entry should describe what changed, what action operators must take, and how to verify the upgrade.

0.1.0 — initial release

No upgrade applies; this is the first published version. See release_notes_0.1.0.md for scope, capabilities, and verified release-candidate steps.

0.2.0 — first public distribution

This is the first cut to publish a real kneo-serv package. 0.1.0 and 0.1.1 shipped as GitHub Release artifacts only; 0.2.0 is the first version available via pip install kneo-serv and docker pull ghcr.io/kneo-agent/kneo-serv.

Version trajectory on PyPI: 0.0.0 → 0.2.0. The kneo-serv 0.0.0 placeholder published on 2026-05-14 reserved the distribution name; it shipped an empty importable module with no kneo CLI binary (no [project.scripts] entry). Any user who tried pip install kneo-serv && kneo --version during the placeholder window saw kneo: command not found — 0.2.0 is the first cut to install the binary. The placeholder is yanked once 0.2.0 ships; existing explicit ==0.0.0 pins still resolve, but default pip install kneo-serv jumps straight to 0.2.0.

Install paths:

  • pip install kneo-serv: the first time this works end-to-end.
  • docker pull ghcr.io/kneo-agent/kneo-serv:0.2.0 (also :0.2 and :latest): the first time the image is available without a local build.

Deployment migration for operators on 0.1.x using compose.yaml with the bundled build: context: . block:

  • The default flow becomes docker compose pull && docker compose up -d against the GHCR image.
  • The build: block stays in compose.yaml for contributors and the CI smoke test (docker compose up --build).
  • No changes to deploy/production.env or deploy/staging.env are required from 0.1.1.

Persistence schemas: unchanged from 0.1.1. No migrations required.

Feature additions visible to operators (full per-feature detail in release_notes_0.2.0.md):

  • kneo spec lint: a CI-friendly validator subcommand that exits non-zero on any warnings or errors.
  • Retention windows now live in .kneo/config.yaml under a retention: block, with env vars as the operator override.
  • Human-task expiration via PlatformManager.prune_expired_human_tasks(): paused runs whose human-step deadline has passed transition to a new expired status and emit human.expired audit events.
  • Two new reference example specs: concurrent_review_workflow.yaml and group_chat_workflow.yaml.
  • Docker-based local PostgreSQL integration testing via python scripts/postgres_test.py.

No breaking changes to spec syntax, HTTP API contracts, CLI command names, env-var names, or persistence schemas. Specs that validated under 0.1.1 continue to validate under 0.2.0.

0.2.1 — /healthz version and Docker /app permission fix

Patch release fixing two regressions discovered while smoke-testing the published 0.2.0 image. Both are bug fixes; no new features, no contract changes.

Upgrade:

  • pip install -U kneo-serv (resolves to 0.2.1).
  • docker pull ghcr.io/kneo-agent/kneo-serv:0.2.1; the :0.2 and :latest tags now resolve to the 0.2.1 digest.

What was broken in 0.2.0:

  • GET /healthz returned "version":"0.1.0" from the 0.2.0 image because HealthResponse.version was a hardcoded string literal. 0.2.1 resolves the field dynamically via importlib.metadata.version("kneo-serv").
  • Plain docker run -p 8000:8000 ghcr.io/kneo-agent/kneo-serv:0.2.0 crashed on startup with PermissionError: [Errno 13] Permission denied: '.kneo' because /app was root-owned but the container drops to the non-root kneo user before creating the SQLite-fallback path. 0.2.1 adds chown -R kneo:kneo /app to the install layer. The Docker Compose deployment path was unaffected (it pins KNEO_SERV_DATABASE_URL to PostgreSQL).
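
For readers unfamiliar with the approach, the sketch below shows the general technique of reading a version from installed package metadata instead of a string literal. Only importlib.metadata.version("kneo-serv") comes from the notes above; the resolve_version wrapper and its fallback value are assumptions for illustration, not the service's actual helper.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(distribution: str = "kneo-serv", fallback: str = "0.0.0+unknown") -> str:
    """Return the installed distribution's version, or a fallback if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # Happens when running from a source tree that was never pip-installed.
        return fallback


if __name__ == "__main__":
    print(resolve_version())
```

Reading the metadata at runtime keeps the reported version tied to what pip actually installed, so it cannot drift from the package version the way a literal can.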

Persistence schemas: unchanged from 0.2.0. No migrations required.

No breaking changes to spec syntax, HTTP API contracts, CLI command names, env-var names, or persistence schemas.

0.2.2 — FastAPI info.version fix + post-0.2.0 docs sweep

Patch release fixing one regression in the same family as 0.2.1 plus a documentation sweep. No feature changes, no contract changes, no schema changes.

Upgrade:

  • pip install -U kneo-serv (resolves to 0.2.2).
  • docker pull ghcr.io/kneo-agent/kneo-serv:0.2.2; the :0.2 and :latest tags now resolve to the 0.2.2 digest.

What was broken in 0.2.1:

  • GET /openapi.json returned info.version: "0.1.0" from the 0.2.1 image because the FastAPI app constructor in kneo_serv/service/app.py still pinned a hardcoded literal. The 0.2.1 cut fixed HealthResponse.version but missed this parallel occurrence. 0.2.2 resolves both via the same importlib.metadata.version("kneo-serv") helper, called at app-construction time.

Documentation:

  • Forward-looking plan docs and "as of 0.1.0" framing in user/dev docs were swept to match the 0.2.x shipped reality. No content was lost: historical files (CHANGELOG entries, shipped release notes, TODO-0.2.0.md, ADRs) are unchanged.

Persistence schemas: unchanged from 0.2.1. No migrations required.

0.3.0

Next additive minor on the 0.2.x line. No breaking changes to spec syntax, HTTP API contracts, CLI command names, env-var names, or persistence schemas. Full narrative in release_notes_0.3.0.md.

Upgrade:

  • pip install -U kneo-serv (resolves to 0.3.0).
  • docker pull ghcr.io/kneo-agent/kneo-serv:0.3.0; the :0.3 and :latest tags now resolve to the 0.3.0 digest. The image is now signed (cosign keyless via Sigstore) and ships with a CycloneDX SBOM attestation; verification commands are in supply_chain_review.md § Verification commands.

SDK floor bump:

  • The kneo-agent SDK floor moves from >=1.1.1 to >=1.2.0. Pip auto-resolves on pip install -U kneo-serv, but operators pinning the SDK separately (e.g. via a constraints file or a monorepo lockfile) must ensure their install is on 1.2.0 or newer. The compat test suite passed against kneo-agent 1.2.0 throughout the 0.2.x line; the floor was kept low to avoid forcing 0.1.x users to upgrade. 0.3.0 is the natural inflection point to lift it.

New timed_out lifecycle status:

  • Runs that hit their run-level deadline transition to a new terminal timed_out status (alongside completed, failed, cancelled, expired). Operator tooling that switches on state.status (dashboards, alerting rules, retention sweeps) should accept it as terminal; the platform's own RetentionPolicy.run_statuses already includes it.
  • The error.type field on a timed-out run is run_timed_out, distinct from human_task_expired (which the existing expired status uses).

New runtime surfaces:

  • start_run_from_spec(..., timeout_seconds=N) and run_from_spec(..., timeout_seconds=N) accept an optional wall-clock deadline. Operator-callable PlatformManager.prune_timed_out_runs() walks runs and force-cancels those past their deadline. This follows the same operator-cron pattern as prune_retention() and prune_expired_human_tasks(); there is no built-in scheduler.
  • The human-task on_timeout: continue and on_timeout: escalate literals are now wired in the runtime (they were accepted by the spec but silently treated as fail in 0.2.x). Operators with specs that declared these literals will see the documented behaviour for the first time. Audit consumers should expect new event types: human.continued, human.continue_failed, human.escalated, run.timed_out.
  • New route GET /v1/runs/{run_id}/policy-report returns the spec policy report for a stored run, with no spec bundle required client-side. Auth: specs:read scope (same as the existing POST /v1/specs/policy-report).

New observability surfaces:

  • Three new platform-side OpenTelemetry spans (kneo.queue.dispatch, kneo.worker.lease, kneo.continuation.lock) join the SDK's agent-boundary spans when KNEO_SERV_OTEL_ENABLED=true. Pre-existing OTel pipelines pick them up automatically once telemetry is enabled; no extra configuration is required. See observability.md § Platform-side spans.
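
Operator tooling that branches on run status could keep the terminal set in one place, as in the sketch below. The status and error.type strings are the ones listed above; the function names and alert wording are hypothetical, and how a run record is fetched is left out because it depends on your tooling.

```python
from typing import Optional

# Terminal statuses as of 0.3.0, per the list above.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled", "expired", "timed_out"})


def is_terminal(status: str) -> bool:
    """True when a run in this status will not make further progress."""
    return status in TERMINAL_STATUSES


def alert_reason(status: str, error_type: Optional[str]) -> Optional[str]:
    """Map a run's status and error.type to an alert label, or None for no alert."""
    if status == "timed_out" and error_type == "run_timed_out":
        return "run exceeded its run-level deadline"
    if status == "expired" and error_type == "human_task_expired":
        return "human step deadline passed"
    if status == "failed":
        return "run failed"
    return None
```

Keeping the two deadline cases separate lets a dashboard distinguish a run that ran too long from one that waited too long on a person.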

Persistence schemas: unchanged. The new RunState.deadline_at and Checkpoint.iteration fields default to None and 1 respectively in the dataclass, so existing rows round-trip cleanly through the JSON-payload SQLite / PostgreSQL stores.

0.4.0

Next additive minor on the 0.3.x line. No breaking changes to spec syntax, HTTP API contracts, CLI command names, env-var names, or persistence schemas. Specs that validated under 0.3.x continue to validate under 0.4.0. The cut is a docs + tooling release — runtime semantics are identical to 0.3.0. Full narrative in release_notes_0.4.0.md.

Upgrade:

  • pip install -U kneo-serv (resolves to 0.4.0).
  • docker pull ghcr.io/kneo-agent/kneo-serv:0.4.0; the :0.4 and :latest tags now resolve to the 0.4.0 digest. The image continues to be signed (cosign keyless via Sigstore) and ships with a CycloneDX SBOM attestation; the 0.4.0 cut adds a Trivy CVE scan report attached to the GitHub Release. Verification commands are in supply_chain_review.md § Verification commands.

SDK floor: unchanged. The kneo-agent floor stays at >=1.2.0 — same as 0.3.0. No operator action required for operators pinning the SDK separately.

New auto-generated API reference: the docs site at kneo-agent.github.io/kneo-serv/ gains a new top-level API Reference nav section with 17 pages (16 subpackages + sdk), rendered at build time by mkdocstrings from the Python docstrings. Operator surface unchanged — the API ref is a developer lookup surface, not a runtime change. See docs/api/README.md for the index.

Image vulnerability scanning (Trivy): the release pipeline now scans the pushed GHCR image with Trivy under the CVSS≥7 policy (HIGH/CRITICAL findings block the publish step). On every release-tag build, the JSON scan report is attached to the GitHub Release as the trivy-report-<version> artifact, 90-day retention. Deployers can re-run the scan locally with trivy image ghcr.io/kneo-agent/kneo-serv:<tag>; full policy + escape hatch documented in security_hardening.md § Image vulnerability scanning.

Developer-facing changes (no operator surface impact):

  • Ratcheting ruff D-rule gate (D100/D101/D102) now enforced project-wide for kneo_serv/**/*.py. New public classes / methods without docstrings fail CI. Forks adding code should follow the Google docstring convention; the chain-reference files are security/secrets.py and platform/manager.py.
  • Full mypy strict coverage across kneo_serv/. The [[tool.mypy.overrides]] block in pyproject.toml now covers every public module. Forks that subclass or extend public types should expect disallow_untyped_defs + warn_return_any + strict_equality.
  • mkdocstrings[python]>=0.27 added to the docs optional-dep block. Operators using pip install kneo-serv (without [docs]) are unaffected; the dep is build-time only for the rendered site.

New 0.3.0-feature worked examples:

  • examples.md picked up a Timeout branches subsection on the human_approval_workflow.yaml entry covering the on_timeout: fail/continue/escalate literals (all wired since 0.3.0).
  • New examples/run_with_timeout.py walks through start_run_from_spec(..., timeout_seconds=N) + prune_timed_out_runs(). It is a companion to the human-task timeout example above.

Persistence schemas: unchanged. No new fields, no migrations.

Rolling back

Schema-forward migrations make in-place downgrade unsafe; the only supported rollback path is restore from the pre-upgrade backup, then re-install the previous version.

For the full step-by-step procedure — stop, restore, re-install, restart, verify with the deployment smoke — see backup_and_recovery.md § Rolling back after a failed upgrade.

Keep the pre-upgrade backup until you have verified the new version through at least one business cycle.

Reporting upgrade issues

Capture the same context listed in troubleshooting.md § What to capture before opening a bug, plus:

  • Source version (pip show kneo-serv before the upgrade).
  • Target version (after the upgrade).
  • Migration log lines from the first start on the new version.
  • The exact env file or compose .env (with secrets redacted).
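
Redacting an env file by hand is easy to get wrong, so a small helper can do a first pass. The sketch below masks values whose variable names look sensitive. The name patterns are a heuristic assumption, not a list maintained by the project, so review the output before attaching it to a report.

```python
import re

# Heuristic: variable names containing these words are treated as sensitive.
SECRET_HINTS = re.compile(r"SECRET|TOKEN|PASSWORD|KEY|DATABASE_URL", re.IGNORECASE)


def redact_env(text: str) -> str:
    """Return the env file text with sensitive-looking values replaced."""
    redacted = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            name, _, value = line.partition("=")
            if value and SECRET_HINTS.search(name):
                line = f"{name}=<redacted>"
        redacted.append(line)
    return "\n".join(redacted)


if __name__ == "__main__":
    import sys

    # Usage: python redact_env.py < deploy/production.env
    print(redact_env(sys.stdin.read()))
```

Database URLs are masked whole because they commonly embed a password; non-sensitive settings such as KNEO_SERV_OTEL_ENABLED pass through unchanged.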