Upgrade guide¶
Conventions for upgrading Kneo Agent Platform (kneo-serv) between
releases, plus version-specific notes when a release has breaking changes.
For the release process itself (gates, tagging, artifacts), see
release_checklist.md. For the supported
kneo_agent SDK range, see sdk_alignment.md.
Versioning¶
Kneo Agent Platform follows semantic versioning:
- Patch (0.1.0 → 0.1.1): bug fixes; persistence schemas, route contracts, CLI commands, and env-var names do not change.
- Minor (0.1.x → 0.2.0): additive changes. Persistence schemas may add new tables or columns with migrations; routes and CLI may add new surfaces. Existing surfaces remain available with the same shape unless the release notes call out an exception.
- Major (0.x → 1.0): may remove or change surfaces. Read the release notes before upgrading; expect to update calling code.
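When scripting an upgrade, it can help to classify the jump between two versions before deciding which release notes to read. The following is a minimal sketch, not part of kneo-serv: `parse_version` and `classify_upgrade` are hypothetical helpers, and they assume plain three-part `MAJOR.MINOR.PATCH` strings with no pre-release suffix.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers (no pre-release suffixes)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def classify_upgrade(current: str, target: str) -> str:
    """Return 'major', 'minor', 'patch', 'none', or 'downgrade'."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt < cur:
        return "downgrade"  # not supported; restore from backup instead
    if tgt == cur:
        return "none"
    if tgt[0] != cur[0]:
        return "major"
    if tgt[1] != cur[1]:
        return "minor"
    return "patch"
```

A result of `"minor"` or `"major"` means reading the notes for every intermediate release; `"patch"` means only the latest patch's notes apply.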
The HTTP API is also versioned at the URL prefix (/v1); legacy unversioned
routes remain available alongside /v1. See
design.md § 13.
Standard upgrade procedure¶
- Read the release notes for every minor/major version between your current and target version, including the target. Patch upgrades only need the latest patch's notes. Each release has its own notes file, for example release_notes_0.4.0.md for 0.4.0.
- Pin the target version in your dependency manifest.
- Stop traffic to the service (or drain via a load balancer). Background runs that are queued will be reclaimed by the worker after restart; in-flight runs that complete during the drain will record normally.
- Back up persistence. Follow backup_and_recovery.md (pg_dump for PostgreSQL, backup_sqlite_database() for SQLite). Keep the backup until you have verified the new version through at least one business cycle.
- Install the new version in your deployment image or environment.
- Restart the service. Migrations apply automatically at startup. Watch the structured log for migration events and any migration_failed errors.
- Verify with GET /readyz and the deployment smoke script.
- Resume traffic.
If GET /readyz does not return 200 within a few seconds of restart, see
troubleshooting.md § 1.2.
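The readiness check can be automated with a short polling loop. The sketch below uses only the standard library; `http_ready` and `wait_until_ready` are hypothetical helper names, and the attempt count and delay are assumptions rather than documented kneo-serv thresholds.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True when GET `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping between failures."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

For a service listening locally, a call might look like `wait_until_ready(lambda: http_ready("http://localhost:8000/readyz"))`; the host and port here are placeholders for your deployment. Passing the probe as a callable keeps the loop testable without a running service.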
Persistence migrations¶
Every store that has a schema (SQLiteRunStateStore,
PostgresRunStateStore) tracks its schema version and applies
forward-only migrations on first connection. Migrations are
idempotent and never drop columns or rows on their own. The
file-based stores have no schema; they tolerate older record shapes
through the row decoder.
If a migration fails, the service refuses to serve requests rather than running on a partially-migrated schema. Fix the underlying cause (usually a permissions or disk-space problem), then restart.
Downgrades are not supported. Restore from backup if you need to revert.
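To illustrate what "forward-only and idempotent" means in practice, here is a minimal sketch using SQLite's `PRAGMA user_version` as the schema-version counter. It is not the kneo-serv migration code: the `runs` table, its columns, and the `MIGRATIONS` list are hypothetical, and the real stores may track versions differently.

```python
import sqlite3

# Hypothetical migration list: index i upgrades schema version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE IF NOT EXISTS runs (id TEXT PRIMARY KEY, payload TEXT NOT NULL)",
    "ALTER TABLE runs ADD COLUMN created_at TEXT",
]


def apply_migrations(conn: sqlite3.Connection) -> int:
    """Apply pending migrations in order and return the resulting version.

    Expects a connection opened with isolation_level=None so that the
    explicit BEGIN / COMMIT statements below control the transaction.
    """
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        try:
            conn.execute("BEGIN")
            conn.execute(MIGRATIONS[version])
            conn.execute(f"PRAGMA user_version = {version + 1}")
            conn.execute("COMMIT")
        except sqlite3.Error:
            if conn.in_transaction:
                conn.execute("ROLLBACK")
            raise  # refuse to continue on a partially migrated schema
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Running the function a second time is a no-op because the stored version already equals the number of migrations, which is the property that makes restart-after-failure safe.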
For contributors authoring new migrations (conventions, the dialect
portability rules, the test patterns), see
docs/dev/migrations.md.
Spec migrations¶
The YAML spec format is versioned at version: v1. The compiler accepts
older shapes through automatic normalization, but for clarity the CLI can
write upgraded specs to disk:
kneo spec migrate legacy_agent.yaml --output migrated_agent.yaml
kneo spec migrate migrated_agent.yaml --check --json
Specs that pass kneo spec validate on the source version will
continue to compile after upgrading; specs that hit deprecation warnings
should be migrated proactively before a future release removes the
fallback.
Signed bundles created with kneo spec bundle sign are tied to the
signing key, not the kneo-serv version, so bundles signed before an
upgrade continue to verify after as long as the signing key is unchanged.
SDK compatibility¶
kneo-serv declares a kneo-agent range in pyproject.toml. When
upgrading kneo-serv, let pip resolve the matching SDK; do not pin SDK
versions outside that range. The compatibility tests
(tests/test_sdk_compatibility.py) assert the SDK surface used by the
service, so a version mismatch surfaces as a test failure.
If you maintain custom runtimes or middlewares that import directly from
kneo_agent, run those compatibility tests after upgrading and update
imports in lockstep.
Configuration changes¶
Environment-variable names and defaults are part of the public surface. Changes are recorded in environment.md and called out in release notes:
- New variables default to behavior consistent with the previous release.
- Renamed variables retain a deprecation alias for at least one minor release; a startup warning is emitted when the alias is used.
- Removed variables are removed only at major versions.
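The deprecation-alias rule can be pictured with a small lookup helper. This is an illustrative sketch only: `read_setting` is a hypothetical function, and the way kneo-serv actually emits its startup warning is not shown in this guide.

```python
import warnings
from typing import Mapping, Optional


def read_setting(
    env: Mapping[str, str], name: str, deprecated_alias: str
) -> Optional[str]:
    """Prefer `name`; fall back to `deprecated_alias` with a warning."""
    if name in env:
        return env[name]
    if deprecated_alias in env:
        warnings.warn(
            f"{deprecated_alias} is deprecated; use {name} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[deprecated_alias]
    return None
```

When both names are set, the new name wins silently, so renaming a variable in your env file removes the warning without changing behavior.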
After upgrading, diff your env file against the latest
deploy/production.env.example (or
staging.env.example) to spot any new optional variables.
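The env-file diff can be scripted. The sketch below is a hypothetical helper that compares variable names only; it assumes simple `KEY=VALUE` lines and skips comment lines, so variables that appear only as commented-out examples will not be reported.

```python
def env_keys(text: str) -> set[str]:
    """Collect variable names from KEY=VALUE lines, skipping comments and blanks."""
    keys = set()
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_variables(current_env: str, example_env: str) -> list[str]:
    """Names present in the example file but absent from the current one."""
    return sorted(env_keys(example_env) - env_keys(current_env))
```

Pass the contents of your deployed env file as `current_env` and the contents of the matching example file as `example_env`; an empty list means no new variables were introduced.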
CLI changes¶
The kneo CLI is regenerated each release; see
cli_reference.md for the current shape. New
subcommands are additive within minor releases. Subcommand behavior may
change at major releases — check the release notes.
CLI profiles stored at ~/.kneo_serv/profiles.json carry forward across
releases. The profile schema is itself versioned and migrated in place.
Version-specific notes¶
This section grows as releases ship. Each entry should describe what changed, what action operators must take, and how to verify the upgrade.
0.1.0 — initial release¶
No upgrade applies; this is the first published version. See release_notes_0.1.0.md for scope, capabilities, and verified release-candidate steps.
0.2.0 — first public distribution¶
This is the first cut to publish a real kneo-serv package. 0.1.0
and 0.1.1 shipped as GitHub Release artifacts only; 0.2.0 is the first
version available via pip install kneo-serv and docker pull
ghcr.io/kneo-agent/kneo-serv.
Version trajectory on PyPI: 0.0.0 → 0.2.0. The kneo-serv 0.0.0
placeholder published on 2026-05-14 reserved the distribution name; it
shipped an empty importable module with no kneo CLI binary (no
[project.scripts] entry). Any user who tried pip install kneo-serv
&& kneo --version during the placeholder window saw kneo: command
not found — 0.2.0 is the first cut to install the binary. The
placeholder is yanked once 0.2.0 ships; existing explicit ==0.0.0
pins still resolve, but default pip install kneo-serv jumps straight
to 0.2.0.
Install paths:
- pip install kneo-serv — first time this works end-to-end.
- docker pull ghcr.io/kneo-agent/kneo-serv:0.2.0 (and :0.2, and
:latest) — first time the image is available without a local
build.
Deployment migration for operators on 0.1.x who use compose.yaml
with the bundled build block (build: context: .):
- Default flow becomes docker compose pull && docker compose up -d
against the GHCR image.
- The build: block stays in compose.yaml for contributors and the
CI smoke test (docker compose up --build).
- No required changes to deploy/production.env or
deploy/staging.env from 0.1.1.
Persistence schemas: unchanged from 0.1.1. No migrations required.
Feature additions visible to operators (full per-feature detail in
release_notes_0.2.0.md):
- kneo spec lint — CI-friendly validator subcommand that exits
non-zero on any warnings or errors.
- Retention windows now live in .kneo/config.yaml under a
retention: block, with env vars as the operator override.
- Human-task expiration via
PlatformManager.prune_expired_human_tasks() — paused runs whose
human-step deadline has passed transition to a new expired status
and emit human.expired audit events.
- Two new reference example specs:
concurrent_review_workflow.yaml and group_chat_workflow.yaml.
- Docker-based local PostgreSQL integration testing via
python scripts/postgres_test.py.
No breaking changes to spec syntax, HTTP API contracts, CLI command names, env-var names, or persistence schemas. Specs that validated under 0.1.1 continue to validate under 0.2.0.
0.2.1 — /healthz version and Docker /app permission fix¶
Patch release fixing two regressions discovered while smoke-testing the published 0.2.0 image. Both are bug fixes; no new features, no contract changes.
Upgrade:
- pip install -U kneo-serv (resolves to 0.2.1).
- docker pull ghcr.io/kneo-agent/kneo-serv:0.2.1 — :0.2 and :latest
now resolve to the 0.2.1 digest.
What was broken in 0.2.0:
- GET /healthz returned "version":"0.1.0" from the 0.2.0 image
because HealthResponse.version was a hardcoded string literal. 0.2.1
resolves the field dynamically via importlib.metadata.version("kneo-serv").
- Plain docker run -p 8000:8000 ghcr.io/kneo-agent/kneo-serv:0.2.0
crashed on startup with
PermissionError: [Errno 13] Permission denied: '.kneo' because /app
was root-owned but the container drops to the non-root kneo user
before creating the SQLite-fallback path. 0.2.1 adds
chown -R kneo:kneo /app to the install layer. The Docker Compose
deployment path was unaffected (it pins KNEO_SERV_DATABASE_URL to
PostgreSQL).
Persistence schemas: unchanged from 0.2.0. No migrations required.
No breaking changes to spec syntax, HTTP API contracts, CLI command names, env-var names, or persistence schemas.
0.2.2 — FastAPI info.version fix + post-0.2.0 docs sweep¶
Patch release fixing one regression in the same family as 0.2.1 plus a documentation sweep. No feature changes, no contract changes, no schema changes.
Upgrade:
- pip install -U kneo-serv (resolves to 0.2.2).
- docker pull ghcr.io/kneo-agent/kneo-serv:0.2.2 — :0.2 and
:latest now resolve to the 0.2.2 digest.
What was broken in 0.2.1:
- GET /openapi.json returned info.version: "0.1.0" from the 0.2.1
image because the FastAPI app constructor in
kneo_serv/service/app.py still pinned a hardcoded literal. The
0.2.1 cut fixed HealthResponse.version but missed this parallel
occurrence. 0.2.2 resolves both via the same
importlib.metadata.version("kneo-serv") helper, called at
app-construction time.
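The pattern behind both fixes, reading the version from installed package metadata instead of a literal, looks roughly like the sketch below. `resolve_version` and its fallback string are assumptions for illustration; the guide does not show the actual helper in kneo-serv.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(distribution: str, fallback: str = "0.0.0+unknown") -> str:
    """Return the installed version of `distribution`, or `fallback`."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return fallback  # e.g. running from a source tree without an install
```

Calling `resolve_version("kneo-serv")` in both the health response and the app constructor keeps the two values from drifting apart again.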
Documentation:
- Forward-looking plan docs and "as of 0.1.0" framing in user/dev
docs swept to match the 0.2.x shipped reality. No content lost —
historical files (CHANGELOG entries, shipped release notes,
TODO-0.2.0.md, ADRs) are unchanged.
Persistence schemas: unchanged from 0.2.1. No migrations required.
0.3.0¶
Additive minor release that follows the 0.2.x line. No breaking changes to spec
syntax, HTTP API contracts, CLI command names, env-var names, or
persistence schemas. Full narrative in
release_notes_0.3.0.md.
Upgrade:
- pip install -U kneo-serv (resolves to 0.3.0).
- docker pull ghcr.io/kneo-agent/kneo-serv:0.3.0 — :0.3 and
:latest now resolve to the 0.3.0 digest. The image is now
signed (cosign keyless via Sigstore) and ships with a CycloneDX
SBOM attestation; verification commands are in
supply_chain_review.md § Verification commands.
SDK floor bump:
- The kneo-agent SDK floor moves from >=1.1.1 to >=1.2.0.
Pip auto-resolves on pip install -U kneo-serv, but operators
pinning the SDK separately (e.g. via a constraints file or a
monorepo lockfile) must ensure their install is on 1.2.0 or
newer. The compat test suite passed against kneo-agent 1.2.0
throughout the 0.2.x line; the floor was kept low to avoid
forcing 0.1.x users to upgrade. 0.3.0 is the natural inflection
point to lift it.
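Operators who pin the SDK separately may want a pre-deploy check that the installed SDK meets the new floor. The following is a minimal sketch with hypothetical helper names; it compares numeric components only and raises `ValueError` on pre-release strings, so a full version parser such as the `packaging` library is the safer choice for anything beyond plain releases.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(text: str, width: int = 3) -> tuple[int, ...]:
    """Parse '1.2' or '1.2.0' into integers, padded with zeros to `width`."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_floor(installed: str, floor: str = "1.2.0") -> bool:
    """True when `installed` is at or above `floor` (numeric parts only)."""
    return release_tuple(installed) >= release_tuple(floor)


def installed_sdk_meets_floor(distribution: str = "kneo-agent") -> bool:
    """Check the installed SDK distribution; False when it is not installed."""
    try:
        return meets_floor(version(distribution))
    except PackageNotFoundError:
        return False
```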
New timed_out lifecycle status:
- Runs that hit their run-level deadline transition to a new
terminal timed_out status (alongside completed, failed,
cancelled, expired). Operator tooling that switches on
state.status should accept it as terminal — e.g. dashboards,
alerting rules, retention sweeps (which the platform's own
RetentionPolicy.run_statuses already includes).
- The error.type field on a timed-out run is run_timed_out,
distinct from human_task_expired (which the existing expired
status uses).
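Tooling that switches on the run status could centralize the terminal set so that a new status needs only one change. This sketch uses the status names listed above; `TERMINAL_STATUSES` and `is_terminal` are hypothetical names, not kneo-serv exports.

```python
# Terminal statuses as listed in this guide, including the new timed_out.
TERMINAL_STATUSES = frozenset(
    {"completed", "failed", "cancelled", "expired", "timed_out"}
)


def is_terminal(status: str) -> bool:
    """True when a run in this status will not make further progress."""
    return status in TERMINAL_STATUSES
```

Dashboards and alerting rules that previously hardcoded four names can call `is_terminal(state.status)` instead.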
New runtime surfaces:
- start_run_from_spec(..., timeout_seconds=N) and
run_from_spec(..., timeout_seconds=N) accept an optional
wall-clock deadline. Operator-callable
PlatformManager.prune_timed_out_runs() walks runs and force-cancels
those past their deadline. Same operator-cron pattern as
prune_retention() and prune_expired_human_tasks() — no
built-in scheduler.
- The human-task on_timeout: continue and on_timeout: escalate
literals are now wired in the runtime (they were accepted by the
spec but silently treated as fail in 0.2.x). Operators with
specs that declared these literals will see the documented
behaviour for the first time. Audit consumers should expect new
event types: human.continued, human.continue_failed,
human.escalated, run.timed_out.
- New route GET /v1/runs/{run_id}/policy-report returns the spec
policy report for a stored run, no spec bundle required
client-side. Auth: specs:read scope (same as the existing
POST /v1/specs/policy-report).
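Because there is no built-in scheduler, the prune methods are typically invoked from an operator-owned cron job. The sketch below shows one possible shape for such a job. It assumes each method can be called with no arguments, as written in this guide, and makes no assumption about return values; `run_maintenance` itself is a hypothetical wrapper, and the call order is a design choice rather than a documented requirement.

```python
from typing import Any


def run_maintenance(manager: Any) -> dict[str, Any]:
    """Call each prune method once and collect whatever it returns.

    `manager` is expected to be a PlatformManager-like object exposing the
    three methods named below (argument-free calls are an assumption).
    """
    results: dict[str, Any] = {}
    for method_name in (
        "prune_timed_out_runs",
        "prune_expired_human_tasks",
        "prune_retention",
    ):
        results[method_name] = getattr(manager, method_name)()
    return results
```

Retention runs last here so that runs moved to a terminal status earlier in the same pass are visible to it; whether that matters depends on your retention windows.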
New observability surfaces:
- Three new platform-side OpenTelemetry spans
(kneo.queue.dispatch, kneo.worker.lease,
kneo.continuation.lock) join the SDK's agent-boundary spans
when KNEO_SERV_OTEL_ENABLED=true. Pre-existing OTel pipelines
pick them up automatically once telemetry is enabled — no extra
configuration required. See
observability.md § Platform-side spans.
Persistence schemas: unchanged. The new RunState.deadline_at
and Checkpoint.iteration fields default to None and 1
respectively in the dataclass, so existing rows round-trip cleanly
through the JSON-payload SQLite / PostgreSQL stores.
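The reason no migration is needed can be shown with stand-in dataclasses: a payload written before the new fields existed decodes cleanly because the missing keys take their defaults. `RunRecord`, `CheckpointRecord`, and `decode` below are hypothetical simplifications, not the real RunState, Checkpoint, or store decoder.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunRecord:
    """Stand-in for a stored run; NOT the real kneo-serv RunState."""

    run_id: str
    status: str
    deadline_at: Optional[str] = None  # new field, defaults to None


@dataclass
class CheckpointRecord:
    """Stand-in for a stored checkpoint; NOT the real Checkpoint."""

    checkpoint_id: str
    iteration: int = 1  # new field, defaults to 1


def decode(cls, payload: str):
    """Build `cls` from JSON, ignoring keys the dataclass does not define."""
    data = json.loads(payload)
    known = {field.name for field in fields(cls)}
    return cls(**{key: value for key, value in data.items() if key in known})
```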
0.4.0¶
Additive minor release that follows the 0.3.x line. No breaking changes to
spec syntax, HTTP API contracts, CLI command names, env-var names,
or persistence schemas. Specs that validated under 0.3.x continue to
validate under 0.4.0. The cut is a docs + tooling release —
runtime semantics are identical to 0.3.0. Full narrative in
release_notes_0.4.0.md.
Upgrade:
- pip install -U kneo-serv (resolves to 0.4.0).
- docker pull ghcr.io/kneo-agent/kneo-serv:0.4.0 — :0.4 and
:latest now resolve to the 0.4.0 digest. Image continues to be
signed (cosign keyless via Sigstore) and ships with a CycloneDX
SBOM attestation; the 0.4.0 cut adds a Trivy CVE scan report
attached to the GitHub Release. Verification commands are in
supply_chain_review.md § Verification commands.
SDK floor: unchanged. The kneo-agent floor stays at >=1.2.0
— same as 0.3.0. No operator action required for operators pinning
the SDK separately.
New auto-generated API reference: the docs site at
kneo-agent.github.io/kneo-serv/ gains a new top-level API
Reference nav section with 17 pages (16 subpackages + sdk),
rendered at build time by mkdocstrings from the Python
docstrings. Operator surface unchanged — the API ref is a
developer lookup surface, not a runtime change. See
docs/api/README.md for the index.
Image vulnerability scanning (Trivy): the release pipeline now
scans the pushed GHCR image with Trivy under the CVSS≥7 policy
(HIGH/CRITICAL findings block the publish step). On every
release-tag build, the JSON scan report is attached to the GitHub
Release as the trivy-report-<version> artifact, 90-day
retention. Deployers can re-run the scan locally with
trivy image ghcr.io/kneo-agent/kneo-serv:<tag>; full policy +
escape hatch documented in
security_hardening.md § Image vulnerability scanning.
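Deployers who re-run the scan locally with JSON output (for example `trivy image --format json`) can filter the report for findings that would block a publish. The sketch below assumes the Trivy JSON layout of a top-level `Results` list whose entries carry `Target` and an optional `Vulnerabilities` list; `blocking_findings` is a hypothetical helper, and the release pipeline's own gating logic is not shown in this guide.

```python
import json

BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def blocking_findings(report_json: str) -> list[tuple[str, str, str]]:
    """Return (target, vulnerability id, severity) for HIGH/CRITICAL entries."""
    report = json.loads(report_json)
    findings = []
    for result in report.get("Results") or []:
        # Trivy omits or nulls "Vulnerabilities" when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity")
            if severity in BLOCKING_SEVERITIES:
                findings.append(
                    (result.get("Target", ""), vuln.get("VulnerabilityID", ""), severity)
                )
    return findings
```

An empty list corresponds to a scan that would pass the HIGH/CRITICAL policy described above.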
Developer-facing changes (no operator surface impact):
- Ratcheting ruff D-rule gate (D100/D101/D102) now enforced
project-wide for kneo_serv/**/*.py. New public classes /
methods without docstrings fail CI. Forks adding code should
follow the Google docstring convention; the chain-reference
files are
security/secrets.py and
platform/manager.py.
- Full mypy strict coverage across kneo_serv/. The
[[tool.mypy.overrides]] block in
pyproject.toml now covers every public
module. Forks that subclass or extend public types should expect
disallow_untyped_defs + warn_return_any + strict_equality.
- mkdocstrings[python]>=0.27 added to the docs optional-dep
block. Operators using pip install kneo-serv (without
[docs]) are unaffected — the dep is build-time only for the
rendered site.
New 0.3.0-feature worked examples:
- examples.md picked up a Timeout branches
subsection on the human_approval_workflow.yaml entry covering
the on_timeout: fail/continue/escalate literals (all wired
since 0.3.0).
- New examples/run_with_timeout.py
walks through start_run_from_spec(..., timeout_seconds=N) +
prune_timed_out_runs(). Companion to the human-task timeout
example above.
Persistence schemas: unchanged. No new fields, no migrations.
Rolling back¶
Forward-only schema migrations make in-place downgrade unsafe; the only supported rollback path is to restore from the pre-upgrade backup, then re-install the previous version.
For the full step-by-step procedure — stop, restore, re-install, restart,
verify with the deployment smoke — see
backup_and_recovery.md § Rolling back after a failed upgrade.
Keep the pre-upgrade backup until you have verified the new version through at least one business cycle.
Reporting upgrade issues¶
Capture the same context listed in troubleshooting.md § What to capture before opening a bug, plus:
- Source version (pip show kneo-serv before the upgrade).
- Target version (after the upgrade).
- Migration log lines from the first start on the new version.
- The exact env file or compose .env (with secrets redacted).