Human-in-the-loop walkthrough

An end-to-end guide to pause/resume workflows: declaring a human step in YAML, capturing the pause as a continuation, listing pending tasks, and resuming with a decision.

The architectural rationale lives in design.md § 8.5; this page is the operational walkthrough.

What "human-in-the-loop" means here

A workflow can include steps that block on a human decision. When the runtime reaches such a step it raises HumanInterventionRequired, which the platform catches, serializes into a WorkflowContinuation, and exposes to operators. The run goes from running to paused. A subsequent resume call provides the decision and the workflow continues from where it stopped, using the persisted replay context.

sequenceDiagram
    autonumber
    participant Caller
    participant Service
    participant Workflow as Workflow runtime
    participant Cont as ContinuationStore
    participant Reviewer

    Caller->>Service: POST /runs
    Service->>Workflow: execute
    Workflow-->>Service: HumanInterventionRequired<br/>(continuation_id, request_id)
    Service->>Cont: save continuation + checkpoint
    Service-->>Caller: 202 with continuation_id

    Reviewer->>Service: GET /human-tasks
    Service-->>Reviewer: list (continuation, request, prompt, deadline)

    Reviewer->>Service: POST /human-tasks/{id}/resume<br/>(approve|reject + content)
    Service->>Cont: lock + load continuation
    Service->>Workflow: resume with decision
    Workflow-->>Service: RunResult
    Service-->>Reviewer: 200 result
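The pause/resume cycle in the diagram can be modelled in a few lines. This is a minimal in-memory sketch, not the platform's implementation: the Continuation dataclass, execute_step, and run are hypothetical names, and the real WorkflowContinuation and ContinuationStore carry more state than shown here. It illustrates the one property that matters operationally: completed steps are replayed from the saved context, not re-executed.

```python
from dataclasses import dataclass, field
from typing import Optional


class HumanInterventionRequired(Exception):
    """Raised when a human step is reached without a decision (model only)."""


@dataclass
class Continuation:
    # Hypothetical shape: index of the waiting step plus outputs of finished steps.
    paused_at: int
    replay: dict = field(default_factory=dict)


def execute_step(step, outputs, decision):
    step_id, kind, fn = step
    if kind == "human":
        if decision is None:
            raise HumanInterventionRequired(step_id)
        return decision
    return fn(outputs)


def run(steps, continuation: Optional[Continuation] = None, decision=None):
    """Return ("paused", Continuation) or ("completed", outputs)."""
    start = continuation.paused_at if continuation else 0
    outputs = dict(continuation.replay) if continuation else {}
    for index in range(start, len(steps)):
        step = steps[index]
        try:
            outputs[step[0]] = execute_step(step, outputs, decision)
        except HumanInterventionRequired:
            # The caller would persist this object before answering the request.
            return "paused", Continuation(paused_at=index, replay=dict(outputs))
        if step[1] == "human":
            decision = None  # one decision answers exactly one human step
    return "completed", outputs


calls = []
steps = [
    ("draft", "function", lambda out: calls.append("draft") or "draft v1"),
    ("approve", "human", None),
    ("publish", "function",
     lambda out: calls.append("publish") or f"published ({out['approve']})"),
]

status, cont = run(steps)  # stops at the human step
final_status, outputs = run(steps, continuation=cont, decision="approved")
```

After the second call, calls contains draft exactly once, which is the behavior section 8 relies on after a restart.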

1 · Declare a human step in a spec

examples/human_approval_workflow.yaml is the reference example. The relevant pieces:

workflow:
  type: sequential
  steps:
    - id: draft
      kind: function
      ref: draft_report
    - id: approve
      kind: human            # the pause point
      ref: approval-reviewer
    - id: publish
      kind: function
      ref: publish_report

components:
  humans:
    approval-reviewer:
      description: Please approve or edit the draft report.
      assignee: reviewer@example.com
      timeout_seconds: 86400
      on_timeout: escalate

A human step references an entry under components.humans, which provides the prompt, the assignee, and the timeout policy. The assignee is metadata for the operator UI; the platform doesn't dispatch notifications itself — wire that up at your routing layer.
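A spec with a human step that points at a missing components.humans entry is an easy mistake to make. The check below is a sketch of a pre-flight lint you could run yourself; unresolved_human_refs is a hypothetical helper, not a platform function, and the spec is written as a Python dict mirroring the YAML above so the example needs no YAML parser.

```python
spec = {
    "workflow": {
        "type": "sequential",
        "steps": [
            {"id": "draft", "kind": "function", "ref": "draft_report"},
            {"id": "approve", "kind": "human", "ref": "approval-reviewer"},
            {"id": "publish", "kind": "function", "ref": "publish_report"},
        ],
    },
    "components": {
        "humans": {
            "approval-reviewer": {
                "description": "Please approve or edit the draft report.",
                "assignee": "reviewer@example.com",
                "timeout_seconds": 86400,
                "on_timeout": "escalate",
            }
        }
    },
}


def unresolved_human_refs(spec: dict) -> list:
    """Return ids of human steps whose ref has no components.humans entry."""
    humans = spec.get("components", {}).get("humans", {})
    return [
        step["id"]
        for step in spec["workflow"]["steps"]
        if step["kind"] == "human" and step["ref"] not in humans
    ]


missing = unresolved_human_refs(spec)  # empty for the reference example
```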

2 · Run the workflow until pause

kneo run examples/human_approval_workflow.yaml \
  --input "draft the launch announcement" \
  --target workflow --json

When the workflow pauses on a human step, the response includes the run id, status, the continuation_id to resume against, and the pending request nested in metadata.pending_human_request:

{
  "run_id": "run_2026-05-10T12:34:56_…",
  "status": "paused",
  "output_text": null,
  "human_intervention_required": true,
  "continuation_id": "cont_…",
  "metadata": {
    "pending_human_request": {
      "request_id": "req_…",
      "prompt": "Approve the draft?"
    }
  }
}

If you'd rather drive the API directly:

curl -sf -X POST http://127.0.0.1:8000/v1/runs \
  -H "Authorization: Bearer $KNEO_SERV_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "draft the launch announcement",
    "spec_path": "examples/human_approval_workflow.yaml",
    "target": "workflow"
  }'

The response carries the same continuation_id and request_id fields.
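A caller usually needs just two values from the paused response. The helper below is a sketch assuming the response shape shown above; pending_task is a hypothetical name, and the ids in the sample are made-up placeholders, not real identifiers.

```python
import json


def pending_task(response_text: str):
    """Return (continuation_id, request_id) for a paused run, else None."""
    body = json.loads(response_text)
    if body.get("status") != "paused":
        return None
    request = body.get("metadata", {}).get("pending_human_request", {})
    return body["continuation_id"], request.get("request_id")


# Placeholder payload following the documented shape.
sample = json.dumps({
    "run_id": "run_example",
    "status": "paused",
    "output_text": None,
    "human_intervention_required": True,
    "continuation_id": "cont_example",
    "metadata": {
        "pending_human_request": {
            "request_id": "req_example",
            "prompt": "Approve the draft?",
        }
    },
})

task = pending_task(sample)
```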

3 · List and inspect pending tasks

kneo human list --profile local
kneo human get <continuation_id>

Or via the API:

curl -sf "http://127.0.0.1:8000/v1/human-tasks?run_id=<run_id>" \
  -H "Authorization: Bearer $KNEO_SERV_API_KEY" | jq

curl -sf "http://127.0.0.1:8000/v1/human-tasks/<continuation_id>" \
  -H "Authorization: Bearer $KNEO_SERV_API_KEY" | jq

Listing requires the human:read scope; resuming requires human:write. See production_readiness_review.md § Route Scope Matrix.
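If you script these lookups from Python instead of curl, the requests can be built with the standard library alone. This is a sketch that only constructs the requests; the base URL and the KNEO_SERV_API_KEY variable come from the curl examples above, and the helper names are hypothetical.

```python
import os
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000/v1"


def _authorized(url: str) -> Request:
    # The token needs the human:read scope for both lookups.
    token = os.environ.get("KNEO_SERV_API_KEY", "")
    return Request(url, headers={"Authorization": f"Bearer {token}"})


def list_tasks_request(run_id: str) -> Request:
    """GET /human-tasks filtered by run id."""
    return _authorized(f"{BASE_URL}/human-tasks?{urlencode({'run_id': run_id})}")


def get_task_request(continuation_id: str) -> Request:
    """GET a single task by continuation id."""
    return _authorized(f"{BASE_URL}/human-tasks/{quote(continuation_id, safe='')}")
```

Pass either object to urllib.request.urlopen to send it.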

4 · Resume with a decision

kneo human resume <continuation_id> \
  --request-id <request_id> \
  --approve

Or with structured content over HTTP:

curl -sf -X POST "http://127.0.0.1:8000/v1/human-tasks/<continuation_id>/resume" \
  -H "Authorization: Bearer $KNEO_SERV_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "request_id": "<request_id>",
    "decision": "approved",
    "content": "Looks good. Ship it."
  }'

decision must be one of approved, rejected, edited, selected, or provided. The values are past tense; the CLI's --approve, --reject, --edit, --select, and --provide flags map to them in that order. selected pairs with a selected_option; edited and provided pair with the edited or provided content. Which decisions a step accepts depends on its component definition; the full schema is in service_api.md.
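The pairing rules can be checked client-side before posting. The sketch below is an assumption-laden illustration: build_resume_body is a hypothetical helper, and it assumes that edited and provided text travels in the same content field used in the example body above. service_api.md remains the authority on the schema.

```python
# CLI flag -> decision value, as listed in the walkthrough.
FLAG_TO_DECISION = {
    "--approve": "approved",
    "--reject": "rejected",
    "--edit": "edited",
    "--select": "selected",
    "--provide": "provided",
}


def build_resume_body(request_id, decision, content=None, selected_option=None):
    """Build a resume payload, rejecting combinations the text rules out."""
    if decision not in FLAG_TO_DECISION.values():
        raise ValueError(f"unknown decision: {decision!r}")
    if decision == "selected" and selected_option is None:
        raise ValueError("selected requires selected_option")
    if decision in ("edited", "provided") and content is None:
        raise ValueError(f"{decision} requires content")  # assumed field name
    body = {"request_id": request_id, "decision": decision}
    if content is not None:
        body["content"] = content
    if selected_option is not None:
        body["selected_option"] = selected_option
    return body


approved = build_resume_body("req_example", "approved", content="Looks good. Ship it.")
```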

5 · Idempotent resume

POST /human-tasks/{id}/resume accepts an Idempotency-Key header. If you send the same key with the same body, the platform replays the original response instead of re-executing the resume. The same key with a different body returns 409 idempotency_key_conflict.

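# Generate the key once and reuse it on every retry of this request;
# a fresh key per attempt would defeat the replay protection.
IDEMPOTENCY_KEY="$(uuidgen)"

curl -sf -X POST "http://127.0.0.1:8000/v1/human-tasks/<continuation_id>/resume" \
  -H "Authorization: Bearer $KNEO_SERV_API_KEY" \
  -H "Idempotency-Key: $IDEMPOTENCY_KEY" \
  -H 'Content-Type: application/json' \
  -d '...'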

In the CLI, KNEO_SERV_IDEMPOTENCY_KEY provides the header.
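The replay and conflict rules can be pictured with a small in-memory model. This is a sketch of the general idempotency-key technique under stated assumptions, not the platform's storage or comparison logic: IdempotentResume and IdempotencyConflict are hypothetical names, and bodies are compared by a hash of their canonical JSON.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Stands in for the 409 idempotency_key_conflict response."""


class IdempotentResume:
    def __init__(self, handler):
        self._handler = handler
        self._seen = {}  # key -> (body fingerprint, stored response)

    def post(self, key: str, body: dict):
        fingerprint = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key in self._seen:
            stored_fingerprint, response = self._seen[key]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict("409 idempotency_key_conflict")
            return response  # replay; the handler is not called again
        response = self._handler(body)
        self._seen[key] = (fingerprint, response)
        return response


executions = []


def resume_handler(body):
    executions.append(body)
    return {"status": "completed", "execution": len(executions)}


service = IdempotentResume(resume_handler)
body = {"request_id": "req_example", "decision": "approved"}
first = service.post("key-1", body)
retry = service.post("key-1", dict(body))  # same key, same body
```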

6 · Process-safe locking

The platform acquires a per-continuation lock before executing a resume. When two callers hit resume on the same continuation concurrently, one succeeds and the other receives LockAcquisitionError, so the same human task cannot be acted on twice.

If you see this error, wait for the in-flight resume to finish; do not retry blindly. See troubleshooting.md § 8.1.
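The fail-fast behavior can be illustrated with a non-blocking lock. This sketch is thread-level only and uses hypothetical names (ContinuationLocks, resume); the platform's lock is described as process-safe, which an in-memory threading.Lock is not. The point is the shape: the losing caller gets an error immediately and does not queue behind the winner.

```python
import threading


class LockAcquisitionError(Exception):
    """Model of the error returned to the caller that loses the race."""


class ContinuationLocks:
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def _lock_for(self, continuation_id: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(continuation_id, threading.Lock())

    def resume(self, continuation_id: str, action):
        lock = self._lock_for(continuation_id)
        if not lock.acquire(blocking=False):  # never wait for the holder
            raise LockAcquisitionError(continuation_id)
        try:
            return action()
        finally:
            lock.release()


locks = ContinuationLocks()
outcomes = []


def first_caller():
    # While the first resume is in flight, a second attempt is rejected.
    try:
        locks.resume("cont_example", lambda: "second caller resumed")
    except LockAcquisitionError:
        outcomes.append("second caller rejected")
    return "first caller resumed"


outcomes.append(locks.resume("cont_example", first_caller))
```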

7 · Audit trail

Every human decision records an audit event:

curl -sf "http://127.0.0.1:8000/v1/audit-events?event_type=human.decision" \
  -H "Authorization: Bearer $KNEO_SERV_API_KEY" | jq

The payload records request_id, decision, selected option, result status, and whether content was present — never the content itself. See the audit policy in production_readiness_review.md § Audit Payload Review.
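The "presence, not content" rule looks like this in practice. The field names below are assumptions chosen for illustration, and human_decision_audit_payload is a hypothetical helper; the real payload keys are defined by the platform.

```python
def human_decision_audit_payload(
    request_id, decision, result_status, selected_option=None, content=None
):
    """Build an audit payload that never carries the reviewer's content."""
    return {
        "request_id": request_id,
        "decision": decision,
        "selected_option": selected_option,
        "result_status": result_status,
        # Only a boolean is recorded, so free text cannot leak into the log.
        "content_present": bool(content),
    }


payload = human_decision_audit_payload(
    "req_example", "approved", "completed", content="Looks good. Ship it."
)
```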

8 · Recovery after restart

WorkflowContinuation is persisted in ContinuationStore. If the service restarts mid-run, the paused continuation survives. After restart:

  • Listing /human-tasks returns the same continuation.
  • Resume picks up from the persisted replay context — the workflow does not retry completed steps.

If a run that is not waiting on a human is interrupted (e.g., by a worker crash), the same mechanism lets it continue. Check /runs/{id}/recovery to see whether continuation is available, then call /runs/{id}/continue.

9 · Timeouts and escalation

components.humans.<id>.timeout_seconds and on_timeout declare the policy. When a sequential workflow pauses on a human step that declares a positive timeout_seconds, the platform computes expires_at = pause_time + timeout_seconds and stashes it on the continuation along with the chosen on_timeout value. The prune step below dispatches on that value (fail / continue / escalate). Auto-routing of an escalated task to a different reviewer is still up to the operator's external workflow — the platform marks the task as escalated and emits an audit event, but does not auto-reassign.
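The expiry arithmetic is simple enough to verify by hand. The sketch below assumes timezone-aware UTC timestamps and treats a missing or non-positive timeout as "no expiry"; compute_expiry is a hypothetical helper, and the pause time is an arbitrary example value.

```python
from datetime import datetime, timedelta, timezone


def compute_expiry(pause_time: datetime, timeout_seconds):
    """expires_at = pause_time + timeout_seconds, or None without a positive timeout."""
    if timeout_seconds is None or timeout_seconds <= 0:
        return None
    return pause_time + timedelta(seconds=timeout_seconds)


paused = datetime(2026, 5, 10, 12, 34, 56, tzinfo=timezone.utc)
expires_at = compute_expiry(paused, 86400)  # 86400 s is exactly 24 hours
```

With the reference spec's timeout_seconds of 86400, a run paused at 12:34:56 UTC on 10 May expires at the same time on 11 May.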

Expiring paused runs

Call PlatformManager.prune_expired_human_tasks() on whatever cadence the deployment needs (cron, scheduled run, manual operator action — the same pattern as prune_retention(); there is no built-in scheduler).

For each saved continuation whose expires_at is in the past and whose underlying run is still blocked, the prune dispatches on pending_human_request["on_timeout"]:

  • fail (default): marks the run expired (a lifecycle status alongside failed / cancelled); run.error.type is human_task_expired with a message recording the configured timeout. Records a human.expired audit event with run_id, continuation_id, timeout_seconds, expires_at, expired_at (the cutoff used), and on_timeout. Deletes the continuation.
  • continue: synthesizes an approved HumanResponse carrying metadata.auto_continued = true, metadata.reason = "timeout", and the original metadata.original_assignee, then resumes the workflow past the paused step. Records a human.continued audit event before the resume attempt; if the resume itself raises, also records human.continue_failed (with error_type and error_message) and deletes the continuation so the prune does not retry indefinitely.
  • escalate: keeps the run blocked and stamps pending_human_request["escalated_at"] (plus a copy of the original expires_at as original_expires_at) on the continuation. Records a human.escalated audit event including the original assignee. Subsequent prune calls skip continuations carrying the escalated_at marker, so escalation fires once. The continuation stays alive until an operator resumes it via the normal /human-tasks/{id}/resume path (typically after reassigning to a different reviewer in the operator's external workflow).

Runs that have already resumed before the prune fires (status no longer blocked) are skipped on every branch — the resume path owns the terminal state. Calls are idempotent on the fail and escalate paths; the continue path is single-shot per continuation by construction (success deletes via the resume; failure deletes explicitly).

prune_expired_human_tasks() returns the count of continuations processed this call across all branches. Already-escalated continuations are skipped and do not contribute to the count.
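The dispatch rules above can be condensed into a model of a single prune pass. This is a sketch over plain dicts, not PlatformManager's code: prune_expired, the dict layout (expires_at at the top level of each continuation), and the resume callback are assumptions, and audit events are reduced to their type names.

```python
from datetime import datetime, timezone


def prune_expired(continuations, runs, audit, now, resume):
    """Process expired continuations once; return how many were handled."""
    processed = 0
    for cid, cont in list(continuations.items()):
        request = cont["pending_human_request"]
        run = runs[cont["run_id"]]
        if cont["expires_at"] > now or run["status"] != "blocked":
            continue  # not due yet, or the resume path already owns the run
        if "escalated_at" in request:
            continue  # escalation fires once and is not counted again
        policy = request.get("on_timeout", "fail")
        if policy == "fail":
            run["status"] = "expired"
            audit.append("human.expired")
            del continuations[cid]
        elif policy == "continue":
            audit.append("human.continued")  # recorded before the attempt
            try:
                resume(cid)  # a successful resume removes the continuation
            except Exception:
                audit.append("human.continue_failed")
                continuations.pop(cid, None)  # never retry indefinitely
        elif policy == "escalate":
            request["escalated_at"] = now
            request["original_expires_at"] = cont["expires_at"]
            audit.append("human.escalated")
        else:
            continue
        processed += 1
    return processed


past = datetime(2026, 5, 10, tzinfo=timezone.utc)
now = datetime(2026, 5, 11, tzinfo=timezone.utc)
runs = {name: {"status": "blocked"} for name in ("r1", "r2", "r3")}
continuations = {
    f"c{i}": {"run_id": f"r{i}", "expires_at": past,
              "pending_human_request": {"on_timeout": policy}}
    for i, policy in enumerate(("fail", "continue", "escalate"), start=1)
}
audit = []


def resume(cid):
    runs[continuations[cid]["run_id"]]["status"] = "completed"
    del continuations[cid]


first_pass = prune_expired(continuations, runs, audit, now, resume)
second_pass = prune_expired(continuations, runs, audit, now, resume)
```

The second pass handles nothing: the failed and continued tasks are gone, and the escalated one is skipped because it carries the marker.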

Run-level timeouts vs. human-task timeouts

Two independent timeouts can apply to a run that's blocked on a human step:

  • Human-task timeout (components.humans.<id>.timeout_seconds + on_timeout): bounded waiting time on this specific human step. Stored on the continuation as expires_at. Handled by prune_expired_human_tasks() per the dispatch above.
  • Run-level timeout (start_run_from_spec(..., timeout_seconds=N)): bounded wall-clock for the whole run, no matter which step it's currently on. Stored on the run state as deadline_at. Handled by PlatformManager.prune_timed_out_runs().

Whichever fires first wins. If the run-level prune fires while the run is blocked on a human task, the run transitions to timed_out (not expired), the continuation is deleted, and a run.timed_out audit event is recorded — the human-task path is preempted. If the human-task prune fires first, the human-task on_timeout semantics apply as documented above; the run-level deadline becomes irrelevant because the run is no longer in blocked status.

The timed_out lifecycle status is distinct from expired: expired means a human task missed its deadline (the run was waiting on a human); timed_out means the run as a whole missed its deadline (could have been waiting on a human, could have been mid-execution).
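"Whichever fires first" depends on which prune call reaches the run first once its deadline has passed. The sketch below models that with two hypothetical functions over a dict and integer timestamps; it assumes on_timeout: fail for the human task and is not the platform's implementation.

```python
def prune_human_task(run: dict, now: int) -> None:
    """Human-task path: only acts on runs still blocked on a human."""
    if run["status"] == "blocked" and run["expires_at"] <= now:
        run["status"] = "expired"


def prune_run_timeout(run: dict, now: int) -> None:
    """Run-level path: acts on any run that has not reached a terminal state."""
    if run["status"] in ("running", "blocked") and run["deadline_at"] <= now:
        run["status"] = "timed_out"


# Both deadlines have passed by t=20; the order of the prune calls decides.
run_level_first = {"status": "blocked", "expires_at": 10, "deadline_at": 10}
prune_run_timeout(run_level_first, now=20)
prune_human_task(run_level_first, now=20)  # skipped: run is no longer blocked

human_first = {"status": "blocked", "expires_at": 10, "deadline_at": 10}
prune_human_task(human_first, now=20)
prune_run_timeout(human_first, now=20)  # skipped: run is already terminal
```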

Common failure modes

Symptom                                   | See
LockAcquisitionError on resume            | troubleshooting.md § 8.1
404 on continuation_id                    | troubleshooting.md § 8.2
409 idempotency_key_conflict              | troubleshooting.md § 5.4
403 Missing required scope: human:write   | troubleshooting.md § 4.2

See also

  • examples.md — the full set of runnable specs.
  • service_api.md: /human-tasks route shapes.
  • design.md § 4.3 — design rationale.