Human-in-the-loop walkthrough¶
An end-to-end guide to pause/resume workflows: declaring a human step in YAML, capturing the pause as a continuation, listing pending tasks, and resuming with a decision.
The architectural rationale lives in design.md § 8.5; this page is the operational walkthrough.
What "human-in-the-loop" means here¶
A workflow can include steps that block on a human decision. When the
runtime reaches such a step it raises HumanInterventionRequired, which
the platform catches, serializes into a WorkflowContinuation, and
exposes to operators. The run goes from running to paused. A
subsequent resume call provides the decision and the workflow continues
from where it stopped, using the persisted replay context.
sequenceDiagram
autonumber
participant Caller
participant Service
participant Workflow as Workflow runtime
participant Cont as ContinuationStore
participant Reviewer
Caller->>Service: POST /runs
Service->>Workflow: execute
Workflow-->>Service: HumanInterventionRequired<br/>(continuation_id, request_id)
Service->>Cont: save continuation + checkpoint
Service-->>Caller: 202 with continuation_id
Reviewer->>Service: GET /human-tasks
Service-->>Reviewer: list (continuation, request, prompt, deadline)
Reviewer->>Service: POST /human-tasks/{id}/resume<br/>(approve|reject + content)
Service->>Cont: lock + load continuation
Service->>Workflow: resume with decision
Workflow-->>Service: RunResult
Service-->>Reviewer: 200 result
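The pause/resume cycle in the diagram can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the platform's implementation: the exception class, the `Continuation` dataclass, the in-memory `STORE`, and the `run`/`resume` helpers are all hypothetical stand-ins that only mirror the names used on this page.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


class HumanInterventionRequired(Exception):
    """Toy stand-in for the runtime's pause signal (not the real class)."""

    def __init__(self, step_id: str, prompt: str):
        super().__init__(f"human input needed at step {step_id!r}")
        self.step_id = step_id
        self.prompt = prompt


@dataclass
class Continuation:
    continuation_id: str
    step_index: int  # index of the paused step, where resume picks up
    completed: dict = field(default_factory=dict)  # replay context: finished outputs


STORE = {}  # hypothetical in-memory ContinuationStore: id -> Continuation

STEPS = [("draft", "function"), ("approve", "human"), ("publish", "function")]


def run(start: int = 0, completed: Optional[dict] = None,
        decision: Optional[str] = None) -> dict:
    """Execute steps from `start`; pause on a human step with no decision."""
    completed = dict(completed or {})
    try:
        for index in range(start, len(STEPS)):
            step_id, kind = STEPS[index]
            if kind == "human":
                if decision is None:
                    raise HumanInterventionRequired(step_id, "Approve the draft?")
                completed[step_id] = decision
                decision = None  # a decision answers exactly one human step
            else:
                completed[step_id] = f"{step_id}:done"
    except HumanInterventionRequired as pause:
        # Serialize the pause: remember the paused step and finished outputs.
        cont = Continuation(f"cont_{uuid.uuid4().hex[:8]}", index, completed)
        STORE[cont.continuation_id] = cont
        return {"status": "paused", "continuation_id": cont.continuation_id,
                "prompt": pause.prompt}
    return {"status": "completed", "outputs": completed}


def resume(continuation_id: str, decision: str) -> dict:
    """Load the continuation and continue from the paused step."""
    cont = STORE.pop(continuation_id)
    return run(cont.step_index, cont.completed, decision)
```

Calling `run()` returns a paused result with a `continuation_id`; passing that id to `resume(...)` finishes the remaining steps without re-running `draft`, because its output is carried in the saved replay context.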
1 · Declare a human step in a spec¶
examples/human_approval_workflow.yaml
is the reference example. The relevant pieces:
workflow:
  type: sequential
  steps:
    - id: draft
      kind: function
      ref: draft_report
    - id: approve
      kind: human  # the pause point
      ref: approval-reviewer
    - id: publish
      kind: function
      ref: publish_report
components:
  humans:
    approval-reviewer:
      description: Please approve or edit the draft report.
      assignee: reviewer@example.com
      timeout_seconds: 86400
      on_timeout: escalate
A human step references an entry under components.humans, which
provides the prompt, the assignee, and the timeout policy. The assignee
is metadata for the operator UI; the platform doesn't dispatch
notifications itself — wire that up at your routing layer.
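Because every human step must resolve to an entry under components.humans, a quick pre-flight check can catch dangling references before a run starts. The sketch below assumes the spec has already been parsed into a plain dict with the layout shown above; `unresolved_human_refs` is a hypothetical helper, not part of the platform.

```python
def unresolved_human_refs(spec: dict) -> list:
    """Return ids of human steps whose ref is missing from components.humans."""
    humans = spec.get("components", {}).get("humans", {})
    steps = spec.get("workflow", {}).get("steps", [])
    return [
        step["id"]
        for step in steps
        if step.get("kind") == "human" and step.get("ref") not in humans
    ]


# The reference example from this page, expressed as a parsed dict.
spec = {
    "workflow": {
        "type": "sequential",
        "steps": [
            {"id": "draft", "kind": "function", "ref": "draft_report"},
            {"id": "approve", "kind": "human", "ref": "approval-reviewer"},
            {"id": "publish", "kind": "function", "ref": "publish_report"},
        ],
    },
    "components": {
        "humans": {
            "approval-reviewer": {
                "description": "Please approve or edit the draft report.",
                "assignee": "reviewer@example.com",
                "timeout_seconds": 86400,
                "on_timeout": "escalate",
            }
        }
    },
}
```

For the reference spec the helper returns an empty list; renaming the `approval-reviewer` entry without updating the step would make it return `["approve"]`.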
2 · Run the workflow until pause¶
kneo run examples/human_approval_workflow.yaml \
--input "draft the launch announcement" \
--target workflow --json
When the workflow pauses on a human step, the response includes the run id,
status, the continuation_id to resume against, and the pending request
nested in metadata.pending_human_request:
{
  "run_id": "run_2026-05-10T12:34:56_…",
  "status": "paused",
  "output_text": null,
  "human_intervention_required": true,
  "continuation_id": "cont_…",
  "metadata": {
    "pending_human_request": {
      "request_id": "req_…",
      "prompt": "Approve the draft?"
    }
  }
}
If you'd rather drive the API directly:
curl -sf -X POST http://127.0.0.1:8000/v1/runs \
-H "Authorization: Bearer $KNEO_SERV_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"input": "draft the launch announcement",
"spec_path": "examples/human_approval_workflow.yaml",
"target": "workflow"
}'
The response carries the same continuation_id and request_id fields.
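A client usually needs just two values from a paused response: the continuation_id and the request_id. A minimal sketch of pulling them out, assuming the response shape shown above; `pending_task` is a hypothetical helper and the ids in `SAMPLE` are placeholders, not real values.

```python
import json


def pending_task(response_text: str):
    """Return (continuation_id, request_id) for a paused run, else None."""
    body = json.loads(response_text)
    if body.get("status") != "paused":
        return None
    request = body.get("metadata", {}).get("pending_human_request", {})
    return body["continuation_id"], request.get("request_id")


# Placeholder ids for illustration only.
SAMPLE = json.dumps({
    "run_id": "run_example",
    "status": "paused",
    "output_text": None,
    "human_intervention_required": True,
    "continuation_id": "cont_example",
    "metadata": {
        "pending_human_request": {
            "request_id": "req_example",
            "prompt": "Approve the draft?",
        }
    },
})
```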
3 · List and inspect pending tasks¶
Over the API, list the pending tasks for a run, then fetch a single task by its continuation_id:
curl -sf "http://127.0.0.1:8000/v1/human-tasks?run_id=<run_id>" \
-H "Authorization: Bearer $KNEO_SERV_API_KEY" | jq
curl -sf "http://127.0.0.1:8000/v1/human-tasks/<continuation_id>" \
-H "Authorization: Bearer $KNEO_SERV_API_KEY" | jq
Listing requires the human:read scope; resuming requires
human:write. See
production_readiness_review.md § Route Scope Matrix.
4 · Resume with a decision¶
To resume with structured content over HTTP:
curl -sf -X POST "http://127.0.0.1:8000/v1/human-tasks/<continuation_id>/resume" \
-H "Authorization: Bearer $KNEO_SERV_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"request_id": "<request_id>",
"decision": "approved",
"content": "Looks good. Ship it."
}'
decision must be one of approved, rejected, edited, selected, or
provided (past-tense — the CLI's --approve, --reject, --edit,
--select, and --provide flags map to these). selected pairs with a
selected_option; edited and provided pair with edited/provided
content. Which decisions a step accepts depends on its component
definition; the full schema is in
service_api.md.
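The pairing rules above are easy to check on the client before sending a resume. The sketch below encodes only what this page states (the five decision values, the CLI flag mapping, and which decisions need `selected_option` or `content`); `validate_resume_body` is a hypothetical helper, and the authoritative schema remains the one in service_api.md.

```python
DECISIONS = {"approved", "rejected", "edited", "selected", "provided"}

# CLI flag -> past-tense decision value, as described above.
CLI_FLAG_TO_DECISION = {
    "--approve": "approved",
    "--reject": "rejected",
    "--edit": "edited",
    "--select": "selected",
    "--provide": "provided",
}


def validate_resume_body(body: dict) -> list:
    """Return a list of problems; an empty list means the body looks consistent."""
    decision = body.get("decision")
    if decision not in DECISIONS:
        return [f"decision must be one of {sorted(DECISIONS)}"]
    errors = []
    if decision == "selected" and not body.get("selected_option"):
        errors.append("selected requires selected_option")
    if decision in {"edited", "provided"} and not body.get("content"):
        errors.append(f"{decision} requires content")
    return errors
```

Note that this only checks internal consistency; whether a particular step accepts a given decision still depends on its component definition.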
5 · Idempotent resume¶
POST /human-tasks/{id}/resume accepts an Idempotency-Key header. If
you send the same key with the same body, the platform replays the
original response instead of re-executing the resume. Mismatched bodies
return 409 idempotency_key_conflict.
curl -sf -X POST "http://127.0.0.1:8000/v1/human-tasks/<continuation_id>/resume" \
-H "Authorization: Bearer $KNEO_SERV_API_KEY" \
-H "Idempotency-Key: $(uuidgen)" \
-H 'Content-Type: application/json' \
-d '...'
In the CLI, KNEO_SERV_IDEMPOTENCY_KEY provides the header.
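The replay-or-conflict behaviour can be illustrated with a small in-memory model. This is a sketch of the general idempotency-key technique under stated assumptions, not the platform's code: `IdempotentEndpoint`, `IdempotencyConflict`, and the body fingerprinting scheme are hypothetical.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Stand-in for a 409 idempotency_key_conflict response."""


class IdempotentEndpoint:
    def __init__(self, handler):
        self._handler = handler
        self._seen = {}  # key -> (body fingerprint, stored response)

    def call(self, key: str, body: dict):
        # Canonical JSON so that key order does not change the fingerprint.
        raw = json.dumps(body, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(raw).hexdigest()
        if key in self._seen:
            stored_fingerprint, response = self._seen[key]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict("idempotency_key_conflict")
            return response  # replay: the handler is not executed again
        response = self._handler(body)
        self._seen[key] = (fingerprint, response)
        return response
```

One practical consequence: a client that retries after a network error should reuse the key from the first attempt. Generating a fresh key per attempt, as `$(uuidgen)` does on every invocation, gives each attempt its own identity and no replay protection across retries.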
6 · Process-safe locking¶
The platform acquires a per-continuation lock before executing a resume.
Two callers hitting resume on the same continuation see one succeed and
the other receive LockAcquisitionError. This guarantees the same human
task cannot be acted on twice.
If you see this error, wait for the in-flight resume to finish; do not retry blindly. See troubleshooting.md § 8.1.
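The one-winner behaviour can be pictured with a non-blocking lock registry. The sketch below is in-process only and purely illustrative: the platform's lock is described as process-safe, which a `threading.Lock` alone does not provide, and both `ContinuationLocks` and this `LockAcquisitionError` class are hypothetical stand-ins.

```python
import threading


class LockAcquisitionError(RuntimeError):
    """Stand-in for the error the losing caller receives."""


class ContinuationLocks:
    def __init__(self):
        self._guard = threading.Lock()  # protects the set of held ids
        self._held = set()

    def acquire(self, continuation_id: str) -> None:
        """Take the lock or fail immediately; never wait for the holder."""
        with self._guard:
            if continuation_id in self._held:
                raise LockAcquisitionError(
                    f"resume already in flight for {continuation_id}"
                )
            self._held.add(continuation_id)

    def release(self, continuation_id: str) -> None:
        with self._guard:
            self._held.discard(continuation_id)
```

Failing fast rather than queueing the second caller matches the guidance above: the loser should check the outcome of the in-flight resume instead of repeating the request.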
7 · Audit trail¶
Every human decision records an audit event:
curl -sf "http://127.0.0.1:8000/v1/audit-events?event_type=human.decision" \
-H "Authorization: Bearer $KNEO_SERV_API_KEY" | jq
The payload records request_id, decision, selected option, result
status, and whether content was present — never the content itself.
See the audit policy in
production_readiness_review.md § Audit Payload Review.
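The "presence, not content" rule can be sketched as a payload builder. The key names `selected_option`, `result_status`, and `content_present` are assumptions chosen to match the prose above, and `decision_audit_payload` is a hypothetical helper; the real payload shape is defined by the platform.

```python
from typing import Optional


def decision_audit_payload(request_id: str, decision: str, result_status: str,
                           content: Optional[str] = None,
                           selected_option: Optional[str] = None) -> dict:
    """Build an audit payload that never includes the reviewer's content."""
    return {
        "request_id": request_id,
        "decision": decision,
        "selected_option": selected_option,
        "result_status": result_status,
        # Record only whether content was supplied, not what it said.
        "content_present": bool(content),
    }
```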
8 · Recovery after restart¶
WorkflowContinuation is persisted in ContinuationStore. If the
service restarts mid-run, the paused continuation survives. After
restart:
- Listing /human-tasks returns the same continuation.
- Resume picks up from the persisted replay context; the workflow does not retry completed steps.
If a non-human workflow is interrupted (e.g., a worker crash), the same
mechanism enables continuation. Check /runs/{id}/recovery to see
whether continuation is available, then call /runs/{id}/continue.
9 · Timeouts and escalation¶
components.humans.<id>.timeout_seconds and on_timeout declare the
policy. When a sequential workflow pauses on a human step that declares
a positive timeout_seconds, the platform computes
expires_at = pause_time + timeout_seconds and stashes it on the
continuation along with the chosen on_timeout value. The prune step
below dispatches on that value (fail / continue / escalate).
Auto-routing of an escalated task to a different reviewer is still up
to the operator's external workflow — the platform marks the task as
escalated and emits an audit event, but does not auto-reassign.
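The deadline arithmetic is simple enough to show directly. A minimal sketch, assuming timezone-aware UTC timestamps and assuming that a missing or non-positive timeout means no deadline is stored; `compute_expires_at` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_expires_at(pause_time: datetime,
                       timeout_seconds: Optional[int]) -> Optional[datetime]:
    """expires_at = pause_time + timeout_seconds, for positive timeouts only."""
    if timeout_seconds is None or timeout_seconds <= 0:
        return None  # assumption: no deadline is recorded in this case
    return pause_time + timedelta(seconds=timeout_seconds)


paused_at = datetime(2026, 5, 10, 12, 34, 56, tzinfo=timezone.utc)
expires_at = compute_expires_at(paused_at, 86400)  # 86400 s = 24 hours
```

With the reference spec's timeout of 86400 seconds, a run paused at 2026-05-10T12:34:56Z gets a deadline exactly one day later.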
Expiring paused runs¶
Call PlatformManager.prune_expired_human_tasks() on whatever cadence
the deployment needs (cron, scheduled run, manual operator action — the
same pattern as prune_retention(); there is no built-in scheduler).
For each saved continuation whose expires_at is in the past and whose
underlying run is still blocked, the prune dispatches on
pending_human_request["on_timeout"]:
- fail (default): marks the run expired (a lifecycle status alongside failed / cancelled); run.error.type is human_task_expired with a message recording the configured timeout. Records a human.expired audit event with run_id, continuation_id, timeout_seconds, expires_at, expired_at (the cutoff used), and on_timeout. Deletes the continuation.
- continue: synthesizes an approved HumanResponse carrying metadata.auto_continued = true, metadata.reason = "timeout", and the original metadata.original_assignee, then resumes the workflow past the paused step. Records a human.continued audit event before the resume attempt; if the resume itself raises, also records human.continue_failed (with error_type and error_message) and deletes the continuation so the prune does not retry indefinitely.
- escalate: keeps the run blocked and stamps pending_human_request["escalated_at"] (plus a copy of the original expires_at as original_expires_at) on the continuation. Records a human.escalated audit event including the original assignee. Subsequent prune calls skip continuations carrying the escalated_at marker, so escalation fires once. The continuation stays alive until an operator resumes it via the normal /continuations/{id}/resume path (typically after reassigning to a different reviewer in the operator's external workflow).
Runs that have already resumed before the prune fires (status no
longer blocked) are skipped on every branch — the resume path owns
the terminal state. Calls are idempotent on the fail and escalate
paths; the continue path is single-shot per continuation by
construction (success deletes via the resume; failure deletes
explicitly).
prune_expired_human_tasks() returns the count of continuations
processed this call across all branches. Already-escalated
continuations are skipped and do not contribute to the count.
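The dispatch and counting rules above can be condensed into a toy model. This is an illustrative sketch over plain dicts, not the body of PlatformManager.prune_expired_human_tasks(): the function name `prune_expired`, the dict layouts, the `resume` callback, and the audit list are all assumptions made for the example.

```python
from datetime import datetime, timezone


def prune_expired(continuations: dict, runs: dict, now: datetime,
                  resume, audit: list) -> int:
    """Process expired human tasks; return how many were handled this call."""
    processed = 0
    for cont_id in list(continuations):  # copy keys: branches may delete
        cont = continuations[cont_id]
        request = cont["pending_human_request"]
        run = runs[cont["run_id"]]
        if cont["expires_at"] >= now:
            continue  # deadline not yet in the past
        if run["status"] != "blocked":
            continue  # already resumed; the resume path owns the outcome
        if "escalated_at" in request:
            continue  # escalation fires once; not counted again
        policy = request.get("on_timeout", "fail")
        if policy == "fail":
            run["status"] = "expired"
            audit.append("human.expired")
            del continuations[cont_id]
        elif policy == "continue":
            audit.append("human.continued")  # recorded before the attempt
            response = {"decision": "approved",
                        "metadata": {"auto_continued": True, "reason": "timeout"}}
            try:
                resume(cont, response)
            except Exception:
                audit.append("human.continue_failed")
            # Single-shot either way: drop the continuation so it is not retried.
            continuations.pop(cont_id, None)
        elif policy == "escalate":
            request["escalated_at"] = now
            request["original_expires_at"] = cont["expires_at"]
            audit.append("human.escalated")
        else:
            continue  # unknown policy: left untouched in this sketch
        processed += 1
    return processed
```

Calling the function twice in a row with the same clock shows the idempotence described above: the second call finds nothing left to fail or continue, skips the escalated task, and returns 0.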
Run-level timeouts vs. human-task timeouts¶
Two independent timeouts can apply to a run that's blocked on a human step:
- Human-task timeout (components.humans.<id>.timeout_seconds + on_timeout): bounded waiting time on this specific human step. Stored on the continuation as expires_at. Handled by prune_expired_human_tasks() per the dispatch above.
- Run-level timeout (start_run_from_spec(..., timeout_seconds=N)): bounded wall-clock time for the whole run, no matter which step it's currently on. Stored on the run state as deadline_at. Handled by PlatformManager.prune_timed_out_runs().
Whichever fires first wins. If the run-level prune fires while the
run is blocked on a human task, the run transitions to timed_out
(not expired), the continuation is deleted, and a run.timed_out
audit event is recorded — the human-task path is preempted. If the
human-task prune fires first, the human-task on_timeout semantics
apply as documented above; the run-level deadline becomes irrelevant
because the run is no longer in blocked status.
The timed_out lifecycle status is distinct from expired:
expired means a human task missed its deadline (the run was waiting
on a human); timed_out means the run as a whole missed its
deadline (could have been waiting on a human, could have been
mid-execution).
Common failure modes¶
| Symptom | See |
|---|---|
| LockAcquisitionError on resume | troubleshooting.md § 8.1 |
| 404 on continuation_id | troubleshooting.md § 8.2 |
| 409 idempotency_key_conflict | troubleshooting.md § 5.4 |
| 403 Missing required scope: human:write | troubleshooting.md § 4.2 |
See also¶
- examples.md: the full set of runnable specs.
- service_api.md: /human-tasks route shapes.
- design.md § 4.3: design rationale.