ARC-ADR-031 — Automated Runbook Orchestration: BPMN 2.0 + OASIS CACAO 2.0 on a Shared Kernel in a Function-Tier Container
| Field | Value |
|---|---|
| ID | ARC-ADR-031 |
| Status | Accepted |
| Date | 2026-05-28 |
| Deciders | Hub owner (Nicky Clarke) — accepted 2026-05-28 (v1 scope; live doctor 4/4) |
| Supersedes | — |
| Superseded by | — |
| Tags | runbooks, playbooks, bpmn, cacao, soar, function-tier, container, event-bus, security, hitl |
Context and Problem Statement
The fleet has an event bus (ARC-ADR-022: NATS JetStream + CloudEvents) and an incident-responder agent persona, but no executable layer that turns a documented response procedure into an automated, event-triggered run. "When event X fires, do A, branch on B, fan out to C and D, escalate to a human at E" lives only in prose and in agents' heads.
Two open, mature standards already describe exactly this:
- BPMN 2.0 (OMG) — XML process orchestration: start/end events, tasks, gateways, sequence flows, message/timer/signal triggers. Strong at event-driven control flow and has a graphical notation.
- OASIS CACAO 2.0 — JSON security playbooks: a workflow map of typed steps (action, if/while/switch-condition, parallel, playbook-action), with first-class agent/target binding and command objects (ssh, http-api, manual, openc2-http, …). Purpose-built for SOAR.
Neither is a superset of the other (CACAO has no event-wait primitive or diagram; BPMN has no executor/target binding or step variables), and the fleet uses both kinds of artifact: process-shaped runbooks and security playbooks.
The decision: how should the fleet execute automated runbooks — what runtime, what architectural home, what format support, and what command-execution safety posture?
Decision Drivers
| # | Driver |
|---|---|
| D1 | Support both standards, not a bespoke DSL. Authors should write BPMN or CACAO, not a proprietary format. Interop with existing SOAR/BPMN tooling matters. |
| D2 | Streamlined + isolated. The headline requirement: a small helper whose only required input is runbook files. No DB, no spoke coupling, runnable standalone with zero network. |
| D3 | Respond to events and triggers. It must subscribe to the existing event bus and fire a runbook on a matching event — but event handling must be an optional layer, not a hard dependency. |
| D4 | Fit ARC-ADR-023 function-tier criteria. Bursty, stateless, independently rolled out, small blast radius — a worker, not a platform service. |
| D5 | Safe by default for security playbooks. A playbook executor that blindly runs ssh/bash against targets is a remote-code-execution engine. The default posture must be observe-and-gate, with real execution opt-in and HITL-gated. |
| D6 | Reuse fleet primitives. NATS JetStream + CloudEvents (ARC-ADR-022) for triggers; HITL Decision Artifacts (ARC-ADR-001/-006) for human gates; structured NDJSON / OTel (ARC-ADR-010) for observability. |
| D7 | Supply-chain minimalism. Security tooling should minimize its dependency surface. Prefer stdlib + a hardened XML parser over a heavyweight workflow engine. |
Considered Options
Option A — Hand-rolled minimal interpreter over a shared intermediate representation, Python, function-tier image (recommended)
Both formats are parsed into one annotated directed graph (the IR: Node + routing fields). A single ~400-line execution kernel runs the IR; only the parsers and condition evaluators are format-specific. Kernel deps: Python stdlib + defusedxml (XXE-hardened BPMN parsing). The serve layer (FastAPI + nats-py) is optional and degrades gracefully. Default executor is a safe orchestrator: manual → HITL ack, http-api/emit-event → publish CloudEvent, ssh/bash/powershell → dry-run.
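The shared-IR idea above can be illustrated with a minimal sketch. It assumes a simplified node shape: the field names (`next`, `condition`, `on_true`, `on_false`, `command`) and the `run` signature are hypothetical, not the orchestrator's real schema, and only START/END/ACTION/DECISION are covered.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class Kind(Enum):
    START = "start"
    END = "end"
    ACTION = "action"
    DECISION = "decision"


@dataclass
class Node:
    """One IR node. Routing field names are illustrative assumptions."""
    id: str
    kind: Kind
    next: Optional[str] = None        # successor for START and ACTION nodes
    condition: Optional[str] = None   # variable name a DECISION node tests
    on_true: Optional[str] = None
    on_false: Optional[str] = None
    command: Dict[str, str] = field(default_factory=dict)


def run(graph: Dict[str, Node], start: str,
        execute: Callable[[Node, dict], None],
        variables: Optional[dict] = None, max_steps: int = 1000) -> List[str]:
    """Walk the graph from `start` until an END node; return the visited ids."""
    variables = {} if variables is None else variables
    trace: List[str] = []
    current: Optional[str] = start
    for _ in range(max_steps):        # step budget guards against cycles
        if current is None:
            raise ValueError("dangling edge: node has no successor")
        node = graph[current]
        trace.append(node.id)
        if node.kind is Kind.END:
            return trace
        if node.kind is Kind.ACTION:
            execute(node, variables)  # the executor is injected, not built in
        if node.kind is Kind.DECISION:
            current = node.on_true if variables.get(node.condition) else node.on_false
        else:
            current = node.next
    raise RuntimeError("step budget exhausted")


# Usage: a four-node graph that pages a human only when `severe` is truthy.
graph = {
    "s": Node("s", Kind.START, next="check"),
    "check": Node("check", Kind.DECISION, condition="severe",
                  on_true="page", on_false="done"),
    "page": Node("page", Kind.ACTION, next="done", command={"type": "manual"}),
    "done": Node("done", Kind.END),
}
print(run(graph, "s", lambda node, variables: None, {"severe": True}))
```

Because the kernel only sees `Node` objects and an injected `execute` callable, neither the source format nor the command posture leaks into the walk itself.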
Option B — Embed an off-the-shelf BPMN engine (SpiffWorkflow / bpmn-engine)
Use a maintained engine for BPMN. Problem: CACAO is not BPMN — no BPMN engine speaks CACAO's step model, agent/target binding, or STIX-pattern conditions, so CACAO would need a separate runtime anyway. SpiffWorkflow is LGPL (license review for a security container); bpmn-engine is Node (a second language in the image). Neither gives the safe-executor posture for free.
Option C — Heavyweight SOAR/orchestration platform (Zeebe/Camunda, or a hosted SOAR)
Rejected on D2/D4/D7: Zeebe needs a Raft broker cluster (~2 GB RAM minimum, durable RocksDB state), the opposite of a small, isolated, stateless helper.
Option D — Status quo: runbooks stay as prose + agent personas
No executable layer; every response is hand-driven. Fails the originating requirement.
Decision Outcome
Accepted: Option A, with the safe-orchestrator command posture as the shipped v1 default (hub owner decision, 2026-05-28). Real exec / timers / JWS remain open questions for a later version.
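The safe-orchestrator posture can be pictured as a dispatch on the command type. This is a sketch under stated assumptions: `safe_execute`, the injected `publish` and `request_ack` callables, the returned status strings, and the fail-closed handling of unknown types are all hypothetical, and publishing on `fleet.runbook.command` is assumed from the producer subjects this ADR lists.

```python
import json
from typing import Callable, Dict

DRY_RUN_TYPES = {"ssh", "bash", "powershell"}   # never executed in v1
PUBLISH_TYPES = {"http-api", "emit-event"}      # turned into bus events


def safe_execute(command: Dict[str, str],
                 publish: Callable[[str, bytes], None],
                 request_ack: Callable[[str], bool]) -> str:
    """Route a command by type without ever running it against a target."""
    ctype = command.get("type", "")
    if ctype == "manual":
        # A human acknowledges (or rejects) the step through the HITL gate.
        return "acked" if request_ack(command.get("command", "")) else "rejected"
    if ctype in PUBLISH_TYPES:
        publish("fleet.runbook.command", json.dumps(command).encode("utf-8"))
        return "published"
    if ctype in DRY_RUN_TYPES:
        return "dry-run"                        # recorded, not executed
    return "refused"                            # assumption: unknown types fail closed


# Usage with in-memory stand-ins for the bus and the HITL gate.
published = []
status = safe_execute({"type": "ssh", "command": "systemctl restart app"},
                      publish=lambda subject, body: published.append(subject),
                      request_ack=lambda prompt: True)
print(status, published)
```

Injecting `publish` and `request_ack` keeps the sketch runnable with zero network, which matches the standalone requirement in D2.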
The implementation lands as templates/runbook-orchestrator-image/ (function-tier, agentarmy-runbook-orchestrator): one IR, one kernel, two parsers, structural validators, a serve-mode control API + JetStream trigger dispatcher, three bundled fixtures, and an external doctor that proves a bus event fires a runbook end-to-end.
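As a rough illustration of "two parsers, one IR", the sketch below reduces each format to the same successor map. The function names and the edge-map shape are hypothetical, the CACAO step ids are shortened for readability, and the fallback to the stdlib parser exists only so the sketch runs without `defusedxml` installed; it is not hardened and would not be appropriate in the image.

```python
import json
from typing import Dict, List

try:
    from defusedxml.ElementTree import fromstring   # hardened parser named in this ADR
except ImportError:
    # Illustration-only fallback; the stdlib parser is NOT XXE-hardened.
    from xml.etree.ElementTree import fromstring

BPMN_NS = "{http://www.omg.org/spec/BPMN/20100524/MODEL}"


def bpmn_edges(xml_text: str) -> Dict[str, List[str]]:
    """Map each BPMN element id to the ids its sequence flows point at."""
    root = fromstring(xml_text)
    edges: Dict[str, List[str]] = {}
    for flow in root.iter(BPMN_NS + "sequenceFlow"):
        edges.setdefault(flow.get("sourceRef"), []).append(flow.get("targetRef"))
    return edges


def cacao_edges(json_text: str) -> Dict[str, List[str]]:
    """Map each CACAO workflow step id to its successor step ids."""
    workflow = json.loads(json_text)["workflow"]
    edges: Dict[str, List[str]] = {}
    for step_id, step in workflow.items():
        targets = [step[key] for key in ("on_completion", "on_true", "on_false")
                   if key in step]
        targets += step.get("next_steps", [])        # parallel fan-out
        if targets:
            edges[step_id] = targets
    return edges


BPMN_DOC = """
<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <process id="p">
    <startEvent id="start"/><task id="notify"/><endEvent id="end"/>
    <sequenceFlow id="f1" sourceRef="start" targetRef="notify"/>
    <sequenceFlow id="f2" sourceRef="notify" targetRef="end"/>
  </process>
</definitions>
"""

CACAO_DOC = json.dumps({"workflow": {
    "start--1": {"type": "start", "on_completion": "action--1"},
    "action--1": {"type": "action", "on_completion": "end--1"},
    "end--1": {"type": "end"},
}})

print(bpmn_edges(BPMN_DOC))
print(cacao_edges(CACAO_DOC))
```

Both calls yield a two-edge chain, which is the property the kernel relies on: once parsed, a BPMN process and a CACAO playbook of the same shape are indistinguishable to the walker.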
v1 scope
In: parse + validate BPMN and CACAO; the IR kernel (START/END/ACTION/DECISION/PARALLEL/CALL + a pass-through CATCH_EVENT in one-shot mode); manual/emit-event/dry-run executors; CLI run/validate; serve mode with NATS event triggers; doctor.
Out (deferred): real ssh/bash execution (opt-in + per-command allowlist + HITL for destructive ops); durable/resumable long waits across restarts; JWS signature verification of playbooks; the full STIX-pattern condition grammar (v1 supports an equality/inequality/numeric subset); timer-driven scheduling; graphical rendering.
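To make the "equality/inequality/numeric subset" concrete, here is a minimal sketch of such an evaluator. The `name op literal` surface syntax, the `evaluate` function, and the error behavior are assumptions for illustration; this is neither STIX Patterning nor FEEL, and the real v1 grammar may differ.

```python
import operator
import re
from typing import Any, Dict

_OPS = {"==": operator.eq, "!=": operator.ne, "<=": operator.le,
        ">=": operator.ge, "<": operator.lt, ">": operator.gt}
# One comparison only: <variable> <operator> <literal>
_PATTERN = re.compile(r"^\s*([A-Za-z_][\w.:-]*)\s*(==|!=|<=|>=|<|>)\s*(.+?)\s*$")


def _literal(text: str) -> Any:
    """Parse a quoted string or a number; anything else is rejected."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    try:
        return float(text)
    except ValueError:
        raise ValueError(f"unsupported literal: {text!r}") from None


def evaluate(condition: str, variables: Dict[str, Any]) -> bool:
    """Evaluate a single comparison against the runbook's variables."""
    match = _PATTERN.match(condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    name, op, raw = match.groups()
    left, right = variables[name], _literal(raw)   # KeyError if the variable is unset
    if op in ("<", "<=", ">", ">="):
        for value in (left, right):
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError("ordering comparisons need numeric operands")
    return _OPS[op](left, right)


print(evaluate("severity >= 7", {"severity": 9}))
print(evaluate("status == 'open'", {"status": "closed"}))
```

Rejecting anything outside the pattern, rather than falling back to `eval`, keeps the condition engine from becoming a second code-execution path.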
Why not the alternatives
- B — a second runtime for CACAO defeats the "one small kernel" goal; the IR gives BPMN coverage at a fraction of the surface, and we control the security posture.
- C — violates the isolation/footprint requirement outright.
- D — leaves the originating need unmet.
Affected Layers / Repos
| Layer | Repo | Impact |
|---|---|---|
| (infra) | hub | New function-tier image templates/runbook-orchestrator-image/; this ADR; contracts/runbook-orchestrator.openapi.yaml; a docs/contracts.md Registry row; docs/runbook-orchestrator.md |
| middle-core | nickpclarke/middle-core | Hosts the NATS broker (ARC-ADR-022); relates to issue #93 (workflow-orchestrator — runbooks/playbooks as middle-core semantic artifacts). The orchestrator can be the executor those semantic artifacts compile to. |
| (cross-cutting) | event bus | Adds producers (fleet.runbook.command, fleet.runbook.completed) and consumers (any fleet.* subject a runbook's start trigger names) |
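The bus interaction in the last row can be sketched in two parts: matching an incoming subject against a trigger pattern, and wrapping a completion notice in a CloudEvents 1.0 envelope. The helper names, the `source` value, the use of the subject as the event `type`, and the `data` fields are hypothetical. One detail worth noting: under NATS wildcard rules `fleet.*` matches exactly one further token, so a trigger meant to cover deeper subjects would need `fleet.>`.

```python
import json
import uuid
from datetime import datetime, timezone


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is exactly one token, '>' is one or more trailing tokens."""
    p_tokens, s_tokens = pattern.split("."), subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' is only valid as the last token and must consume at least one token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens) or (token != "*" and token != s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)


def completed_event(runbook_id: str, outcome: str) -> bytes:
    """Build a CloudEvents 1.0 JSON envelope for a finished run."""
    event = {
        "specversion": "1.0",                     # required by CloudEvents
        "id": str(uuid.uuid4()),                  # required, unique per source
        "source": "/runbook-orchestrator",        # required; hypothetical value
        "type": "fleet.runbook.completed",        # required; assumed to mirror the subject
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": {"runbook": runbook_id, "outcome": outcome},
    }
    return json.dumps(event).encode("utf-8")


print(subject_matches("fleet.*", "fleet.alert"))
print(subject_matches("fleet.*", "fleet.alert.high"))
print(subject_matches("fleet.>", "fleet.alert.high"))
print(json.loads(completed_event("rb-1", "ok"))["type"])
```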
Pros and Cons of the Options
Option A — hand-rolled IR kernel (recommended)
Pros: smallest dependency surface (D7); one kernel for both formats; full control of the safe-executor posture (D5); stdlib kernel runs standalone (D2); event layer is optional and degrades (D3); clean function-tier fit (D4). Cons: we own BPMN/CACAO semantic correctness (parallel-join, conditions) rather than inheriting it; v1 deliberately covers a subset (documented limitations: single split/join diamond, condition subset, no durable waits).
Option B — off-the-shelf BPMN engine
Pros: mature BPMN semantics for free. Cons: still needs a separate CACAO runtime; license/second-language cost; no safe-executor posture out of the box.
Option C — heavyweight SOAR/Zeebe
Pros: durable, scalable orchestration. Cons: fails isolation/footprint/supply-chain drivers decisively.
Option D — status quo
Pros: zero cost. Cons: the requirement goes unmet.
Open Questions
- Durable long-running waits. A real intermediateCatchEvent (wait days for an external signal) needs resumable state. v1 passes through in one-shot mode and correlates at the dispatcher in serve mode. When durable waits are needed, where does state live — JetStream KV, a small embedded store, or an external one? (Likely punts to a v2.)
- Enabling real execution. What is the opt-in shape for ssh/bash? Proposed: an env flag plus a per-command allowlist plus HITL routing (ARC-ADR-006) for anything destructive. Does the executor call out to a separate, even-more-isolated exec sidecar rather than running commands in-process?
- CACAO trigger grammar. CACAO has no native event-trigger field; v1 uses a non-normative step_extensions.trigger block on the start step. Should the fleet standardize that extension, or carry triggers in a sidecar manifest?
- Condition engine. Full STIX Patterning (CACAO) and full FEEL (BPMN) are large grammars. v1 supports a comparison subset. Do we adopt a vetted library per language, or grow the subset as runbooks demand?
- Relationship to middle-core issue #93. Should middle-core's semantic runbook/playbook artifacts compile to CACAO/BPMN that this engine executes, making the orchestrator the runtime for the ontology layer?
- Signature verification. CACAO supports JWS signing (TLP markings, signed playbooks). When do we verify signatures before executing a playbook from an untrusted source?
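One possible shape for the proposed opt-in (env flag, per-command allowlist, HITL for destructive operations) is sketched below. It only returns a decision and never runs anything. The flag name `RUNBOOK_ENABLE_EXEC`, the function, and the decision strings are hypothetical, and matching on the first argv token is deliberately simplistic: it would not by itself stop an allowlisted shell from being used as a wrapper.

```python
import shlex
from typing import Callable, Iterable, Mapping


def exec_decision(command_line: str, env: Mapping[str, str],
                  allowlist: Iterable[str], destructive: Iterable[str],
                  hitl_approve: Callable[[str], bool]) -> str:
    """Return 'dry-run', 'deny', or 'execute' for one command line."""
    if env.get("RUNBOOK_ENABLE_EXEC") != "1":    # hypothetical flag; off by default
        return "dry-run"
    try:
        argv = shlex.split(command_line)
    except ValueError:                            # unbalanced quotes and similar
        return "deny"
    if not argv or argv[0] not in set(allowlist):
        return "deny"
    if argv[0] in set(destructive) and not hitl_approve(command_line):
        return "deny"                             # a human declined the destructive op
    return "execute"


# Usage with an in-memory environment and an approver that always declines.
enabled = {"RUNBOOK_ENABLE_EXEC": "1"}
print(exec_decision("uptime", {}, ["uptime"], [], lambda c: False))
print(exec_decision("uptime", enabled, ["uptime", "rm"], ["rm"], lambda c: False))
print(exec_decision("rm -rf /tmp/x", enabled, ["uptime", "rm"], ["rm"], lambda c: False))
```

Keeping the decision separate from the act of execution would also leave room for the isolated exec sidecar raised in the open question, since the sidecar could consume the decision rather than share the orchestrator's process.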
Related Decisions
- ARC-ADR-023: Container tiering — the orchestrator is a function-tier image.
- ARC-ADR-022: Event bus bridges — NATS JetStream + CloudEvents, the trigger transport.
- ARC-ADR-001 / ARC-ADR-006: HITL decision points / destructive ops — the gate for manual steps and (future) real destructive execution.
- ARC-ADR-010: Observability standard — structured NDJSON traces + OTel spans.
- ARC-ADR-029: Forge — the reference function-tier image pattern this one mirrors (image.json + setup + doctor).
- middle-core #93 — workflow-orchestrator: runbooks/playbooks as middle-core semantic artifacts.
Revision History
| Version | Date | Author | Change |
|---|---|---|---|
| 0.1 | 2026-05-28 | Claude Code (assisted) | Initial Proposed draft from an interactive design session: research (BPMN 2.0 + CACAO 2.0) → shared-IR kernel design → function-tier image scaffold + safe-orchestrator posture |