
Agent orchestration with m9m: sandboxing, MCP, and human review in one binary

How to run production AI agents on m9m — sandboxed CLI nodes for Claude Code / Codex / Aider, MCP tool exposure, checkpoints, and human-in-the-loop flows.

Neul Labs · #agents #mcp #sandboxing #claude-code #m9m

If you’ve built an agent, you know the split. The model is ~10% of the work. The other 90% is scaffolding: retries, timeouts, rate limits, cost caps, sandboxing, human review, observability, audit logs. Workflow engines have been doing that 90% for a decade. m9m is the runtime that lets you reuse it — without giving up on the agent-shaped nodes you actually want.

This article walks through how agent orchestration works in m9m: what a CLI node looks like, how MCP integration works, how checkpoints survive crashes, and how to wire a human-review step.

The CLI agent node

A CLI node runs a command-line tool (Claude Code, Codex, Aider, or your own binary) inside a sandboxed environment. It takes a sandbox spec and a prompt; it returns stdout, stderr, exit code, and any file artifacts.

{
  "id": "review",
  "type": "cli.claude-code",
  "params": {
    "sandbox": true,
    "cpu": 2,
    "memory": "2Gi",
    "network": "restricted",
    "workdir": "/tmp/work",
    "prompt": "Review this diff for security issues:\n{{ $node.fetch.diff }}",
    "timeout": "5m",
    "allowed_commands": ["rg", "cat", "git"]
  }
}

Under the hood, m9m starts a process in a fresh set of Linux namespaces (PID, mount, network, UTS, IPC, user), applies cgroup limits for CPU and memory, restricts network access to an allow-list (or denies it entirely), and mounts the working directory read-write while the rest of the filesystem is either read-only or invisible. When the agent exits, the namespaces are torn down; nothing persists except the artifacts the node explicitly returns.
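The namespace and cgroup work needs privileges, but the resource-capping half of the idea is easy to approximate in userspace. A rough, unprivileged analogue using only the Python stdlib — `setrlimit` instead of cgroups, no namespacing at all, so this is an illustration of the caps, not a reimplementation of m9m's sandbox:

```python
import resource
import subprocess
import sys

def run_limited(cmd, mem_bytes, cpu_seconds):
    """Run cmd with address-space and CPU-time caps applied in the child.

    A single-process stand-in for cgroup limits: the child gets
    MemoryError (or SIGXCPU) when it exceeds the caps. Real isolation
    (PID/mount/network namespaces) needs privileges and is not done here.
    Linux/POSIX only.
    """
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    return subprocess.run(cmd, preexec_fn=apply_limits,
                          capture_output=True, text=True)

# A child that tries to allocate 1 GiB under a 256 MiB cap fails;
# a well-behaved child runs normally.
result = run_limited(
    [sys.executable, "-c", "x = bytearray(1024 * 1024 * 1024)"],
    mem_bytes=256 * 1024 * 1024,
    cpu_seconds=5,
)
print(result.returncode)
```

The point of the sketch is the shape: limits are applied *in the child, before exec*, so the agent process can never run a single instruction outside them — the same ordering cgroups and namespaces give you, done properly.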

MCP — when the agent manages the agent

m9m ships an MCP server. Start it with m9m serve --mcp --port 8080 and any MCP-aware client can discover 37 tools:

  • workflow.list, workflow.get, workflow.create, workflow.update, workflow.delete
  • run.start, run.cancel, run.logs, run.retry, run.artifacts
  • node.list, node.describe
  • credential.list, credential.use
  • audit.search, audit.stream
  • … and more.
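MCP is JSON-RPC 2.0 under the hood, so driving these tools from a script means shaping a `tools/call` request. A sketch — the envelope and method name come from the MCP spec, the tool name from the list above; the `workflow` argument is an illustrative assumption, not documented m9m API:

```python
import json

def mcp_call(tool_name, arguments, request_id=1):
    """Build an MCP tools/call request (standard JSON-RPC 2.0 envelope)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical: ask the orchestrator to start a run of "pr-review".
req = mcp_call("run.start", {"workflow": "pr-review"})
payload = json.dumps(req)
```

Any MCP client library will do this framing for you; the snippet just shows that there is no magic between "the agent decides to retry a run" and a single small JSON object hitting the server.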

This unlocks a pattern we keep reaching for: one m9m workflow is the orchestrator, and sub-agents talk to it over MCP. The orchestrator handles scheduling, retries, and checkpointing; the agents handle reasoning. Failures surface as structured run logs; the orchestrator decides whether to re-run, escalate to a human, or give up.

Every MCP action is audited. Every workflow edit is versioned in Git. That last point matters more than people expect — it means an agent can refactor a workflow and you get a PR-shaped record of what changed and why.

Checkpoints and human review

A workflow can include a human.review node. When execution reaches it, m9m pauses and emits a resumable token. The workflow stops consuming resources. When a human approves (via the UI, API, or another workflow), the run resumes from exactly that point. The same mechanism handles crashes — state is checkpointed between nodes, so a killed process resumes from the last checkpoint, not from the top.

This is the piece most ad-hoc agent frameworks skip, and it is the piece that matters most in production. Agents fail. You need to pause, show a human the evidence, and let them decide — without re-running the first twenty minutes of work.
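The mechanics are simple enough to sketch in a few lines. A toy resumable pipeline — this assumes nothing about m9m's actual checkpoint format, it only demonstrates the invariant: persist each node's output as it completes, and on restart skip everything already persisted:

```python
import json
from pathlib import Path

def run_pipeline(nodes, checkpoint_path):
    """Execute nodes in order, checkpointing between them.

    nodes: list of (node_id, fn) pairs, where fn(state) returns the
    node's output. A restarted run reloads the checkpoint file and
    resumes after the last completed node instead of from the top.
    """
    path = Path(checkpoint_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    for node_id, fn in nodes:
        if node_id in state:
            continue                          # already checkpointed: skip
        state[node_id] = fn(state)            # run the node
        path.write_text(json.dumps(state))    # checkpoint between nodes
    return state
```

A human.review node fits the same invariant: it is just a node that refuses to complete until an external approval arrives, so the checkpoint sits parked right before it at zero cost.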

Cost controls

Three layers:

  1. Per-run — a workflow can declare a max cost; m9m refuses to start LLM nodes that would exceed it.
  2. Per-workflow — rolling budgets over a time window.
  3. Per-workspace — tenant-level caps for multi-team deployments.

Cost is tracked per-node, per-run, and exposed as Prometheus metrics.
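The layering is just three checks against three counters before an LLM node is allowed to start. A hedged sketch of that admission logic — names and structure are illustrative, not m9m's internal types:

```python
from dataclasses import dataclass

@dataclass
class Budgets:
    run_max: float        # per-run cap declared by the workflow
    workflow_max: float   # rolling budget for this workflow's window
    workspace_max: float  # tenant-level cap

@dataclass
class Spent:
    run: float
    workflow: float
    workspace: float

def admit_llm_node(estimated_cost, budgets, spent):
    """Refuse to start an LLM node if any layer's cap would be exceeded."""
    checks = [
        ("run", spent.run, budgets.run_max),
        ("workflow", spent.workflow, budgets.workflow_max),
        ("workspace", spent.workspace, budgets.workspace_max),
    ]
    for layer, used, cap in checks:
        if used + estimated_cost > cap:
            return False, f"would exceed {layer} budget"
    return True, "ok"
```

Checking *before* the node starts is the important design choice: a cap that only fires after the tokens are spent is an alert, not a control.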

A worked example: PR reviewer

A common starter setup.

{
  "trigger": { "type": "webhook", "path": "/github/pr" },
  "nodes": [
    { "id": "verify",   "type": "github.webhook.verify", "params": { "secret": "{{ $cred.github_webhook }}" } },
    { "id": "fetch",    "type": "github.pr.get", "params": { "pr": "{{ $json.pr_url }}" } },
    { "id": "review",   "type": "cli.claude-code", "params": {
        "sandbox": true, "cpu": 2, "memory": "2Gi",
        "prompt": "Review this diff. Focus on security and data-loss risks:\n{{ $node.fetch.diff }}",
        "timeout": "5m"
    }},
    { "id": "approve",  "type": "human.review", "params": { "notify": "#eng-reviews" } },
    { "id": "post",     "type": "github.pr.comment", "params": { "body": "{{ $node.review.output }}" } }
  ]
}

Webhook from GitHub → signature verify → pull diff → sandboxed Claude Code → human approval in Slack → posted comment. One JSON file. One binary. If it crashes mid-review, it resumes from the last checkpoint. If the human declines, the PR comment never goes out.
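The verify step at the front of that chain is the standard GitHub webhook check: GitHub signs the raw request body with the shared secret and sends the hex digest in the X-Hub-Signature-256 header. A minimal stdlib version of what a node like github.webhook.verify has to do:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes,
                            signature_header: str) -> bool:
    """Validate an X-Hub-Signature-256 header ("sha256=<hex>")
    against the raw request body, per GitHub's webhook scheme."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing
    return hmac.compare_digest(expected, signature_header)
```

Note that the check runs over the raw bytes, before any JSON parsing — verify first, then trust $json.pr_url.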

Where this doesn’t fit

  • Research agents doing exploratory work in a notebook. m9m is overkill — run your Python script.
  • Agents with multi-hour reasoning chains that need fine-grained retry semantics per sub-step. Look at LangGraph or a custom Python orchestrator. (You can still run those inside an m9m CLI node, though — and you probably should in production.)
  • Pure research without deployment pressure. Skip the scaffolding entirely.

Need help shipping agents or migrating off n8n?

Neul Labs — the team behind m9m — takes on a limited number of consulting engagements each quarter. We help teams migrate n8n workflows, build custom Go nodes, sandbox AI agents in production, and design automation platforms that don’t collapse under load.