
Agent orchestration with m9m: sandboxing, MCP, and human review in one binary

How to run production AI agents on m9m — sandboxed CLI nodes for Claude Code / Codex / Aider, MCP tool exposure, checkpoints, and human-in-the-loop flows.

Neul Labs · #agents #mcp #sandboxing #claude-code #m9m

If you’ve built an agent, you know the split. The model is ~10% of the work. The other 90% is scaffolding: retries, timeouts, rate limits, cost caps, sandboxing, human review, observability, audit logs. Workflow engines have been doing that 90% for a decade. m9m is the runtime that lets you reuse it — without giving up on the agent-shaped nodes you actually want.

This article walks through how agent orchestration works in m9m: what a CLI node looks like, how MCP integration works, how checkpoints survive crashes, and how to wire a human-review step.

The CLI agent node

A CLI node runs a command-line tool (Claude Code, Codex, Aider, or your own binary) inside a sandboxed environment. It takes a sandbox spec and a prompt; it returns stdout, stderr, exit code, and any file artifacts.

{
  "id": "review",
  "type": "cli.claude-code",
  "params": {
    "sandbox": true,
    "cpu": 2,
    "memory": "2Gi",
    "network": "restricted",
    "workdir": "/tmp/work",
    "prompt": "Review this diff for security issues:\n{{ $node.fetch.diff }}",
    "timeout": "5m",
    "allowed_commands": ["rg", "cat", "git"]
  }
}

Under the hood, m9m starts a process in a fresh set of Linux namespaces (PID, mount, network, UTS, IPC, user), applies cgroup limits for CPU and memory, restricts network access to an allow-list (or denies it entirely), and mounts the working directory read-write while the rest of the filesystem is either read-only or invisible. When the agent exits, the namespaces are torn down; nothing persists except the artifacts the node explicitly returns.
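The namespace and cgroup work needs privileges, but the resource-capping half of the idea is easy to approximate in userspace. A rough, unprivileged analogue using only the Python stdlib — `setrlimit` instead of cgroups, no namespacing at all, so this is an illustration of the caps, not a reimplementation of m9m's sandbox:

```python
import resource
import subprocess
import sys

def run_limited(cmd, mem_bytes, cpu_seconds):
    """Run cmd with address-space and CPU-time caps applied in the child.

    A single-process stand-in for cgroup limits: the child gets
    MemoryError (or SIGXCPU) when it exceeds the caps. Real isolation
    (PID/mount/network namespaces) needs privileges and is not done here.
    Linux/POSIX only.
    """
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    return subprocess.run(cmd, preexec_fn=apply_limits,
                          capture_output=True, text=True)

# A child that tries to allocate 1 GiB under a 256 MiB cap fails;
# a well-behaved child runs normally.
result = run_limited(
    [sys.executable, "-c", "x = bytearray(1024 * 1024 * 1024)"],
    mem_bytes=256 * 1024 * 1024,
    cpu_seconds=5,
)
print(result.returncode)
```

The point of the sketch is the shape: limits are applied *in the child, before exec*, so the agent process can never run a single instruction outside them — the same ordering cgroups and namespaces give you, done properly.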

MCP — when the agent manages the agent

m9m ships an MCP server. Start it with m9m serve --mcp --port 8080 and any MCP-aware client can discover 37 tools:

  • workflow.list, workflow.get, workflow.create, workflow.update, workflow.delete
  • run.start, run.cancel, run.logs, run.retry, run.artifacts
  • node.list, node.describe
  • credential.list, credential.use
  • audit.search, audit.stream
  • … and more.
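MCP is JSON-RPC 2.0 under the hood, so driving these tools from a script means shaping a `tools/call` request. A sketch — the envelope and method name come from the MCP spec, the tool name from the list above; the `workflow` argument is an illustrative assumption, not documented m9m API:

```python
import json

def mcp_call(tool_name, arguments, request_id=1):
    """Build an MCP tools/call request (standard JSON-RPC 2.0 envelope)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical: ask the orchestrator to start a run of "pr-review".
req = mcp_call("run.start", {"workflow": "pr-review"})
payload = json.dumps(req)
```

Any MCP client library will do this framing for you; the snippet just shows that there is no magic between "the agent decides to retry a run" and a single small JSON object hitting the server.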

This unlocks a pattern we keep reaching for: one m9m workflow is the orchestrator, and sub-agents talk to it over MCP. The orchestrator handles scheduling, retries, and checkpointing; the agents handle reasoning. Failures surface as structured run logs; the orchestrator decides whether to re-run, escalate to a human, or give up.

Every MCP action is audited. Every workflow edit is versioned in Git. That last point matters more than people expect — it means an agent can refactor a workflow and you get a PR-shaped record of what changed and why.

Checkpoints and human review

A workflow can include a human.review node. When execution reaches it, m9m pauses and emits a resumable token. The workflow stops consuming resources. When a human approves (via the UI, API, or another workflow), the run resumes from exactly that point. The same mechanism handles crashes — state is checkpointed between nodes, so a killed process resumes from the last checkpoint, not from the top.

This is the piece most ad-hoc agent frameworks skip, and it is the piece that matters most in production. Agents fail. You need to pause, show a human the evidence, and let them decide — without re-running the first twenty minutes of work.
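The mechanics are simple enough to sketch in a few lines. A toy resumable pipeline — this assumes nothing about m9m's actual checkpoint format, it only demonstrates the invariant: persist each node's output as it completes, and on restart skip everything already persisted:

```python
import json
from pathlib import Path

def run_pipeline(nodes, checkpoint_path):
    """Execute nodes in order, checkpointing between them.

    nodes: list of (node_id, fn) pairs, where fn(state) returns the
    node's output. A restarted run reloads the checkpoint file and
    resumes after the last completed node instead of from the top.
    """
    path = Path(checkpoint_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    for node_id, fn in nodes:
        if node_id in state:
            continue                          # already checkpointed: skip
        state[node_id] = fn(state)            # run the node
        path.write_text(json.dumps(state))    # checkpoint between nodes
    return state
```

A human.review node fits the same invariant: it is just a node that refuses to complete until an external approval arrives, so the checkpoint sits parked right before it at zero cost.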

Cost controls

Three layers:

  1. Per-run — a workflow can declare a max cost; m9m refuses to start LLM nodes that would exceed it.
  2. Per-workflow — rolling budgets over a time window.
  3. Per-workspace — tenant-level caps for multi-team deployments.

Cost is tracked per-node, per-run, and exposed as Prometheus metrics.
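The layering is just three checks against three counters before an LLM node is allowed to start. A hedged sketch of that admission logic — names and structure are illustrative, not m9m's internal types:

```python
from dataclasses import dataclass

@dataclass
class Budgets:
    run_max: float        # per-run cap declared by the workflow
    workflow_max: float   # rolling budget for this workflow's window
    workspace_max: float  # tenant-level cap

@dataclass
class Spent:
    run: float
    workflow: float
    workspace: float

def admit_llm_node(estimated_cost, budgets, spent):
    """Refuse to start an LLM node if any layer's cap would be exceeded."""
    checks = [
        ("run", spent.run, budgets.run_max),
        ("workflow", spent.workflow, budgets.workflow_max),
        ("workspace", spent.workspace, budgets.workspace_max),
    ]
    for layer, used, cap in checks:
        if used + estimated_cost > cap:
            return False, f"would exceed {layer} budget"
    return True, "ok"
```

Checking *before* the node starts is the important design choice: a cap that only fires after the tokens are spent is an alert, not a control.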

A worked example: PR reviewer

A common starter setup.

{
  "trigger": { "type": "webhook", "path": "/github/pr" },
  "nodes": [
    { "id": "verify",   "type": "github.webhook.verify", "params": { "secret": "{{ $cred.github_webhook }}" } },
    { "id": "fetch",    "type": "github.pr.get", "params": { "pr": "{{ $json.pr_url }}" } },
    { "id": "review",   "type": "cli.claude-code", "params": {
        "sandbox": true, "cpu": 2, "memory": "2Gi",
        "prompt": "Review this diff. Focus on security and data-loss risks:\n{{ $node.fetch.diff }}",
        "timeout": "5m"
    }},
    { "id": "approve",  "type": "human.review", "params": { "notify": "#eng-reviews" } },
    { "id": "post",     "type": "github.pr.comment", "params": { "body": "{{ $node.review.output }}" } }
  ]
}

Webhook from GitHub → signature verify → pull diff → sandboxed Claude Code → human approval in Slack → posted comment. One JSON file. One binary. If it crashes mid-review, it resumes from the last checkpoint. If the human declines, the PR comment never goes out.
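The verify step at the front of that chain is the standard GitHub webhook check: GitHub signs the raw request body with the shared secret and sends the hex digest in the X-Hub-Signature-256 header. A minimal stdlib version of what a node like github.webhook.verify has to do:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes,
                            signature_header: str) -> bool:
    """Validate an X-Hub-Signature-256 header ("sha256=<hex>")
    against the raw request body, per GitHub's webhook scheme."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing
    return hmac.compare_digest(expected, signature_header)
```

Note that the check runs over the raw bytes, before any JSON parsing — verify first, then trust $json.pr_url.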

Where this doesn’t fit

  • Research agents doing exploratory work in a notebook. m9m is overkill — run your Python script.
  • Agents with multi-hour reasoning chains that need fine-grained retry semantics per sub-step. Look at LangGraph or a custom Python orchestrator. (You can still run those inside an m9m CLI node, though — and you probably should in production.)
  • Pure research without deployment pressure. Skip the scaffolding entirely.

Need help shipping agents or migrating off n8n?

Neul Labs — the team behind m9m — takes on a limited number of consulting engagements each quarter. We help teams migrate n8n workflows, build custom Go nodes, sandbox AI agents in production, and design automation platforms that don’t collapse under load.