AgentHarness: an OpenClaw SRE Sandbox on kind, by Tom O'Rourke

kagent's newest primitive, the AgentHarness, does something the Agent and SandboxAgent types do not: it has no runtime baked in. Instead kagent asks an OpenShell gateway to provision a long-lived OpenClaw sandbox that you attach to. This lab gives that sandbox a real job. It is your on-call SRE: you ask it, in natural language, to triage the cluster it is running in and fix what is broken. The interesting part is the boundary. What the sandbox is allowed to change is decided by Kubernetes RBAC, not by how nicely you phrase the prompt.

What an AgentHarness is

A normal kagent Agent bundles a runtime and serves requests. An AgentHarness is different: it is a declaration that says "stand me up a sandbox of this kind, keep it ready, and surface it next to my agents." The backend owns the environment's lifecycle. On the verified CRD the shape is small:

yaml/agentharness.yaml

apiVersion: kagent.dev/v1alpha2
kind: AgentHarness
metadata:
  name: sre-oncall
  namespace: kagent
spec:
  backend: openclaw            # openclaw | nemoclaw
  description: "SRE on-call sandbox - attach, triage the cluster, and fix it"
  modelConfigRef: anthropic-haiku
  network:
    allowedDomains:
      - api.anthropic.com

When this is accepted, kagent calls OpenShell, OpenShell provisions an OpenClaw sandbox pod, and the harness reports Ready=True with a connection endpoint. From there it is a shell you attach to, and OpenClaw is a capable agent with its own tools.

What the lab builds

One kind cluster, brought up by an idempotent quick.sh up:

Gateway API CRDs and the upstream agent-sandbox controller that OpenShell builds on.
OpenShell gateway, the backend the AgentHarness talks to over gRPC.
kagent OSS, with its controller pointed at OpenShell so the AgentHarness backend is registered.
A ModelConfig (Anthropic) and the AgentHarness itself, which produces the OpenClaw sandbox.
RBAC: cluster-wide read for triage, plus a fix permission that is bound only where a namespace opts in.
Two app namespaces, each with the same deliberately broken Deployment.

The only secret you must provide is your own model key. A Slack webhook is optional. Nothing secret is stored in the repo.

Bring it up

export ANTHROPIC_API_KEY=sk-ant-...
export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...   # optional

./scripts/quick.sh up

First run pulls a few large images (the OpenShell gateway and supervisor, kagent, and the OpenClaw sandbox base), so give it a few minutes. When it finishes, the harness is ready and a checkout Deployment is broken on purpose in both namespaces: it is pinned to an image tag that does not exist, so its pods stick in ImagePullBackOff.

yaml/broken-app/deployment.yaml (applied into incident + payments)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  labels: { app: checkout }
spec:
  replicas: 1
  selector:
    matchLabels: { app: checkout }
  template:
    metadata:
      labels: { app: checkout }
    spec:
      containers:
        - name: checkout
          image: nginx:9.99-doesnotexist   # BROKEN ON PURPOSE: tag does not exist
          ports:
            - containerPort: 80

kubectl --context kind-harness -n kagent get agentharness sre-oncall
# NAME         BACKEND    READY   ID                  AGE
# sre-oncall   openclaw   True    kagent-sre-oncall   1m
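If you script around the lab, you may prefer to block until the harness is ready rather than re-running get. The following is a minimal sketch, not part of the lab's scripts: wait_for_harness is a hypothetical helper, and it assumes the harness exposes a standard status condition named Ready (the READY column above suggests so, but the condition name is an assumption).

```shell
# Hypothetical helper: block until the AgentHarness reports Ready.
# ASSUMPTION: the CRD publishes a status condition of type "Ready".
wait_for_harness() {
  name=${1:-sre-oncall}     # harness name from the manifest above
  timeout=${2:-300s}        # first run pulls large images, so be generous
  kubectl --context kind-harness -n kagent wait \
    --for=condition=Ready "agentharness/$name" --timeout="$timeout"
}
```

Calling wait_for_harness with no arguments waits up to five minutes for sre-oncall; pass a name and a timeout to override either.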

Ask OpenClaw to triage

Interaction is a one-liner. ask.sh streams OpenClaw's reasoning and its actions back to your terminal. Start with a read-only question:

./scripts/ask.sh "what is broken in the cluster?"

Under the hood this attaches to the sandbox and runs an OpenClaw agent turn. OpenClaw lists namespaces, finds the ImagePullBackOff pods, and describes the root cause it sees: an image tag that does not exist. Now ask it to remediate the whole cluster, and tell it how to escalate anything it cannot fix:

./scripts/ask.sh "Triage every namespace for broken workloads. Fix what you are \
permitted to. If Kubernetes denies a change (403), do NOT force it - post a concise \
summary to the Slack webhook in /sandbox/.slack-webhook via curl. Summarize what you \
fixed and what you escalated."
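For comparison, the read-only half of that triage can be approximated by hand. This is a rough sketch of one way to list pods stuck on an image pull, not a description of what OpenClaw actually runs; find_image_pull_failures is a hypothetical helper name.

```shell
# Hypothetical helper: print "namespace pod reason" for every pod with a
# container waiting on ImagePullBackOff or ErrImagePull.
find_image_pull_failures() {
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' |
    awk '{
      # Fields 3..NF hold the waiting reasons, one per waiting container.
      for (i = 3; i <= NF; i++)
        if ($i == "ImagePullBackOff" || $i == "ErrImagePull") {
          print $1, $2, $i
          break
        }
    }'
}
```

It needs only the get and list verbs on pods, which the cluster-wide read role in this lab covers.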

The guardrail is RBAC, keyed off a namespace label

The same fault exists in two namespaces. The only difference between them is one label.

incident (may fix)
label: autofix=true
read: yes (triage)
patch: yes
outcome: OpenClaw sets a valid image, pod recovers

payments (triage only)
label: none
read: yes (triage)
patch: 403 Forbidden
outcome: OpenClaw escalates to Slack, leaves it untouched

That label is set on the Namespace object itself:

yaml/namespaces.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: incident
  labels:
    autofix: "true"            # OpenClaw may APPLY fixes here
---
apiVersion: v1
kind: Namespace
metadata:
  name: payments               # no autofix label -> triage only

RBAC has no label selector of its own, so the label is turned into an authorization boundary by where the fix permission is bound. The sandbox's ServiceAccount gets a cluster-wide read so it can triage anything. The write rules live in a ClusterRole that is not bound cluster-wide:

yaml/sre-rbac.yaml

# Cluster-wide READ -> triage anywhere (bound via a ClusterRoleBinding).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-harness-read
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "events", "nodes", "namespaces", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
---
# WRITE rules as a ClusterRole with NO cluster-wide binding. A namespaced
# RoleBinding grants it only in autofix=true namespaces (the reconcile below).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-harness-fix
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "delete"]

A small reconcile then creates a namespaced RoleBinding to the fix ClusterRole only in namespaces labelled autofix=true:

04-harness.sh: the reconcile (copy-paste runnable)

# The OpenShell sandbox runs as this ServiceAccount (04-harness.sh discovers it
# for you; these are the OpenShell defaults this lab uses):
SANDBOX_NS=openshell
SANDBOX_SA=openshell-sandbox

# Bind the fix ClusterRole only in namespaces that opt in via the label:
for ns in $(kubectl get ns -l autofix=true -o name); do
  kubectl -n "${ns##*/}" create rolebinding sre-harness-fix \
    --clusterrole=sre-harness-fix \
    --serviceaccount="$SANDBOX_NS:$SANDBOX_SA" \
    --dry-run=client -o yaml | kubectl apply -f -
done
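The loop as shown only adds bindings; the removal side is not part of this excerpt. A complementary pass could look like the sketch below. It is one possible approach under stated assumptions, not necessarily how 04-harness.sh does it, and prune_fix_bindings is a hypothetical helper name.

```shell
# Hypothetical helper: delete the fix RoleBinding from any namespace that no
# longer carries autofix=true.
prune_fix_bindings() {
  # Every namespace that currently holds a RoleBinding named sre-harness-fix.
  kubectl get rolebinding --all-namespaces \
    --field-selector metadata.name=sre-harness-fix \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}' |
  while read -r ns; do
    [ -n "$ns" ] || continue
    # Empty output means the namespace has no autofix label at all.
    label=$(kubectl get ns "$ns" -o jsonpath='{.metadata.labels.autofix}')
    if [ "$label" != "true" ]; then
      kubectl -n "$ns" delete rolebinding sre-harness-fix
    fi
  done
}
```

Running the create loop and then this prune gives a reconcile that converges in both directions: bindings exist exactly where the label does.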

Label a namespace and re-run the reconcile to extend the agent's reach. Remove the label and the binding goes away. You can prove the boundary directly, as the sandbox's identity, before the agent ever runs:

$ kubectl auth can-i patch deploy -n incident
yes
$ kubectl auth can-i patch deploy -n payments
no
$ kubectl -n payments set image deploy/checkout checkout=nginx:1.27-alpine
Error from server (Forbidden): deployments.apps "checkout" is forbidden: User "system:serviceaccount:openshell:openshell-sandbox" cannot patch resource "deployments" in API group "apps" in the namespace "payments"
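If you are not inside the sandbox, the same check can be run from your own shell by impersonating its ServiceAccount with kubectl's --as flag. A minimal sketch, assuming your kubeconfig user is allowed to impersonate (the kind admin is) and that the ServiceAccount is the openshell/openshell-sandbox default used in this lab; check_fix_boundary is a hypothetical helper name.

```shell
# Hypothetical helper: ask the API server whether the sandbox ServiceAccount
# may patch Deployments in each namespace given as an argument.
check_fix_boundary() {
  sa="system:serviceaccount:${SANDBOX_NS:-openshell}:${SANDBOX_SA:-openshell-sandbox}"
  for ns in "$@"; do
    # can-i exits non-zero when the answer is "no"; keep looping regardless.
    answer=$(kubectl auth can-i patch deployments -n "$ns" --as="$sa" || true)
    printf '%s: %s\n' "$ns" "$answer"
  done
}
```

For example, check_fix_boundary incident payments prints one line per namespace with the API server's answer.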

Escalate to Slack when a fix is denied

The escalation path is deliberately simple: a Slack Incoming Webhook the agent posts to with curl. The webhook URL is never put in the AgentHarness object or in the repo. It is written into the sandbox at runtime from your environment, and the agent reads it from there only when it needs to escalate. When OpenClaw remediates the cluster, one turn does both things at once:

## Summary
FIXED: 1
  incident/checkout - invalid image nginx:9.99-doesnotexist
  set image -> nginx:1.27-alpine, pod Running (1/1)
ESCALATED: 1
  payments/checkout - same fault, patch returned 403 Forbidden
  posted to Slack webhook -> HTTP 200

The agent tried to fix payments, Kubernetes refused, and it did exactly what an on-call human would: it raised it rather than working around the permission. Without a webhook set, the same escalation lands in its reply instead.
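The escalation itself is small enough to sketch. The helper below is a hypothetical illustration of posting a single-line message to a Slack Incoming Webhook with curl, falling back to plain output when no webhook file exists. The agent composes its own command at runtime, so treat this as an approximation; only the /sandbox/.slack-webhook path comes from the lab.

```shell
# Hypothetical helper: post a single-line message to a Slack Incoming Webhook,
# or print it when no webhook has been written into the sandbox.
escalate() {
  message=$1
  webhook_file=${2:-/sandbox/.slack-webhook}
  if [ ! -s "$webhook_file" ]; then
    printf 'ESCALATION (no webhook configured): %s\n' "$message"
    return 0
  fi
  # Escape backslashes and double quotes so the message is a valid JSON string.
  # Newlines are not handled; keep the message on one line.
  escaped=$(printf '%s' "$message" | sed 's/\\/\\\\/g; s/"/\\"/g')
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$escaped\"}" "$(cat "$webhook_file")"
}
```

Reading the URL from a file at call time keeps it out of the prompt, the AgentHarness object, and shell history.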

Wiring the model: pointing OpenClaw at the provider

One detail is worth calling out because it is easy to trip over. kagent reads your model key from the ModelConfig's secret and hands it to the sandbox, but it does not write the key into OpenClaw's config in clear text. It writes a placeholder that the OpenShell gateway resolves from the sandbox process environment at request time. And when the ModelConfig sets no explicit base URL, kagent defaults OpenClaw to an internal inference proxy address that exists in an enterprise deployment but not in a self-contained kind lab. The fix is one field on the ModelConfig, pointing OpenClaw straight at the provider:

yaml/modelconfig.yaml

apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: anthropic-haiku
  namespace: kagent
spec:
  provider: Anthropic
  model: claude-haiku-4-5
  apiKeySecret: kagent-anthropic
  apiKeySecretKey: ANTHROPIC_API_KEY
  anthropic:
    maxTokens: 4096
    temperature: "0.2"
    baseUrl: https://api.anthropic.com/v1   # reach the provider directly
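The ModelConfig above only references the key; the Secret has to exist in the same namespace. quick.sh presumably creates it for you, so the sketch below is only meant to make the wiring concrete. The Secret name, key, and namespace are taken from the manifest; create_model_secret is a hypothetical helper name.

```shell
# Hypothetical helper: create or update the Secret that the ModelConfig's
# apiKeySecret / apiKeySecretKey fields point at.
create_model_secret() {
  # Fail early with a clear message if the key was never exported.
  : "${ANTHROPIC_API_KEY:?export ANTHROPIC_API_KEY first}"
  kubectl -n kagent create secret generic kagent-anthropic \
    --from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

The dry-run piped into apply mirrors the reconcile pattern used for the RoleBinding, so re-running it is safe.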

The lab's equip step also installs kubectl into the sandbox and writes a kubeconfig that authenticates as the sandbox ServiceAccount, so the agent's reads and writes carry the identity the RBAC above is written against.

What is enforced versus guided

This is the point of the lab, so it is worth being precise. The autofix rule is enforced by Kubernetes RBAC. The agent is not pre-told which namespaces it may change; it simply discovers, when it tries, that a namespace will not accept its patch. That 403 is real. A different prompt, a different model, or a confused agent cannot get past it, because the permission to write was never granted there. The only thing the prompt supplies is the escalation behaviour: when blocked, notify rather than retry. Infrastructure decides the blast radius; the model decides the bedside manner.

Reset and teardown

./scripts/06-broken-app.sh        # re-break both deployments to run it again
./scripts/quick.sh teardown       # delete the kind cluster

Versions

Built and verified on:

OSS: validated 2026-06-18
Gateway API: v1.4.0
