kagent's newest primitive, the AgentHarness, does something the
Agent and SandboxAgent types do not: it has no runtime baked
in. Instead kagent asks an OpenShell gateway to provision a long-lived
OpenClaw sandbox that you attach to. This lab gives that sandbox a real
job. It is your on-call SRE: you ask it, in natural language, to triage the cluster it
is running in and fix what is broken. The interesting part is the boundary. What the
sandbox is allowed to change is decided by Kubernetes RBAC, not by how nicely
you phrase the prompt.
What an AgentHarness is
A normal kagent Agent bundles a runtime and serves requests. An
AgentHarness is different: it is a declaration that says "stand me up a
sandbox of this kind, keep it ready, and surface it next to my agents." The backend
owns the environment's lifecycle. On the CRD this lab was verified against, the shape is small:
yaml/agentharness.yaml
apiVersion: kagent.dev/v1alpha2
kind: AgentHarness
metadata:
  name: sre-oncall
  namespace: kagent
spec:
  backend: openclaw # openclaw | nemoclaw
  description: "SRE on-call sandbox - attach, triage the cluster, and fix it"
  modelConfigRef: anthropic-haiku
  network:
    allowedDomains:
      - api.anthropic.com
When this is accepted, kagent calls OpenShell, OpenShell provisions an OpenClaw
sandbox pod, and the harness reports Ready=True with a connection
endpoint. From there it is a shell you attach to, and OpenClaw is a capable agent
with its own tools.
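If you want a script to block on that readiness signal instead of checking by eye, a minimal sketch looks like this. It assumes the harness publishes Ready as a standard status condition, which the Ready=True wording suggests but this lab does not confirm. The function name wait_for_harness and the 180s default timeout are illustrative, not part of kagent.

```shell
# Block until an AgentHarness reports Ready, or fail after a timeout.
# ASSUMPTION: the CRD exposes a status condition of type "Ready".
# wait_for_harness is a hypothetical helper name used only in this sketch.
wait_for_harness() {
  local name="$1" ns="${2:-kagent}" timeout="${3:-180s}"
  kubectl -n "$ns" wait "agentharness/$name" \
    --for=condition=Ready --timeout="$timeout"
}

# Example: wait_for_harness sre-oncall kagent 300s
```

kubectl wait exits non-zero on timeout, so the function can gate later steps in a script.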
What the lab builds
One kind cluster, brought up by an idempotent quick.sh up:
- Gateway API CRDs and the upstream agent-sandbox controller that OpenShell builds on.
- OpenShell gateway, the backend the AgentHarness talks to over gRPC.
- kagent OSS, with its controller pointed at OpenShell so the AgentHarness backend is registered.
- A ModelConfig (Anthropic) and the AgentHarness itself, which produces the OpenClaw sandbox.
- RBAC: cluster-wide read for triage, plus a fix permission that is bound only where a namespace opts in.
- Two app namespaces, each with the same deliberately broken Deployment.
The only secret you must provide is your own model key. A Slack webhook is optional. Nothing secret is stored in the repo.
Bring it up
export ANTHROPIC_API_KEY=sk-ant-...
export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/... # optional
./scripts/quick.sh up
First run pulls a few large images (the OpenShell gateway and supervisor, kagent, and
the OpenClaw sandbox base), so give it a few minutes. When it finishes, the harness is
ready and a checkout Deployment is broken on purpose in both namespaces:
it is pinned to an image tag that does not exist, so its pods stick in
ImagePullBackOff.
yaml/broken-app/deployment.yaml (applied into incident + payments)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  labels: { app: checkout }
spec:
  replicas: 1
  selector:
    matchLabels: { app: checkout }
  template:
    metadata:
      labels: { app: checkout }
    spec:
      containers:
        - name: checkout
          image: nginx:9.99-doesnotexist # BROKEN ON PURPOSE: tag does not exist
          ports:
            - containerPort: 80

kubectl --context kind-harness -n kagent get agentharness sre-oncall
# NAME         BACKEND    READY   ID                  AGE
# sre-oncall   openclaw   True    kagent-sre-oncall   1m
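To see the fault yourself before handing it to the agent, the sketch below lists pods whose containers are stuck on an image pull. It uses only read access, runs against your current kubectl context, and stuck_image_pulls is a hypothetical helper name rather than one of the lab's scripts.

```shell
# Print "namespace<TAB>pod<TAB>reason" for pods stuck waiting on an image pull.
# Read-only: needs only get/list on pods. Uses the current kubectl context.
# stuck_image_pulls is a hypothetical helper, not part of the lab's scripts.
stuck_image_pulls() {
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' |
    awk -F'\t' '$3 ~ /ImagePullBackOff|ErrImagePull/'
}

# Example: stuck_image_pulls
```

In this lab you would expect one checkout pod from each of incident and payments to show up, since both run the same broken Deployment.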
Ask OpenClaw to triage
Interaction is a one-liner. ask.sh streams OpenClaw's reasoning and its
actions back to your terminal. Start with a read-only question:
./scripts/ask.sh "what is broken in the cluster?"
Under the hood this attaches to the sandbox and runs an OpenClaw agent turn. OpenClaw
lists namespaces, finds the ImagePullBackOff pods, and describes the root
cause it sees: an image tag that does not exist. Now ask it to remediate the whole
cluster, and tell it how to escalate anything it cannot fix:
./scripts/ask.sh "Triage every namespace for broken workloads. Fix what you are \
permitted to. If Kubernetes denies a change (403), do NOT force it - post a concise \
summary to the Slack webhook in /sandbox/.slack-webhook via curl. Summarize what you \
fixed and what you escalated."
The guardrail is RBAC, keyed off a namespace label
The same fault exists in two namespaces. The only difference between them is one label.
incident
- label: autofix=true
- read: yes (triage)
- patch: yes
- outcome: OpenClaw sets a valid image, pod recovers
payments
- label: none
- read: yes (triage)
- patch: 403 Forbidden
- outcome: OpenClaw escalates to Slack, leaves it untouched
The Namespace manifests make this concrete:
yaml/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: incident
  labels:
    autofix: "true" # OpenClaw may APPLY fixes here
---
apiVersion: v1
kind: Namespace
metadata:
  name: payments # no autofix label -> triage only
RBAC has no label selector of its own, so the label is turned into an authorization boundary by where the fix permission is bound. The sandbox's ServiceAccount gets cluster-wide read access so it can triage anything. The write rules live in a ClusterRole that is not bound cluster-wide:
yaml/sre-rbac.yaml
# Cluster-wide READ -> triage anywhere (bound via a ClusterRoleBinding).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-harness-read
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "events", "nodes", "namespaces", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
---
# WRITE rules as a ClusterRole with NO cluster-wide binding. A namespaced
# RoleBinding grants it only in autofix=true namespaces (the reconcile below).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-harness-fix
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "delete"]
A small reconcile then creates a namespaced RoleBinding to the fix ClusterRole only
in namespaces labelled autofix=true:
04-harness.sh: the reconcile (copy-paste runnable)
# The OpenShell sandbox runs as this ServiceAccount (04-harness.sh discovers it
# for you; these are the OpenShell defaults this lab uses):
SANDBOX_NS=openshell
SANDBOX_SA=openshell-sandbox
# Bind the fix ClusterRole only in namespaces that opt in via the label:
for ns in $(kubectl get ns -l autofix=true -o name); do
  kubectl -n "${ns##*/}" create rolebinding sre-harness-fix \
    --clusterrole=sre-harness-fix \
    --serviceaccount="$SANDBOX_NS:$SANDBOX_SA" \
    --dry-run=client -o yaml | kubectl apply -f -
done
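The loop above only creates bindings. For a binding to disappear when a namespace loses its label, something also has to delete the stale one. Here is a minimal sketch of such a pruning pass. It assumes the binding is always named sre-harness-fix as in the loop above; prune_fix_bindings is a hypothetical helper, and the lab's actual script may handle this differently.

```shell
# Remove the fix RoleBinding from namespaces that no longer carry autofix=true.
# ASSUMPTION: the binding is always named sre-harness-fix, as in the loop above.
# prune_fix_bindings is a hypothetical helper; the lab script may differ.
prune_fix_bindings() {
  local ns
  # The selector "autofix!=true" also matches namespaces with no autofix label.
  for ns in $(kubectl get ns -l 'autofix!=true' -o name); do
    kubectl -n "${ns##*/}" delete rolebinding sre-harness-fix --ignore-not-found
  done
}

# Example: run after the create loop so both directions converge.
#   prune_fix_bindings
```

Because --ignore-not-found makes the delete a no-op where no binding exists, the pass is safe to re-run.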
Label a namespace and re-run the reconcile to extend the agent's reach. Remove the label and the binding goes away. You can prove the boundary directly, as the sandbox's identity, before the agent ever runs, for example by impersonating the ServiceAccount with kubectl auth can-i.
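One way to run that check is kubectl auth can-i with impersonation, sketched below. It assumes the ServiceAccount defaults from the reconcile (openshell / openshell-sandbox) and that your own kubeconfig user is allowed to impersonate, which the kind admin normally is. can_fix is a hypothetical helper name.

```shell
# Ask the API server whether the sandbox ServiceAccount may patch Deployments
# in a given namespace. Prints "yes" or "no"; the exit status is non-zero on "no".
# Defaults mirror the reconcile above; can_fix is a hypothetical helper name.
SANDBOX_NS="${SANDBOX_NS:-openshell}"
SANDBOX_SA="${SANDBOX_SA:-openshell-sandbox}"

can_fix() {
  kubectl auth can-i patch deployments.apps -n "$1" \
    --as="system:serviceaccount:${SANDBOX_NS}:${SANDBOX_SA}"
}

# Example, once the reconcile has run:
#   can_fix incident   # should print yes (RoleBinding present)
#   can_fix payments   # should print no  (read-only there)
```

No agent and no prompt is involved here, so the answer reflects only what RBAC grants.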
Escalate to Slack when a fix is denied
The escalation path is deliberately simple: a Slack Incoming Webhook the agent posts to
with curl. The webhook URL is never put in the AgentHarness object or in
the repo. It is written into the sandbox at runtime from your environment, and the agent
reads it from there only when it needs to escalate. When OpenClaw remediates the
cluster, one turn does both things at once: it fixes checkout in incident and escalates payments.
The agent tried to fix payments, Kubernetes refused, and it did exactly
what an on-call human would: it raised it rather than working around the permission.
Without a webhook set, the same escalation lands in its reply instead.
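For reference, the post itself is small. The sketch below shows roughly what the agent's curl call amounts to, assuming the standard Slack Incoming Webhook body of {"text": "..."} and the /sandbox/.slack-webhook path from the prompt. The function names and the SLACK_WEBHOOK_FILE override are hypothetical. The escaping covers backslashes, double quotes, and newlines only, so it is not a general JSON encoder.

```shell
# Build the JSON body a Slack Incoming Webhook expects: {"text":"..."}.
# Escapes backslashes and double quotes, then folds newlines into \n.
# Other control characters are NOT handled; this is a sketch, not an encoder.
slack_payload() {
  printf '%s' "$1" |
    sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk 'BEGIN { printf "{\"text\":\"" }
         NR > 1 { printf "\\n" }
         { printf "%s", $0 }
         END { printf "\"}\n" }'
}

# Post a message to the webhook URL stored in the sandbox.
# SLACK_WEBHOOK_FILE is a hypothetical override; the default path is the one
# named in the lab's prompt.
escalate_to_slack() {
  local url_file="${SLACK_WEBHOOK_FILE:-/sandbox/.slack-webhook}"
  if [ ! -s "$url_file" ]; then
    echo "no webhook configured; report the escalation in the reply" >&2
    return 1
  fi
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$(cat "$url_file")"
}

# Example: escalate_to_slack "payments/checkout: patch denied (403), not forced"
```

Reading the URL from a file at call time keeps it out of the prompt, the AgentHarness object, and shell history.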
Wiring the model: pointing OpenClaw at the provider
One detail is worth calling out because it is easy to trip over. kagent reads your model key from the ModelConfig's secret and hands it to the sandbox, but it does not write the key into OpenClaw's config in clear text. It writes a placeholder that the OpenShell gateway resolves from the sandbox process environment at request time. Separately, when the ModelConfig sets no explicit base URL, kagent defaults OpenClaw to an internal inference proxy address that exists in an enterprise deployment but not in a self-contained kind lab. The fix is one field on the ModelConfig, pointing OpenClaw straight at the provider:
yaml/modelconfig.yaml
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: anthropic-haiku
  namespace: kagent
spec:
  provider: Anthropic
  model: claude-haiku-4-5
  apiKeySecret: kagent-anthropic
  apiKeySecretKey: ANTHROPIC_API_KEY
  anthropic:
    maxTokens: 4096
    temperature: "0.2"
    baseUrl: https://api.anthropic.com/v1 # reach the provider directly
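The ModelConfig only names the Secret; the Secret itself has to exist in the same namespace. The lab's scripts take care of that for you, so the sketch below is only an approximation of the equivalent step. It reuses the names from the manifest (kagent-anthropic, key ANTHROPIC_API_KEY, namespace kagent), while ensure_model_secret is a hypothetical helper name.

```shell
# Create or update the Secret that the ModelConfig's apiKeySecret points at.
# The key is read from the environment, so it never lands in the repo.
# ensure_model_secret is a hypothetical helper; the lab scripts may differ.
ensure_model_secret() {
  : "${ANTHROPIC_API_KEY:?export ANTHROPIC_API_KEY first}"
  kubectl -n kagent create secret generic kagent-anthropic \
    --from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Example: ensure_model_secret
```

The dry-run piped into apply follows the same pattern as the RoleBinding reconcile, so re-running it updates the Secret rather than failing on "already exists".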
The lab's equip step also installs kubectl into the sandbox and writes a
kubeconfig that authenticates as the sandbox ServiceAccount, so the agent's reads and
writes carry the identity the RBAC above is written against.
What is enforced versus guided
This is the point of the lab, so it is worth being precise. The autofix rule is enforced by Kubernetes RBAC. The agent is not pre-told which namespaces it may change; it simply discovers, when it tries, that a namespace will not accept its patch. That 403 is real. A different prompt, a different model, or a confused agent cannot get past it, because the permission to write was never granted there. The only thing the prompt supplies is the escalation behaviour: when blocked, notify rather than retry. Infrastructure decides the blast radius; the model decides the bedside manner.
Reset and teardown
./scripts/06-broken-app.sh # re-break both deployments to run it again
./scripts/quick.sh teardown # delete the kind cluster
See also
- kagent docs — AgentHarness example
- kagent docs — installation and enabling AgentHarness support
- Sibling lab — Per-User MCP Tool RBAC at the gateway
- Sibling lab — Two-layer human-in-the-loop on an MCP agent
Versions
Built and verified on:
v1.4.0