Agent-to-agent (A2A) is the part that makes a fleet of agents more than a pile of
prompts: one agent discovers another, hands it a task over a
standard protocol, and gets back a structured result. This lab puts that
on display. An on-call SRE orchestrator triages a broken
database, finds a DBA specialist by reading its A2A agent card,
and delegates the diagnosis with a real message/send call. Every
piece of that exchange is shown from the live cluster: the card, the inter-agent
call captured on the wire, and the Task that comes back. Then the
fix is applied and the database recovers. Because the lab runs on Solo Enterprise
for kagent, the caller's identity also travels with every hop as an exchanged
On-Behalf-Of token.
The use case
A field engineer pings the on-call channel: "the orders database won't start." You do not want one giant agent holding every runbook for every system. You want an on-call generalist that triages anything, and specialists it can pull in on demand, each with its own instructions, its own tools, and its own blast radius. That is the A2A pattern: the SRE orchestrator owns the incident, and when it realises the problem is a database it brings in the DBA agent rather than guessing. The DBA never talks to the user directly; it is only ever reached through the orchestrator, on the user's behalf.
sre-orchestrator
- role: Triage any cluster incident
- tools: k8s read + type: Agent → dba-agent
- a2a: Acts as an A2A client to the DBA
dba-agent
- role: Diagnose database workloads
- tools: k8s describe / logs / events
- a2a: An A2A server; skill on its agent card
The flow
A2A in action
This is the heart of the lab. Three things happen over the A2A protocol, and all three are shown from the live cluster below: discovery (read the agent card), delegation (send a task), and the result (a Task comes back).
1. Discovery: the DBA's agent card
Every kagent agent serves an A2A agent card at
/.well-known/agent.json. It advertises the agent's skills, its
transport, and its capabilities, which is how the orchestrator (or any A2A
client) knows what the DBA can do and how to call it. This is the real card:
GET /api/a2a/kagent/dba-agent/.well-known/agent.json
{
  "name": "dba_agent",
  "description": "Database SRE specialist. Diagnoses Postgres/database workload problems ...",
  "url": "http://kagent-controller.kagent.svc.cluster.local:8083/api/a2a/kagent/dba-agent/",
  "capabilities": { "streaming": true, "pushNotifications": false, "stateTransitionHistory": true },
  "defaultInputModes": ["text"],
  "defaultOutputModes": ["text"],
  "skills": [
    {
      "id": "diagnose-db",
      "name": "Diagnose a database incident",
      "description": "Find the root cause of a database workload failure and propose the fix",
      "tags": ["database", "postgres", "sre"],
      "examples": ["The orders database pod is crashlooping, why?", "Postgres won't start after the last deploy"]
    }
  ],
  "preferredTransport": "JSONRPC"
}
That one document is the whole discovery contract. Reading it top to bottom:
- name and description are the identity, and the description is what the orchestrator's model actually reads to decide this is the agent worth delegating to.
- url is the A2A endpoint a client posts to, and preferredTransport says how to talk to it (JSONRPC here), so the caller knows the wire format before it sends anything.
- capabilities are the optional protocol features a client should check before relying on them: streaming for incremental updates, pushNotifications for webhook-style callbacks, stateTransitionHistory for the task history this agent keeps.
- defaultInputModes and defaultOutputModes are the content types it accepts and returns (text for this DBA).
- skills is the catalogue of what it can do. Each skill carries an id, a human name and description, tags, and examples. The orchestrator's model matches the incident against those descriptions and examples to pick the right skill, which is exactly the selection that produces the call in step 2.
The card is served at a well-known path so any A2A client can fetch it with a plain GET, before opening a session and without authenticating. kagent serves it at /.well-known/agent.json; the A2A specification standardises the location as /.well-known/agent-card.json (RFC 8615) and also lets agents be listed in a curated registry that clients query by skill or tag. Either way the card is the thing a caller reads first: it turns "call this agent" into something a client can do mechanically, the same role MCP's Server Cards now play for tools.
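A client-side reading of the card can be sketched in a few lines of Python. This is an illustration only, not kagent's or the orchestrator's selection logic (in the lab the orchestrator's model reads the descriptions itself). It assumes the card has already been fetched and parsed, and the helper names `supports` and `skills_with_tag` are hypothetical.

```python
import json

# Trimmed copy of the DBA card shown above; in practice this comes from the GET.
CARD = json.loads("""
{
  "name": "dba_agent",
  "url": "http://kagent-controller.kagent.svc.cluster.local:8083/api/a2a/kagent/dba-agent/",
  "capabilities": {"streaming": true, "pushNotifications": false, "stateTransitionHistory": true},
  "skills": [
    {"id": "diagnose-db", "name": "Diagnose a database incident",
     "tags": ["database", "postgres", "sre"]}
  ],
  "preferredTransport": "JSONRPC"
}
""")


def supports(card: dict, capability: str) -> bool:
    """Return True only if the card explicitly advertises the capability."""
    return bool(card.get("capabilities", {}).get(capability, False))


def skills_with_tag(card: dict, tag: str) -> list[str]:
    """Return the ids of the skills whose tags include `tag`."""
    return [s["id"] for s in card.get("skills", []) if tag in s.get("tags", [])]


if __name__ == "__main__":
    print(supports(CARD, "streaming"))
    print(supports(CARD, "pushNotifications"))
    print(skills_with_tag(CARD, "postgres"))
```

Treating a missing capability as unsupported keeps the client conservative: it only relies on a protocol feature the card explicitly turns on.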
2. Delegation: the orchestrator sends a task
After it triages (it pulls the pod and events itself first), the orchestrator
delegates by sending the DBA a JSON-RPC message/send. This is the
actual agent-to-agent request captured on the wire arriving at
the DBA pod. Note the orchestrator is an HTTP client here
(python-httpx), the call is marked as agent-originated
(x-kagent-source: agent), the user's identity rides along
(x-user-id), and the orchestrator has written its own detailed task
for the specialist from what it found:
orchestrator → dba-agent, captured on dba-agent:8080
POST / HTTP/1.1
Host: dba-agent.kagent:8080
User-Agent: python-httpx/0.28.1
x-kagent-source: agent
x-user-id: ca0c9432-6f36-44cc-9fd5-c66048cdfc37

{
  "jsonrpc": "2.0",
  "id": "cff5fae8-7be6-4373-a2c3-1af848a38b62",
  "method": "message/send",
  "params": {
    "configuration": { "acceptedOutputModes": [], "blocking": true },
    "message": {
      "role": "user",
      "kind": "message",
      "contextId": "7d068fc6-3a67-4586-bf36-58c49e43b272",
      "messageId": "4d1f7743-ad78-4b74-aeaf-aed4a8973e0e",
      "parts": [{ "kind": "text", "text":
        "The orders-db pod (postgres:16-alpine) in the orders namespace is in
         CrashLoopBackOff with 30 restarts. The container exits with code 1 ...
         Environment: POSTGRES_DB=orders ... Please diagnose why the PostgreSQL
         container is failing to start and what should be fixed." }]
    }
  }
}
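The envelope above is small enough to build by hand. The following is a minimal sketch of a helper that produces the same shape; it is not the orchestrator's actual code, and the function name `build_message_send` is hypothetical. The field names are the ones visible in the captured request.

```python
import json
import uuid
from typing import Optional


def build_message_send(text: str, context_id: Optional[str] = None) -> dict:
    """Build a JSON-RPC 2.0 request for the A2A message/send method."""
    message = {
        "role": "user",
        "kind": "message",
        "messageId": str(uuid.uuid4()),
        "parts": [{"kind": "text", "text": text}],
    }
    # contextId ties the message to an existing conversation; omit it to start fresh.
    if context_id is not None:
        message["contextId"] = context_id
    return {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "message/send",
        "params": {
            "configuration": {"acceptedOutputModes": [], "blocking": True},
            "message": message,
        },
    }


if __name__ == "__main__":
    request = build_message_send("Why is orders-db crashlooping?")
    print(json.dumps(request, indent=2))
```

The JSON-RPC `id` and the `messageId` are generated separately because they identify different things: the first correlates the HTTP response with the request, the second names the message inside the A2A conversation.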
3. Result: a Task comes back
The DBA does its own work (describe, logs, events) and replies with an A2A
Task: a stateful object with a status, a
history of the messages it produced while working, and the
artifacts that hold the answer. Here is the response envelope and
the diagnosis artifact:
dba-agent → orchestrator (A2A Task)
{
  "jsonrpc": "2.0",
  "id": "cff5fae8-...",
  "result": {
    "kind": "task",
    "id": "08508326-...",
    "contextId": "135882bd-...",
    "status": { "state": "completed" },
    "history": [ /* 9 messages: the DBA's describe/logs/events steps */ ],
    "artifacts": [ { "parts": [ { "kind": "text", "text": "...diagnosis below..." } ] } ]
  }
}
the diagnosis artifact (the DBA's answer)
## Root Cause
The `orders-db` pod is crashlooping because the `POSTGRES_PASSWORD` environment
variable is not set. The container is configured with only POSTGRES_DB=orders, but
the postgres:16-alpine image requires either POSTGRES_PASSWORD for the superuser,
or POSTGRES_HOST_AUTH_METHOD=trust (not recommended). Without it, initialization
fails with exit code 1 and the pod crashloops.
## Exact Fix
1. Create a Secret with the superuser password:
kubectl create secret generic orders-db-secret \
--from-literal=password='<strong-password>' -n orders
2. Add POSTGRES_PASSWORD to the deployment from that Secret (secretKeyRef).
3. Verify:
kubectl rollout status deployment/orders-db -n orders
kubectl get pod -n orders -l app=orders-db
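On the caller's side, pulling the answer out of the Task envelope is mechanical. The sketch below assumes a blocking call that returns a finished Task shaped like the response shown earlier; the function name `task_text` and the sample ids are hypothetical, and a real client would also need to handle non-terminal states and plain Message results.

```python
def task_text(response: dict) -> str:
    """Return the concatenated text parts of a completed Task's artifacts."""
    if "error" in response:
        raise RuntimeError(f"JSON-RPC error: {response['error']}")
    result = response["result"]
    if result.get("kind") != "task":
        raise ValueError(f"expected a task, got {result.get('kind')!r}")
    state = result["status"]["state"]
    if state != "completed":
        raise RuntimeError(f"task not completed (state={state})")
    # The answer lives in the artifacts; history only records the working steps.
    texts = [
        part["text"]
        for artifact in result.get("artifacts", [])
        for part in artifact.get("parts", [])
        if part.get("kind") == "text"
    ]
    return "\n".join(texts)


# Hypothetical response with made-up ids, shaped like the envelope above.
SAMPLE = {
    "jsonrpc": "2.0",
    "id": "1",
    "result": {
        "kind": "task",
        "id": "t-1",
        "status": {"state": "completed"},
        "history": [],
        "artifacts": [{"parts": [{"kind": "text", "text": "## Root Cause ..."}]}],
    },
}

if __name__ == "__main__":
    print(task_text(SAMPLE))
```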
Run it yourself
The lab runs on a standalone kind cluster with Keycloak, Solo Enterprise for kagent
(the install brings the bundled MCP tool server and a default-model-config), the
two agents, and the broken database.
export ANTHROPIC_API_KEY=sk-ant-...
export SOLO_LICENSE_KEY=... # Solo Enterprise for kagent
export AGENTGATEWAY_LICENSE_KEY=... # enterprise agentgateway
./scripts/quick.sh up
# drive the orchestrator; watch it delegate to the DBA over A2A
./scripts/ask.sh "the orders database won't start - investigate and tell me the fix"
# or talk to the specialist directly over its A2A endpoint
AGENT=dba-agent ./scripts/ask.sh "why is orders-db crashlooping?"
There is no kagent CLI dependency: ask.sh is plain
curl against the A2A JSON-RPC endpoint. One sharp edge is worth knowing:
the agent path needs a trailing slash, otherwise the controller
returns a 307 redirect that drops the POST body.
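The trailing-slash rule is easy to enforce in code rather than remember. This is a minimal Python sketch, not part of the lab's scripts: the helper names `a2a_endpoint` and `message_send_request` are hypothetical, and the request is only prepared, not sent, so it runs without a cluster.

```python
import json
import urllib.request


def a2a_endpoint(base_url: str, namespace: str, agent: str) -> str:
    """Build the agent's A2A URL, always ending in a trailing slash."""
    return f"{base_url.rstrip('/')}/api/a2a/{namespace}/{agent.strip('/')}/"


def message_send_request(url: str, text: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST carrying a message/send call."""
    body = {
        "jsonrpc": "2.0",
        "id": "1",
        "method": "message/send",
        "params": {"message": {"role": "user", "messageId": "m1",
                               "parts": [{"kind": "text", "text": text}]}},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    url = a2a_endpoint("http://localhost:8083", "kagent", "sre-orchestrator")
    req = message_send_request(url, "the orders database won't start")
    print(req.get_method(), req.full_url)
    # With the port-forward running, urllib.request.urlopen(req) would send it.
```

Serialising the body with `json.dumps` also sidesteps the shell quoting needed for the apostrophe in "won't" when the same call is made with curl.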
what ask.sh does (A2A message/send by hand)
kubectl -n kagent port-forward svc/kagent-controller 8083:8083 &
# discover: the agent card
curl localhost:8083/api/a2a/kagent/sre-orchestrator/.well-known/agent.json
# send a task (note the trailing slash on the agent path)
curl -X POST "http://localhost:8083/api/a2a/kagent/sre-orchestrator/" \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":"1","method":"message/send",
"params":{"message":{"role":"user",
"parts":[{"kind":"text","text":"the orders database won'\''t start"}],
"messageId":"m1"}}}'
All the YAML
The A2A wiring is two fields. The orchestrator lists the specialist as a tool
with type: Agent; the specialist advertises its skill with
a2aConfig.skills (that is what becomes the agent card above).
Everything the lab applies is here.
yaml/agents/sre-orchestrator.yaml
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: sre-orchestrator
  namespace: kagent
spec:
  type: Declarative
  description: On-call SRE orchestrator. Triages cluster incidents and delegates
    database-specific diagnosis to the dba-agent.
  declarative:
    modelConfig: default-model-config
    systemMessage: |
      You are the on-call SRE. Triage incidents by inspecting Kubernetes
      resources and events. When the problem is database-related DELEGATE the
      diagnosis to the dba-agent and incorporate its findings. Produce a short
      incident summary: what is broken, the root cause, the recommended
      remediation. Note which finding came from the DBA specialist.
    tools:
      - type: McpServer # its own k8s read tools
        mcpServer:
          kind: RemoteMCPServer
          name: kagent-tool-server
          toolNames: [k8s_get_resources, k8s_get_events, k8s_describe_resource]
      - type: Agent # <-- A2A delegation
        agent:
          name: dba-agent
yaml/agents/dba-agent.yaml
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: dba-agent
  namespace: kagent
spec:
  type: Declarative
  description: Database SRE specialist. Diagnoses Postgres/database workload
    problems from Kubernetes state.
  declarative:
    modelConfig: default-model-config
    systemMessage: |
      You are a database reliability specialist for the platform's Postgres
      databases. Given a symptom, inspect the relevant Deployment and pods
      (describe, logs, events), then explain the root cause in plain language and
      give the exact remediation. You diagnose and recommend; you do not apply
      destructive changes yourself.
    tools:
      - type: McpServer
        mcpServer:
          kind: RemoteMCPServer
          name: kagent-tool-server
          toolNames: [k8s_get_resources, k8s_describe_resource, k8s_get_pod_logs, k8s_get_events]
    a2aConfig: # <-- becomes the agent card
      skills:
        - id: diagnose-db
          name: Diagnose a database incident
          description: Find the root cause of a database workload failure and propose the fix
          inputModes: [text]
          outputModes: [text]
          tags: [database, postgres, sre]
          examples:
            - "The orders database pod is crashlooping, why?"
            - "Postgres won't start after the last deploy"
yaml/incident/postgres.yaml: the planted incident
apiVersion: v1
kind: Namespace
metadata: { name: orders, labels: { purpose: a2a-demo } }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-db
  namespace: orders
  annotations: { incident/summary: "orders Postgres will not start after a deploy" }
spec:
  replicas: 1
  selector: { matchLabels: { app: orders-db } }
  template:
    metadata: { labels: { app: orders-db } }
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          # BROKEN ON PURPOSE: no POSTGRES_PASSWORD and no trust auth method.
          env:
            - { name: POSTGRES_DB, value: orders }
          ports: [{ containerPort: 5432 }]
          resources:
            requests: { cpu: 10m, memory: 32Mi }
            limits: { cpu: 250m, memory: 128Mi }
---
apiVersion: v1
kind: Service
metadata: { name: orders-db, namespace: orders }
spec:
  selector: { app: orders-db }
  ports: [{ port: 5432, targetPort: 5432 }]
yaml/accesspolicy/: who may call which agent (identity-driven)
# Alice's group may call the orchestrator (matched on the token's groups claim).
apiVersion: policy.kagent-enterprise.solo.io/v1alpha1
kind: AccessPolicy
metadata: { name: allow-fieldfte-orchestrator, namespace: kagent }
spec:
  action: ALLOW
  from:
    subjects:
      - kind: UserGroup
        userGroup:
          claimName: groups
          claimValue: field-fte
          issuer: http://keycloak.keycloak.svc.cluster.local/realms/solo
  targetRef: { kind: Agent, name: sre-orchestrator }
---
# The orchestrator (acting agent) may call the dba-agent.
apiVersion: policy.kagent-enterprise.solo.io/v1alpha1
kind: AccessPolicy
metadata: { name: allow-orchestrator-to-dba, namespace: kagent }
spec:
  action: ALLOW
  from:
    subjects: [{ kind: Agent, name: sre-orchestrator, namespace: kagent }]
  targetRef: { kind: Agent, name: dba-agent }
---
# The end user may NOT call the dba-agent directly.
apiVersion: policy.kagent-enterprise.solo.io/v1alpha1
kind: AccessPolicy
metadata: { name: deny-user-direct-dba, namespace: kagent }
spec:
  action: DENY
  from:
    subjects:
      - kind: UserGroup
        userGroup: { claimName: groups, claimValue: field-fte, issuer: http://keycloak.keycloak.svc.cluster.local/realms/solo }
  targetRef: { kind: Agent, name: dba-agent }
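The intent of those three policies can be modelled as a toy evaluator. This sketch is not how kagent enforces anything: it assumes deny-overrides with a default deny (mirroring the way Istio evaluates DENY policies before ALLOW policies), and it models the caller of a delegated call as the acting agent rather than the end user. The `Policy` class and `is_allowed` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    action: str        # "ALLOW" or "DENY"
    subject_kind: str  # "UserGroup" or "Agent"
    subject: str       # group claim value or agent name
    target: str        # name of the target Agent


# The three AccessPolicy resources above, reduced to the fields this model uses.
POLICIES = [
    Policy("ALLOW", "UserGroup", "field-fte", "sre-orchestrator"),
    Policy("ALLOW", "Agent", "sre-orchestrator", "dba-agent"),
    Policy("DENY", "UserGroup", "field-fte", "dba-agent"),
]


def is_allowed(caller_kind: str, caller: str, target: str) -> bool:
    """Assumed semantics: a matching DENY wins, then a matching ALLOW, else deny."""
    matches = [
        p for p in POLICIES
        if (p.subject_kind, p.subject, p.target) == (caller_kind, caller, target)
    ]
    if any(p.action == "DENY" for p in matches):
        return False
    return any(p.action == "ALLOW" for p in matches)


if __name__ == "__main__":
    print(is_allowed("UserGroup", "field-fte", "sre-orchestrator"))
    print(is_allowed("Agent", "sre-orchestrator", "dba-agent"))
    print(is_allowed("UserGroup", "field-fte", "dba-agent"))
```

Under these assumptions the user reaches the orchestrator, the orchestrator reaches the DBA, and the user's direct call to the DBA is refused, which is the shape the use case describes.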
kagent-enterprise install values (OIDC + OBO + role mapping)
# helm upgrade --install kagent oci://.../kagent-enterprise (key values)
providers:
  default: anthropic
  anthropic: { apiKey: $ANTHROPIC_API_KEY }
oidc:
  issuer: http://keycloak.keycloak.svc.cluster.local/realms/solo
  clientId: kagent
  skipOBO: false # OBO on: exchange, don't just forward
kagent-tools: { enabled: true } # bundled k8s MCP tool server
rbac:
  roleMapping:
    # map the Keycloak groups claim (lowercase!) to kagent roles
    roleMapper: 'claims.groups.transformList(i, v, v in rolesMap, rolesMap[v])'
    roleMappings:
      field-fte: global.Admin
      field-trial: global.Reader
      field-admin: global.Admin
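The roleMapper expression is a filter-then-map over the groups claim: keep each group that has an entry in rolesMap and replace it with the mapped role. A rough Python reading of the same logic follows; it is an illustration of what the expression computes, not how kagent evaluates it, and the function name `map_roles` is hypothetical.

```python
ROLES_MAP = {
    "field-fte": "global.Admin",
    "field-trial": "global.Reader",
    "field-admin": "global.Admin",
}


def map_roles(groups: list[str], roles_map: dict[str, str] = ROLES_MAP) -> list[str]:
    """Python reading of: claims.groups.transformList(i, v, v in rolesMap, rolesMap[v])."""
    # Keep only the groups that have a mapping, then replace each with its role.
    return [roles_map[v] for v in groups if v in roles_map]


if __name__ == "__main__":
    print(map_roles(["field-fte"]))
    print(map_roles(["Field-FTE", "field-trial"]))
```

The lookup is case-sensitive, so a group named `Field-FTE` maps to nothing, which is why the values comment insists on lowercase.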
yaml/keycloak/keycloak.yaml: the IdP (realm solo: alice/field-fte)
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: keycloak, namespace: keycloak }
spec:
  serviceName: keycloak
  replicas: 1
  selector: { matchLabels: { app: keycloak } }
  template:
    metadata: { labels: { app: keycloak } }
    spec:
      containers:
        - name: keycloak
          image: quay.io/keycloak/keycloak:26.3
          args: [start-dev, --import-realm]
          env:
            # Pin the issuer so the token `iss` matches the controller's oidc.issuer
            # whether minted via port-forward or in-cluster.
            - { name: KC_HOSTNAME, value: "http://keycloak.keycloak.svc.cluster.local" }
            - { name: KC_HOSTNAME_STRICT, value: "false" }
            - { name: KC_HOSTNAME_BACKCHANNEL_DYNAMIC, value: "true" }
          volumeMounts:
            - { name: realm-import, mountPath: /opt/keycloak/data/import, readOnly: true }
      volumes:
        - name: realm-import
          configMap: { name: keycloak-realm-import }
# realm `solo`: users alice/field-fte, bob/field-trial, carol/field-admin;
# public client `kagent` with a groups mapper and an audience=kagent mapper.
How the fix was made
The agents are read-only by design: the DBA diagnoses and recommends, it does not mutate the cluster. So the incident reads like an on-call thread in which a human (or a pipeline) applies the recommended fix at the end:
@sre-orchestrator the orders database won't start, can you investigate and tell me the root cause and the exact fix?
Found orders-db in the orders namespace in CrashLoopBackOff (30 restarts, exit 1). It is a Postgres pod, so I am bringing in the DBA specialist to diagnose.
Root cause: the Deployment sets POSTGRES_DB=orders but no POSTGRES_PASSWORD and no trust auth, so the postgres entrypoint exits 1 on init. Fix: create a Secret with the superuser password and reference it as POSTGRES_PASSWORD via secretKeyRef, then roll the Deployment.
Incident summary: orders-db down due to a missing Postgres password (per the DBA specialist). Remediation: add POSTGRES_PASSWORD from a Secret and restart. Applying now.
Fix applied · deployment/orders-db rolled out · pod is Running · incident resolved.
Applying the DBA's exact remediation, the database recovers:
before: the planted incident
$ kubectl -n orders get pods -l app=orders-db
NAME                         READY   STATUS             RESTARTS
orders-db-58cc8c4df6-jsspl   0/1     CrashLoopBackOff   30
after: the recommended fix
$ kubectl -n orders create secret generic orders-db-secret \
--from-literal=password='<strong-password>'
$ kubectl -n orders patch deployment orders-db ... POSTGRES_PASSWORD
$ kubectl -n orders rollout status deployment/orders-db
deployment "orders-db" successfully rolled out
NAME READY STATUS RESTARTS
orders-db-5c58775db8-gpqv4 1/1 Running 0
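The patch step is abbreviated in the transcript. One possible way to express the DBA's second remediation step is a strategic-merge patch that adds the variable from the Secret; the sketch below generates such a patch. It is an assumption about a workable patch, not the command the lab actually ran. The container name, Secret name, and key are the ones used elsewhere in this lab, and the function name `password_patch` is hypothetical.

```python
import json


def password_patch(container: str, secret: str, key: str) -> dict:
    """Strategic-merge patch adding POSTGRES_PASSWORD from a Secret key."""
    return {
        "spec": {"template": {"spec": {"containers": [{
            "name": container,  # containers are merged by name
            "env": [{
                "name": "POSTGRES_PASSWORD",
                "valueFrom": {"secretKeyRef": {"name": secret, "key": key}},
            }],
        }]}}}
    }


if __name__ == "__main__":
    patch = password_patch("postgres", "orders-db-secret", "password")
    # The printed JSON could be passed to: kubectl -n orders patch deployment orders-db -p '<json>'
    print(json.dumps(patch))
```

Because the patch changes the pod template, applying it triggers a new rollout, which matches the rollout status check that follows in the transcript.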
Identity rides the chain
Because the lab runs on Solo Enterprise for kagent with OIDC and On-Behalf-Of
(OBO) turned on, the caller's identity travels with the A2A delegation. When the
controller proxies Alice's call into an agent, it swaps her Keycloak token for a
kagent-signed OBO token: her subject is preserved, a delegated act
claim names the acting agent, and the issuer becomes kagent.
show-obo.sh captures that token live and decodes it:
Inbound: Alice's Keycloak token
{
  "iss": "…/realms/solo",
  "sub": "ca0c9432-…-c66048cdfc37",
  "aud": "kagent",
  "groups": ["field-fte"]
  // no act claim
}
Exchanged: kagent OBO token (captured)
{
  "iss": "kagent.kagent",
  "sub": "ca0c9432-…-c66048cdfc37", // preserved
  "act": {
    "sub": "…serviceaccount:kagent:sre-orchestrator"
  },
  "aud": ["kagent/sre-orchestrator"]
}
The agent-to-agent hop carries x-user-id and x-kagent-source: agent; the signed OBO
bearer rides the controller-to-agent hop. The AccessPolicy resources
are accepted but are enforced only through kagent's Istio authorization-policy
translation, which needs a mesh. The A2A delegation and the OBO exchange do not
depend on that enforcement, and both work as shown.
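Inspecting a token like the ones above needs only base64url decoding of the middle segment. The sketch below is not the implementation of show-obo.sh: it decodes the payload without verifying the signature, which is fine for inspection but never for authorization, and the demo token is a hypothetical stand-in with made-up claim values.

```python
import base64
import json


def _b64url(data: bytes) -> str:
    """Encode bytes as unpadded base64url, the encoding JWT segments use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_payload(token: str) -> dict:
    """Decode a JWT's payload WITHOUT verifying the signature (inspection only)."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def is_delegated(claims: dict) -> bool:
    """An exchanged OBO token carries an `act` claim naming the acting party."""
    return isinstance(claims.get("act"), dict) and "sub" in claims["act"]


# Hypothetical stand-in token: made-up claim values and a dummy signature.
_claims = {"iss": "kagent.kagent", "sub": "user-123",
           "act": {"sub": "agent-abc"}, "aud": ["kagent/sre-orchestrator"]}
DEMO_TOKEN = ".".join([
    _b64url(b'{"alg":"none"}'),
    _b64url(json.dumps(_claims).encode("utf-8")),
    "signature",
])

if __name__ == "__main__":
    claims = decode_payload(DEMO_TOKEN)
    print(claims["sub"], "acting via", claims["act"]["sub"])
    print(is_delegated(claims))
```

Comparing `sub` across the inbound and exchanged tokens, and checking for `act`, is enough to confirm the two properties the section describes: the user is preserved and the acting agent is named.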
Extending it
- Cross-namespace delegation. Add agent.namespace to the reference; the target opts in with spec.allowedNamespaces.
- Approval gates. Put a mutating tool in the specialist's requireApproval list so a delegated call pauses for human confirmation before it runs.
- More specialists. Add a network or storage agent and let the orchestrator pick the right one from each agent card's skills.
- Metrics-aware diagnosis. Give the DBA the bundled Grafana MCP query_prometheus tool to read DB metrics alongside k8s state.
See also
- kagent docs: tools and agent-as-tool
- The A2A protocol
- JWT token exchange: OIDC IdP + agentgateway STS
- Sibling lab: an OpenClaw SRE sandbox that triages and fixes the cluster
Versions
Built and verified on:
v2.3.4 · 0.4.3 · v1.4.0