Agent-to-Agent in kagent: an SRE that delegates DB incidents to a DBA, by Tom O'Rourke

Agent-to-agent (A2A) is the part that makes a fleet of agents more than a pile of prompts: one agent discovers another, hands it a task over a standard protocol, and gets back a structured result. This lab puts that on display. An on-call SRE orchestrator triages a broken database, finds a DBA specialist by reading its A2A agent card, and delegates the diagnosis with a real message/send call. Every piece of that exchange is shown from the live cluster: the card, the inter-agent call captured on the wire, and the Task that comes back. Then the fix is applied and the database recovers. Because the lab runs on Solo Enterprise for kagent, the caller's identity also travels with the delegation: as an exchanged On-Behalf-Of token on the controller-to-agent hop, and as identity headers on the direct agent-to-agent hop.

The use case

A field engineer pings the on-call channel: "the orders database won't start." You do not want one giant agent holding every runbook for every system. You want an on-call generalist that triages anything, and specialists it can pull in on demand, each with its own instructions, its own tools, and its own blast radius. That is the A2A pattern: the SRE orchestrator owns the incident, and when it realises the problem is a database it brings in the DBA agent rather than guessing. The DBA never talks to the user directly; it is only ever reached through the orchestrator, on the user's behalf.

generalist · calls the specialist

sre-orchestrator

role: Triage any cluster incident
tools: k8s read + type: Agent → dba-agent
a2a: Acts as an A2A client to the DBA

specialist · reached over A2A

dba-agent

role: Diagnose database workloads
tools: k8s describe / logs / events
a2a: An A2A server; skill on its agent card

The flow

A2A in action

This is the heart of the lab. Three things happen over the A2A protocol, and all three are shown from the live cluster below: discovery (read the agent card), delegation (send a task), and the result (a Task comes back).

1. Discovery: the DBA's agent card

Every kagent agent serves an A2A agent card at /.well-known/agent.json. It advertises the agent's skills, its transport, and its capabilities, which is how the orchestrator (or any A2A client) knows what the DBA can do and how to call it. This is the real card:

json · GET /api/a2a/kagent/dba-agent/.well-known/agent.json

{
  "name": "dba_agent",
  "description": "Database SRE specialist. Diagnoses Postgres/database workload problems ...",
  "url": "http://kagent-controller.kagent.svc.cluster.local:8083/api/a2a/kagent/dba-agent/",
  "capabilities": { "streaming": true, "pushNotifications": false, "stateTransitionHistory": true },
  "defaultInputModes": ["text"],
  "defaultOutputModes": ["text"],
  "skills": [
    {
      "id": "diagnose-db",
      "name": "Diagnose a database incident",
      "description": "Find the root cause of a database workload failure and propose the fix",
      "tags": ["database", "postgres", "sre"],
      "examples": ["The orders database pod is crashlooping, why?", "Postgres won't start after the last deploy"]
    }
  ],
  "preferredTransport": "JSONRPC"
}

That one document is the whole discovery contract. Reading it top to bottom:

name and description are the identity, and the description is what the orchestrator's model actually reads to decide this is the agent worth delegating to.
url is the A2A endpoint a client posts to, and preferredTransport says how to talk to it (JSONRPC here), so the caller knows the wire format before it sends anything.
capabilities are the optional protocol features a client should check before relying on them: streaming for incremental updates, pushNotifications for webhook-style callbacks, stateTransitionHistory for the task history this agent keeps.
defaultInputModes and defaultOutputModes are the content types it accepts and returns (text for this DBA).
skills is the catalogue of what it can do. Each skill carries an id, a human name and description, tags, and examples. The orchestrator's model matches the incident against those descriptions and examples to pick the right skill, which is exactly the selection that produces the call in step 2.

The card is served at a well-known path so any A2A client can fetch it with a plain GET, before opening a session and without authenticating. kagent serves it at /.well-known/agent.json; the A2A specification standardises the location as /.well-known/agent-card.json (RFC 8615) and also lets agents be listed in a curated registry that clients query by skill or tag. Either way the card is the thing a caller reads first: it turns "call this agent" into something a client can do mechanically, the same role MCP's Server Cards now play for tools.
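
The checks described above can be sketched in a few lines of Python. This is a minimal illustration, not kagent's own client code: the helper names supports and skills_with_tag are hypothetical, the card is a trimmed copy of the one shown earlier, and the tag filter is a deliberate simplification of the model-driven matching on descriptions and examples that the orchestrator actually performs.

```python
import json

# Trimmed copy of the dba-agent card shown earlier in this section.
CARD_JSON = """
{
  "name": "dba_agent",
  "url": "http://kagent-controller.kagent.svc.cluster.local:8083/api/a2a/kagent/dba-agent/",
  "capabilities": {"streaming": true, "pushNotifications": false, "stateTransitionHistory": true},
  "defaultInputModes": ["text"],
  "defaultOutputModes": ["text"],
  "skills": [
    {
      "id": "diagnose-db",
      "name": "Diagnose a database incident",
      "tags": ["database", "postgres", "sre"]
    }
  ],
  "preferredTransport": "JSONRPC"
}
"""


def supports(card: dict, capability: str) -> bool:
    """Treat a capability that is missing from the card as unsupported."""
    return bool(card.get("capabilities", {}).get(capability, False))


def skills_with_tag(card: dict, tag: str) -> list:
    """Return the ids of the skills that advertise the given tag."""
    return [s["id"] for s in card.get("skills", []) if tag in s.get("tags", [])]


card = json.loads(CARD_JSON)
print("endpoint: ", card["url"])
print("transport:", card["preferredTransport"])
print("streaming:", supports(card, "streaming"))
print("push:     ", supports(card, "pushNotifications"))
print("postgres skills:", skills_with_tag(card, "postgres"))
```

A client that wants incremental updates would check supports(card, "streaming") before relying on them, and fall back to a blocking call otherwise.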

2. Delegation: the orchestrator sends a task

After it triages (it pulls the pod and events itself first), the orchestrator delegates by sending the DBA a JSON-RPC message/send. This is the actual agent-to-agent request captured on the wire arriving at the DBA pod. Note the orchestrator is an HTTP client here (python-httpx), the call is marked as agent-originated (x-kagent-source: agent), the user's identity rides along (x-user-id), and the orchestrator has written its own detailed task for the specialist from what it found:

http · orchestrator → dba-agent, captured on dba-agent:8080

POST / HTTP/1.1
Host: dba-agent.kagent:8080
User-Agent: python-httpx/0.28.1
x-kagent-source: agent
x-user-id: ca0c9432-6f36-44cc-9fd5-c66048cdfc37

{
  "jsonrpc": "2.0",
  "id": "cff5fae8-7be6-4373-a2c3-1af848a38b62",
  "method": "message/send",
  "params": {
    "configuration": { "acceptedOutputModes": [], "blocking": true },
    "message": {
      "role": "user",
      "kind": "message",
      "contextId": "7d068fc6-3a67-4586-bf36-58c49e43b272",
      "messageId": "4d1f7743-ad78-4b74-aeaf-aed4a8973e0e",
      "parts": [{ "kind": "text", "text":
        "The orders-db pod (postgres:16-alpine) in the orders namespace is in
         CrashLoopBackOff with 30 restarts. The container exits with code 1 ...
         Environment: POSTGRES_DB=orders ... Please diagnose why the PostgreSQL
         container is failing to start and what should be fixed." }]
    }
  }
}
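
For readers who want to reproduce the shape of that request, here is a small Python sketch that builds the same JSON-RPC envelope. The function name build_message_send is hypothetical and the ids are freshly generated UUIDs; only the field names and nesting are taken from the captured call.

```python
import json
import uuid
from typing import Optional


def build_message_send(text: str, context_id: Optional[str] = None,
                       blocking: bool = True) -> dict:
    """Build a JSON-RPC 2.0 message/send request shaped like the captured call."""
    message = {
        "role": "user",
        "kind": "message",
        "messageId": str(uuid.uuid4()),
        "parts": [{"kind": "text", "text": text}],
    }
    # contextId groups related messages; omit it to start a new context.
    if context_id is not None:
        message["contextId"] = context_id
    return {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "message/send",
        "params": {
            "configuration": {"acceptedOutputModes": [], "blocking": blocking},
            "message": message,
        },
    }


request = build_message_send(
    "The orders-db pod is in CrashLoopBackOff. Please diagnose why it fails to start."
)
print(json.dumps(request, indent=2))
```

Setting blocking to true, as in the capture, asks the server to hold the call until the task finishes, so the response already carries the final result.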

3. Result: a Task comes back

The DBA does its own work (describe, logs, events) and replies with an A2A Task: a stateful object with a status, a history of the messages it produced while working, and the artifacts that hold the answer. Here is the response envelope and the diagnosis artifact:

json · dba-agent → orchestrator (A2A Task)

{
  "jsonrpc": "2.0",
  "id": "cff5fae8-...",
  "result": {
    "kind": "task",
    "id": "08508326-...",
    "contextId": "135882bd-...",
    "status": { "state": "completed" },
    "history": [ /* 9 messages: the DBA's describe/logs/events steps */ ],
    "artifacts": [ { "parts": [ { "kind": "text", "text": "...diagnosis below..." } ] } ]
  }
}

markdown · the diagnosis artifact (the DBA's answer)

## Root Cause

The `orders-db` pod is crashlooping because the `POSTGRES_PASSWORD` environment
variable is not set. The container is configured with only POSTGRES_DB=orders, but
the postgres:16-alpine image requires either POSTGRES_PASSWORD for the superuser,
or POSTGRES_HOST_AUTH_METHOD=trust (not recommended). Without it, initialization
fails with exit code 1 and the pod crashloops.

## Exact Fix

1. Create a Secret with the superuser password:
   kubectl create secret generic orders-db-secret \
     --from-literal=password='<strong-password>' -n orders
2. Add POSTGRES_PASSWORD to the deployment from that Secret (secretKeyRef).
3. Verify:
   kubectl rollout status deployment/orders-db -n orders
   kubectl get pod -n orders -l app=orders-db
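
Pulling the answer out of the Task envelope from step 3 is mostly a matter of walking result.artifacts. The sketch below assumes the response shape shown there; artifact_text is a hypothetical helper, and the sample response uses placeholder ids and a shortened diagnosis rather than the captured values.

```python
def artifact_text(response: dict) -> str:
    """Join the text parts of every artifact in a completed A2A Task response."""
    if "error" in response:
        raise RuntimeError(f"JSON-RPC error: {response['error']}")
    result = response["result"]
    if result.get("kind") != "task":
        raise ValueError(f"expected a task, got {result.get('kind')!r}")
    state = result["status"]["state"]
    if state != "completed":
        raise RuntimeError(f"task ended in state {state!r}")
    texts = [
        part["text"]
        for artifact in result.get("artifacts", [])
        for part in artifact.get("parts", [])
        if part.get("kind") == "text"
    ]
    return "\n".join(texts)


# Placeholder response with the same structure as the envelope in step 3.
sample = {
    "jsonrpc": "2.0",
    "id": "request-1",
    "result": {
        "kind": "task",
        "id": "task-1",
        "contextId": "context-1",
        "status": {"state": "completed"},
        "history": [],
        "artifacts": [
            {"parts": [{"kind": "text", "text": "## Root Cause\nPOSTGRES_PASSWORD is not set."}]}
        ],
    },
}
print(artifact_text(sample))
```

Checking status.state before reading the artifacts keeps a failed or still-running task from being mistaken for an empty diagnosis.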

Run it yourself

The lab runs on a standalone kind cluster with Keycloak, Solo Enterprise for kagent (the install brings the bundled MCP tool server and a default-model-config), the two agents, and the broken database.

export ANTHROPIC_API_KEY=sk-ant-...
export SOLO_LICENSE_KEY=...             # Solo Enterprise for kagent
export AGENTGATEWAY_LICENSE_KEY=...     # enterprise agentgateway
./scripts/quick.sh up

# drive the orchestrator; watch it delegate to the DBA over A2A
./scripts/ask.sh "the orders database won't start - investigate and tell me the fix"

# or talk to the specialist directly over its A2A endpoint
AGENT=dba-agent ./scripts/ask.sh "why is orders-db crashlooping?"

There is no kagent CLI dependency: ask.sh is plain curl against the A2A JSON-RPC endpoint. One sharp edge is worth knowing: the agent path needs a trailing slash, otherwise the controller returns a 307 redirect that drops the POST body.
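
The trailing-slash rule is easy to bake into a URL helper so it cannot be forgotten. The following Python sketch mirrors what the text says ask.sh does, using only the standard library; agent_url and ask are hypothetical names, and the base URL assumes a port-forward of the controller to localhost:8083.

```python
import json
import urllib.request


def agent_url(base: str, namespace: str, agent: str) -> str:
    """Build the A2A endpoint for an agent, always ending in a slash."""
    return f"{base.rstrip('/')}/api/a2a/{namespace}/{agent.strip('/')}/"


def ask(url: str, text: str, message_id: str = "m1") -> dict:
    """POST a message/send request and return the decoded JSON-RPC response."""
    body = {
        "jsonrpc": "2.0",
        "id": "1",
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": text}],
                "messageId": message_id,
            }
        },
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


url = agent_url("http://localhost:8083", "kagent", "sre-orchestrator")
print(url)
# Uncomment once the controller is reachable on localhost:8083:
# print(ask(url, "the orders database won't start"))
```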

bash · what ask.sh does (A2A message/send by hand)

kubectl -n kagent port-forward svc/kagent-controller 8083:8083 &

# discover: the agent card
curl localhost:8083/api/a2a/kagent/sre-orchestrator/.well-known/agent.json

# send a task (note the trailing slash on the agent path)
curl -X POST "http://localhost:8083/api/a2a/kagent/sre-orchestrator/" \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":"1","method":"message/send",
       "params":{"message":{"role":"user",
         "parts":[{"kind":"text","text":"the orders database won'\''t start"}],
         "messageId":"m1"}}}'

All the YAML

The A2A wiring is two fields. The orchestrator lists the specialist as a tool with type: Agent; the specialist advertises its skill with a2aConfig.skills (that is what becomes the agent card above). Everything the lab applies is here.

yaml · yaml/agents/sre-orchestrator.yaml

apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: sre-orchestrator
  namespace: kagent
spec:
  type: Declarative
  description: On-call SRE orchestrator. Triages cluster incidents and delegates
    database-specific diagnosis to the dba-agent.
  declarative:
    modelConfig: default-model-config
    systemMessage: |
      You are the on-call SRE. Triage incidents by inspecting Kubernetes
      resources and events. When the problem is database-related DELEGATE the
      diagnosis to the dba-agent and incorporate its findings. Produce a short
      incident summary: what is broken, the root cause, the recommended
      remediation. Note which finding came from the DBA specialist.
    tools:
      - type: McpServer                     # its own k8s read tools
        mcpServer:
          kind: RemoteMCPServer
          name: kagent-tool-server
          toolNames: [k8s_get_resources, k8s_get_events, k8s_describe_resource]
      - type: Agent                         # <-- A2A delegation
        agent:
          name: dba-agent

yaml · yaml/agents/dba-agent.yaml

apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: dba-agent
  namespace: kagent
spec:
  type: Declarative
  description: Database SRE specialist. Diagnoses Postgres/database workload
    problems from Kubernetes state.
  declarative:
    modelConfig: default-model-config
    systemMessage: |
      You are a database reliability specialist for the platform's Postgres
      databases. Given a symptom, inspect the relevant Deployment and pods
      (describe, logs, events), then explain the root cause in plain language and
      give the exact remediation. You diagnose and recommend; you do not apply
      destructive changes yourself.
    tools:
      - type: McpServer
        mcpServer:
          kind: RemoteMCPServer
          name: kagent-tool-server
          toolNames: [k8s_get_resources, k8s_describe_resource, k8s_get_pod_logs, k8s_get_events]
    a2aConfig:                              # <-- becomes the agent card
      skills:
        - id: diagnose-db
          name: Diagnose a database incident
          description: Find the root cause of a database workload failure and propose the fix
          inputModes: [text]
          outputModes: [text]
          tags: [database, postgres, sre]
          examples:
            - "The orders database pod is crashlooping, why?"
            - "Postgres won't start after the last deploy"

yaml · yaml/incident/postgres.yaml: the planted incident

apiVersion: v1
kind: Namespace
metadata: { name: orders, labels: { purpose: a2a-demo } }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-db
  namespace: orders
  annotations: { incident/summary: "orders Postgres will not start after a deploy" }
spec:
  replicas: 1
  selector: { matchLabels: { app: orders-db } }
  template:
    metadata: { labels: { app: orders-db } }
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          # BROKEN ON PURPOSE: no POSTGRES_PASSWORD and no trust auth method.
          env:
            - { name: POSTGRES_DB, value: orders }
          ports: [{ containerPort: 5432 }]
          resources:
            requests: { cpu: 10m, memory: 32Mi }
            limits:   { cpu: 250m, memory: 128Mi }
---
apiVersion: v1
kind: Service
metadata: { name: orders-db, namespace: orders }
spec:
  selector: { app: orders-db }
  ports: [{ port: 5432, targetPort: 5432 }]

yaml · yaml/accesspolicy/: who may call which agent (identity-driven)

# Alice's group may call the orchestrator (matched on the token's groups claim).
apiVersion: policy.kagent-enterprise.solo.io/v1alpha1
kind: AccessPolicy
metadata: { name: allow-fieldfte-orchestrator, namespace: kagent }
spec:
  action: ALLOW
  from:
    subjects:
      - kind: UserGroup
        userGroup:
          claimName: groups
          claimValue: field-fte
          issuer: http://keycloak.keycloak.svc.cluster.local/realms/solo
  targetRef: { kind: Agent, name: sre-orchestrator }
---
# The orchestrator (acting agent) may call the dba-agent.
apiVersion: policy.kagent-enterprise.solo.io/v1alpha1
kind: AccessPolicy
metadata: { name: allow-orchestrator-to-dba, namespace: kagent }
spec:
  action: ALLOW
  from:
    subjects: [{ kind: Agent, name: sre-orchestrator, namespace: kagent }]
  targetRef: { kind: Agent, name: dba-agent }
---
# The end user may NOT call the dba-agent directly.
apiVersion: policy.kagent-enterprise.solo.io/v1alpha1
kind: AccessPolicy
metadata: { name: deny-user-direct-dba, namespace: kagent }
spec:
  action: DENY
  from:
    subjects:
      - kind: UserGroup
        userGroup: { claimName: groups, claimValue: field-fte, issuer: http://keycloak.keycloak.svc.cluster.local/realms/solo }
  targetRef: { kind: Agent, name: dba-agent }

yaml · kagent-enterprise install values (OIDC + OBO + role mapping)

# helm upgrade --install kagent oci://.../kagent-enterprise  (key values)
providers:
  default: anthropic
  anthropic: { apiKey: $ANTHROPIC_API_KEY }
oidc:
  issuer: http://keycloak.keycloak.svc.cluster.local/realms/solo
  clientId: kagent
  skipOBO: false                 # OBO on: exchange, don't just forward
kagent-tools: { enabled: true }  # bundled k8s MCP tool server
rbac:
  roleMapping:
    # map the Keycloak groups claim (lowercase!) to kagent roles
    roleMapper: 'claims.groups.transformList(i, v, v in rolesMap, rolesMap[v])'
    roleMappings:
      field-fte: global.Admin
      field-trial: global.Reader
      field-admin: global.Admin
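
The roleMapper expression in the values above is a CEL comprehension: it walks the groups claim, keeps the values that appear as keys in the role map, and replaces each with its mapped role. A rough Python equivalent is sketched below, assuming rolesMap corresponds to the roleMappings table; map_roles is a hypothetical name used only for illustration.

```python
def map_roles(groups: list, roles_map: dict) -> list:
    """Keep only the groups present in the map and translate each to its role."""
    return [roles_map[g] for g in groups if g in roles_map]


# Same table as roleMappings in the install values.
ROLES_MAP = {
    "field-fte": "global.Admin",
    "field-trial": "global.Reader",
    "field-admin": "global.Admin",
}

print(map_roles(["field-fte"], ROLES_MAP))             # ['global.Admin']
print(map_roles(["field-trial", "other"], ROLES_MAP))  # ['global.Reader']
print(map_roles(["Field-FTE"], ROLES_MAP))             # [] (lookup is case-sensitive)
```

The last call shows why the values comment insists on lowercase: a group name whose case differs from the key simply maps to no role.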

yaml · yaml/keycloak/keycloak.yaml: the IdP (realm solo: alice/field-fte)

apiVersion: apps/v1
kind: StatefulSet
metadata: { name: keycloak, namespace: keycloak }
spec:
  serviceName: keycloak
  replicas: 1
  selector: { matchLabels: { app: keycloak } }
  template:
    metadata: { labels: { app: keycloak } }
    spec:
      containers:
        - name: keycloak
          image: quay.io/keycloak/keycloak:26.3
          args: [start-dev, --import-realm]
          env:
            # Pin the issuer so the token `iss` matches the controller's oidc.issuer
            # whether minted via port-forward or in-cluster.
            - { name: KC_HOSTNAME, value: "http://keycloak.keycloak.svc.cluster.local" }
            - { name: KC_HOSTNAME_STRICT, value: "false" }
            - { name: KC_HOSTNAME_BACKCHANNEL_DYNAMIC, value: "true" }
          volumeMounts:
            - { name: realm-import, mountPath: /opt/keycloak/data/import, readOnly: true }
      volumes:
        - name: realm-import
          configMap: { name: keycloak-realm-import }
# realm `solo`: users alice/field-fte, bob/field-trial, carol/field-admin;
# public client `kagent` with a groups mapper and an audience=kagent mapper.

How the fix was made

The agents are read-only by design: the DBA diagnoses and recommends, it does not mutate the cluster. So the incident reads like an on-call thread in which a human (or a pipeline) applies the recommended fix at the end:

#incidents orders-db will not start

alice · 09:14

@sre-orchestrator the orders database won't start, can you investigate and tell me the root cause and the exact fix?

sre-orchestrator (agent) · 09:14

Found orders-db in the orders namespace in CrashLoopBackOff (30 restarts, exit 1). It is a Postgres pod, so I am bringing in the DBA specialist to diagnose.

dba-agent (agent · via A2A) · 09:15

Root cause: the Deployment sets POSTGRES_DB=orders but no POSTGRES_PASSWORD and no trust auth, so the postgres entrypoint exits 1 on init. Fix: create a Secret with the superuser password and reference it as POSTGRES_PASSWORD via secretKeyRef, then roll the Deployment.

sre-orchestrator (agent) · 09:15

Incident summary: orders-db down due to a missing Postgres password (per the DBA specialist). Remediation: add POSTGRES_PASSWORD from a Secret and restart. Applying now.

✓

incident-bot · 09:16

Fix applied · deployment/orders-db rolled out · pod is Running · incident resolved.

Applying the DBA's exact remediation, the database recovers:

before: the planted incident

$ kubectl -n orders get pods -l app=orders-db
NAME                         READY   STATUS             RESTARTS
orders-db-58cc8c4df6-jsspl   0/1     CrashLoopBackOff   30

after: the recommended fix

$ kubectl -n orders create secret generic orders-db-secret \
    --from-literal=password='<strong-password>'
$ kubectl -n orders patch deployment orders-db ... POSTGRES_PASSWORD
$ kubectl -n orders rollout status deployment/orders-db
deployment "orders-db" successfully rolled out

$ kubectl -n orders get pods -l app=orders-db
NAME                         READY   STATUS    RESTARTS
orders-db-5c58775db8-gpqv4   1/1     Running   0
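
The patch step is abbreviated in the transcript. One possible way to express the DBA's second step is a strategic merge patch that adds POSTGRES_PASSWORD from the Secret; the Python sketch below only builds that patch body and is an assumption, not necessarily what the lab ran. The container name, Secret name, and key come from the manifests and commands in this article, and password_patch is a hypothetical helper.

```python
import json


def password_patch(container: str, secret: str, key: str) -> dict:
    """Build a strategic merge patch that sets POSTGRES_PASSWORD from a Secret."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are merged by name, so only this one changes.
                            "name": container,
                            "env": [
                                {
                                    "name": "POSTGRES_PASSWORD",
                                    "valueFrom": {
                                        "secretKeyRef": {"name": secret, "key": key}
                                    },
                                }
                            ],
                        }
                    ]
                }
            }
        }
    }


patch = password_patch("postgres", "orders-db-secret", "password")
print(json.dumps(patch))
```

The printed JSON could then be passed to kubectl patch with --type strategic and -p, which changes the pod template and therefore triggers a new rollout.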

Identity rides the chain

Because the lab runs on Solo Enterprise for kagent with OIDC and On-Behalf-Of (OBO) turned on, the caller's identity travels with the A2A delegation. When the controller proxies Alice's call into an agent, it swaps her Keycloak token for a kagent-signed OBO token: her subject is preserved, a delegated act claim names the acting agent, and the issuer becomes kagent. show-obo.sh captures that token live and decodes it:

Inbound: Alice's Keycloak token

{
  "iss": "…/realms/solo",
  "sub": "ca0c9432-…-c66048cdfc37",
  "aud": "kagent",
  "groups": ["field-fte"]
  // no act claim
}

Exchanged: kagent OBO token (captured)

{
  "iss": "kagent.kagent",
  "sub": "ca0c9432-…-c66048cdfc37",  // preserved
  "act": {
    "sub": "…serviceaccount:kagent:sre-orchestrator"
  },
  "aud": ["kagent/sre-orchestrator"]
}
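
To see the same difference without show-obo.sh, the claims of any JWT can be decoded by base64url-decoding its middle segment. The sketch below does that in Python and checks the three properties described above. It does not verify signatures, the tokens are unsigned placeholders built in place with made-up values, and decode_claims and looks_like_obo are hypothetical helper names.

```python
import base64
import json


def b64url(data: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment (for the demo tokens)."""
    raw = json.dumps(data, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def looks_like_obo(inbound: dict, exchanged: dict) -> bool:
    """Subject preserved, issuer changed, and an act claim names the acting party."""
    return (
        exchanged.get("sub") == inbound.get("sub")
        and exchanged.get("iss") != inbound.get("iss")
        and "sub" in exchanged.get("act", {})
    )


# Placeholder tokens: header.payload.<empty signature>. All values are made up.
header = b64url({"alg": "none"})
inbound_token = f"{header}." + b64url(
    {"iss": "https://idp.example/realms/solo", "sub": "user-123",
     "aud": "kagent", "groups": ["field-fte"]}
) + "."
exchanged_token = f"{header}." + b64url(
    {"iss": "kagent.kagent", "sub": "user-123",
     "act": {"sub": "example-acting-agent"}, "aud": ["kagent/sre-orchestrator"]}
) + "."

inbound = decode_claims(inbound_token)
exchanged = decode_claims(exchanged_token)
print(looks_like_obo(inbound, exchanged))
```

Decoding without verification is fine for inspection, but anything that makes an authorization decision should validate the signature against the issuer's keys first.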

Honest scope note. On the direct agent-to-agent hop you saw above (orchestrator to DBA, pod to pod) identity travels as the headers x-user-id and x-kagent-source: agent; the signed OBO bearer rides the controller-to-agent hop. The AccessPolicy resources are accepted but take effect only through kagent's Istio authorization-policy translation, which needs a mesh. The A2A delegation and the OBO exchange are independent and both work as shown.

Extending it

Cross-namespace delegation. Add agent.namespace to the reference; the target opts in with spec.allowedNamespaces.
Approval gates. Put a mutating tool in the specialist's requireApproval list so a delegated call pauses for human confirmation before it runs.
More specialists. Add a network or storage agent and let the orchestrator pick the right one from each agent card's skills.
Metrics-aware diagnosis. Give the DBA the bundled Grafana MCP query_prometheus tool to read DB metrics alongside k8s state.

Versions

Built and verified on:

Enterprise · validated 2026-06-18

Solo Enterprise for agentgateway: v2.3.4
Solo Enterprise for kagent: 0.4.3
Gateway API: v1.4.0
