Scoring agents with agentevals: trace-based evaluation on kagent, by Tom O'Rourke

The agent-frameworks lab runs the same Kubernetes incident through five frameworks and ends with a table of LLM calls, tokens and latency per framework. This is how that table was produced. agentevals is an open-source tool that scores an agent from its OpenTelemetry traces, not by re-running it, so once each agent emits traces you can compare them all the same way: cost, latency, and whether they called the right tools. Everything here runs against the lab's existing kind cluster, so set that up first and come back.

Lab first. This guide assumes the agent-frameworks-kind cluster is up (the five crews, the broken checkout Deployment, and the gateway). Here we only add the evaluation layer on top of it.

How agentevals works

agentevals reads a trace file (OTLP JSON or Jaeger JSON), reconstructs the agent's invocations and tool calls, and scores them. It understands two trace shapes: the ADK trace format, and standard OpenTelemetry GenAI semantic conventions (gen_ai.* spans) emitted by LangChain, LiteLLM and others. Two kinds of metric come out of the box:

Performance: LLM calls, tokens (prompt and output), and latency, read straight from the spans. No reference needed, no judge model.
Behaviour: tool_trajectory_avg_score compares the tool calls the agent made against a golden eval set you write, plus LLM-as-judge metrics for response quality.
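
To make "read straight from the spans" concrete, here is a minimal sketch of counting LLM calls and summing tokens from an OTLP JSON document. It is not agentevals' own code. It assumes an LLM call is any span carrying the GenAI semantic-convention attribute gen_ai.usage.input_tokens, and the sample document and its token counts are hypothetical.

```python
def iter_spans(doc):
    """Yield every span in an OTLP JSON document."""
    for resource_spans in doc.get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            yield from scope_spans.get("spans", [])


def int_attrs(span):
    """Return the span's integer attributes as a plain dict."""
    out = {}
    for attr in span.get("attributes", []):
        value = attr.get("value", {})
        if "intValue" in value:
            # OTLP JSON usually carries int64 values as strings.
            out[attr["key"]] = int(value["intValue"])
    return out


def summarise(doc):
    """Count LLM calls and sum prompt/output tokens (assumed attribute names)."""
    calls = prompt = output = 0
    for span in iter_spans(doc):
        attrs = int_attrs(span)
        if "gen_ai.usage.input_tokens" not in attrs:
            continue  # not an LLM call under this sketch's assumption
        calls += 1
        prompt += attrs["gen_ai.usage.input_tokens"]
        output += attrs.get("gen_ai.usage.output_tokens", 0)
    return {"llm_calls": calls, "prompt_tokens": prompt,
            "output_tokens": output, "total_tokens": prompt + output}


# Hypothetical document: one LLM span and one tool span without usage data.
SAMPLE = {"resourceSpans": [{"scopeSpans": [{"spans": [
    {"name": "chat", "attributes": [
        {"key": "gen_ai.usage.input_tokens", "value": {"intValue": "100"}},
        {"key": "gen_ai.usage.output_tokens", "value": {"intValue": "20"}}]},
    {"name": "execute_tool get_pods", "attributes": []},
]}]}]}

print(summarise(SAMPLE))
```

To try it on a real trace, pass the parsed contents of a merged OTLP file, for example summarise(json.load(open("trace.json"))). Expect small differences from agentevals' figures if its definition of an LLM call differs from the assumption above.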

Install agentevals

It is a Python CLI. A virtualenv keeps it off your system Python.

# on your laptop, not in the cluster
python3 -m venv .aevals && source .aevals/bin/activate
pip install agentevals-cli

agentevals --help            # run, serve, evaluator, ...
agentevals evaluator list    # built-in metrics incl. tool_trajectory_avg_score

Set up trace capture

The agents run in the cluster, so the cleanest path is to collect their traces in the cluster too. Three steps: a collector to receive the traces, kagent's tracing switch, and a little per-framework instrumentation.

1. An OpenTelemetry collector that writes a file

A standard collector with an OTLP receiver and a file exporter. The collector image is distroless, so a small busybox sidecar shares the data volume and lets you read the file out.

yaml/eval/otel-collector.yaml (config, abbreviated)

receivers:
  otlp:
    protocols:
      http: { endpoint: 0.0.0.0:4318 }
      grpc: { endpoint: 0.0.0.0:4317 }
exporters:
  file: { path: /data/otlp.json }
  debug: { verbosity: basic }
service:
  pipelines:
    traces: { receivers: [otlp], exporters: [file, debug] }
# Deployment: otel/opentelemetry-collector-contrib + a busybox "reader" sidecar
# mounting the same emptyDir at /data, so you can `kubectl exec ... -c reader -- cat`.
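
Before instrumenting any agent, it can help to confirm that the collector accepts traces and writes them to the file. The sketch below builds a one-span OTLP JSON payload and posts it to the HTTP receiver. The function names, the span name and the localhost endpoint are illustrative assumptions; the endpoint presumes a port-forward to the collector's 4318 port is running.

```python
import json
import os
import time
import urllib.request


def build_smoke_trace(service_name="smoke-test"):
    """Build a one-span OTLP JSON payload (IDs are hex strings in this encoding)."""
    now = time.time_ns()
    span = {
        "traceId": os.urandom(16).hex(),   # 32 hex characters
        "spanId": os.urandom(8).hex(),     # 16 hex characters
        "name": "collector-smoke-test",    # hypothetical span name
        "kind": 1,                         # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(now - 1_000_000),
        "endTimeUnixNano": str(now),
    }
    return {"resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": service_name}}]},
        "scopeSpans": [{"scope": {"name": "smoke"}, "spans": [span]}],
    }]}


def send(payload, endpoint="http://localhost:4318/v1/traces"):
    """POST the payload to an OTLP/HTTP receiver and return the HTTP status."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

With a port-forward such as kubectl -n eval port-forward svc/otel-collector 4318:4318 in another terminal, send(build_smoke_trace()) should return 200, and the span should then appear as a new line in /data/otlp.json.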

2. Turn on kagent tracing

kagent injects OTEL_TRACING_ENABLED into every agent pod from the controller's ConfigMap, and its value wins over anything you set per-agent. So flip it there, then point each agent you care about at the collector. The OTLP HTTP endpoint must include the /v1/traces path, because kagent passes the endpoint to the exporter verbatim (it does not append the path for you).

# enable tracing globally, then restart the controller to pick it up
kubectl -n kagent patch cm kagent-controller --type merge \
  -p '{"data":{"OTEL_TRACING_ENABLED":"true"}}'
kubectl -n kagent rollout restart deploy/kagent-controller

# per agent: send traces to the collector (note the /v1/traces path)
kubectl -n kagent patch agent sre-crew-langgraph --type=json -p '[{"op":"add",
  "path":"/spec/byo/deployment/env/-","value":{
    "name":"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
    "value":"http://otel-collector.eval.svc.cluster.local:4318/v1/traces"}}]'

3. Instrument each framework

Turning the switch on is not enough on its own: each framework needs its tracer set up so the model and tool calls become spans. This is the part that differs per framework.

Framework	What to add
LangGraph	Build the app with `KAgentApp(graph=graph, ..., tracing=True)`. kagent-core then instruments the OpenAI client and httpx.
Google ADK	The ADK `KAgentApp` does not start tracing itself: call `kagent.core.configure_tracing(name, namespace, fastapi_app=app)` on the built app.
CrewAI	Same `configure_tracing` call, plus `litellm.callbacks = ["otel"]`, because CrewAI runs the model through LiteLLM, which the OpenAI instrumentor does not see.
AutoGen	No kagent-core, so set up a small TracerProvider with an OTLP exporter and the OpenAI instrumentor yourself (about ten lines).

The LiteLLM gotcha. LiteLLM's OTel callback exports over gRPC, and it reads the standard OTEL_EXPORTER_OTLP_TRACES_ENDPOINT. Point that at the collector's gRPC port (:4317, no path) for CrewAI, not the HTTP port, or every export fails with StatusCode.UNAVAILABLE. The HTTP-exporting frameworks use :4318/v1/traces. Same collector, two ports.
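
The two endpoint rules are easy to mix up, so a small check can catch a wrong value before an agent is patched. This is an illustrative sketch, not part of kagent or the lab: the function name is hypothetical, the port numbers are this lab's collector ports, and it assumes the endpoint is written with a scheme (http://...).

```python
from urllib.parse import urlparse


def check_traces_endpoint(protocol, endpoint):
    """Return a list of problems; an empty list means the endpoint looks right."""
    url = urlparse(endpoint)
    problems = []
    if protocol == "http":
        # HTTP exporters get the endpoint verbatim, so the path must be present.
        if url.port != 4318:
            problems.append("HTTP exporters should target port 4318")
        if url.path != "/v1/traces":
            problems.append("HTTP endpoint must end with /v1/traces")
    elif protocol == "grpc":
        # LiteLLM's OTel callback exports over gRPC: port 4317, no path.
        if url.port != 4317:
            problems.append("gRPC exporters should target port 4317")
        if url.path not in ("", "/"):
            problems.append("gRPC endpoint must not carry a path")
    else:
        raise ValueError(f"unknown protocol: {protocol!r}")
    return problems


HOST = "http://otel-collector.eval.svc.cluster.local"
print(check_traces_endpoint("http", f"{HOST}:4318/v1/traces"))  # []
print(check_traces_endpoint("grpc", f"{HOST}:4317"))            # []
print(check_traces_endpoint("grpc", f"{HOST}:4318/v1/traces"))  # two problems
```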

The exact code is in the lab's source under src/sre-crew-* and yaml/eval/ at github.com/tjorourke/solo-labs/tree/main/agent-frameworks-kind.

Write a golden eval set

To score behaviour, agentevals needs to know what good looks like. The golden eval set is a Google ADK EvalSet: for our incident, the expected tool trajectory is inspect, then patch the bad image.

yaml/eval/golden-evalset.json (abbreviated)

{
  "eval_set_id": "sre-checkout-incident",
  "eval_cases": [{
    "eval_id": "diagnose-and-fix-checkout",
    "conversation": [{
      "invocation_id": "incident-1",
      "user_content": { "role": "user",
        "parts": [{ "text": "the checkout service is down - investigate, then fix it" }] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "get_pods", "args": { "namespace": "incident" } },
          { "name": "describe_deployment", "args": { "namespace": "incident", "name": "checkout" } },
          { "name": "patch_deployment_image",
            "args": { "namespace": "incident", "name": "checkout",
                      "container": "checkout", "image": "nginx:1.27" } }
        ]
      }
    }]
  }]
}
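
To show what comparing a run against this golden trajectory involves, here is a simplified scorer. It is a sketch, not the implementation used by agentevals or ADK: the match-type names other than ANY_ORDER, the all-or-nothing scoring, and the "actual" run are assumptions made for illustration. The expected calls are copied from the eval set above.

```python
import json


def normalise(call):
    """Reduce a tool call to a comparable (name, canonical-args) pair."""
    return (call["name"], json.dumps(call.get("args", {}), sort_keys=True))


def trajectory_score(expected, actual, match_type="EXACT"):
    """Return 1.0 if the actual calls satisfy the expected ones, else 0.0."""
    exp = [normalise(c) for c in expected]
    act = [normalise(c) for c in actual]
    if match_type == "EXACT":
        return 1.0 if exp == act else 0.0
    if match_type == "IN_ORDER":
        # Expected calls must appear as a subsequence of the actual calls.
        remaining = iter(act)
        return 1.0 if all(e in remaining for e in exp) else 0.0
    if match_type == "ANY_ORDER":
        # Every expected call must appear somewhere; order is ignored.
        pool = list(act)
        for e in exp:
            if e not in pool:
                return 0.0
            pool.remove(e)
        return 1.0
    raise ValueError(f"unknown match type: {match_type!r}")


EXPECTED = [
    {"name": "get_pods", "args": {"namespace": "incident"}},
    {"name": "describe_deployment",
     "args": {"namespace": "incident", "name": "checkout"}},
    {"name": "patch_deployment_image",
     "args": {"namespace": "incident", "name": "checkout",
              "container": "checkout", "image": "nginx:1.27"}},
]

# Hypothetical run: same calls, but the first two are swapped.
ACTUAL = [EXPECTED[1], EXPECTED[0], EXPECTED[2]]

for mode in ("EXACT", "IN_ORDER", "ANY_ORDER"):
    print(mode, trajectory_score(EXPECTED, ACTUAL, mode))
```

Under these assumptions only ANY_ORDER scores the swapped run as a match, which suits an incident where inspecting pods and describing the Deployment can reasonably happen in either order.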

Get the stats

Run an incident so the agent emits a trace, read the file out of the collector, and score it. One wrinkle: the collector's file exporter writes one OTLP batch per line (newline-delimited JSON), while agentevals wants a single OTLP document, so merge the lines first.

# 1. drive the incident (as Alice, through kagent) so the agent traces
AGENT=sre-crew-langgraph ./scripts/ask.sh "the checkout service is down - investigate, then fix it"

# 2. read the trace file out of the collector's reader sidecar
COL=$(kubectl -n eval get pod -l app=otel-collector -o jsonpath='{.items[0].metadata.name}')
kubectl -n eval exec "$COL" -c reader -- cat /data/otlp.json > trace.ndjson

# 3. merge the per-line batches into one OTLP doc
python3 -c 'import json
rs = []
with open("trace.ndjson") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            rs += json.loads(line).get("resourceSpans", [])
with open("trace.json", "w") as out:
    json.dump({"resourceSpans": rs}, out)'

# 4. score it
agentevals run trace.json -f otlp-json \
  -e yaml/eval/golden-evalset.json \
  -m tool_trajectory_avg_score --trajectory-match-type ANY_ORDER -o table

The table shows the trajectory score, and under it a performance block with exactly the numbers the lab reports: total tokens (prompt and output), the LLM-call count, and latency percentiles. Run it for each framework's trace and you have the comparison.
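
Repeating the scoring step per framework is mechanical, so it can be scripted. The sketch below only builds the command lines, using the same flags as step 4. The helper names and the traces/<agent>.json layout are assumptions, and sre-crew-langgraph is the only agent name taken from this guide.

```python
import shlex

EVALSET = "yaml/eval/golden-evalset.json"


def build_command(trace_path, evalset=EVALSET):
    """Return the agentevals invocation from step 4 for one trace file."""
    return [
        "agentevals", "run", str(trace_path), "-f", "otlp-json",
        "-e", evalset,
        "-m", "tool_trajectory_avg_score",
        "--trajectory-match-type", "ANY_ORDER",
        "-o", "table",
    ]


def commands_for(agents, trace_dir="traces"):
    """Map each agent name to its command, assuming traces/<agent>.json exists."""
    return {agent: build_command(f"{trace_dir}/{agent}.json") for agent in agents}


for agent, cmd in commands_for(["sre-crew-langgraph"]).items():
    print(agent, "->", shlex.join(cmd))
```

To execute rather than print, pass each command list to subprocess.run(cmd, check=True) with the virtualenv from the install step active.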

agentevals output (LangGraph, trimmed)

Trace ... (1 invocations)
  tool_trajectory_avg_score: ...

  Performance Metrics:
    Model: claude-haiku-4-5
    Counts: 3 LLM calls, 1 invocations
    Tokens: 7024 total (6597 prompt + 427 output)
    Overall Latency: p50=5.1s

What came back

Across the four bring-your-own frameworks, scored from their traces for the same incident on the same model and tools:

Framework	LLM calls	Tokens (prompt + output)	Latency (p50)
LangGraph	3	7,024 (6,597 + 427)	5.1s
Google ADK	5	15,271 (14,315 + 956)	5.2s
AutoGen	6	23,170 (22,138 + 1,032)	12.2s
CrewAI	52	87,274 (83,982 + 3,292)	27.9s
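
Two derived figures make the table easier to compare: tokens per LLM call, and total tokens relative to LangGraph. The sketch below computes both from the numbers in the table; the field names and the choice of LangGraph as baseline are illustrative.

```python
# Figures copied from the table above (single runs).
RESULTS = {
    "LangGraph":  {"calls": 3,  "prompt": 6597,  "output": 427},
    "Google ADK": {"calls": 5,  "prompt": 14315, "output": 956},
    "AutoGen":    {"calls": 6,  "prompt": 22138, "output": 1032},
    "CrewAI":     {"calls": 52, "prompt": 83982, "output": 3292},
}


def derive(results, baseline="LangGraph"):
    """Add total tokens, tokens per call, and a multiple of the baseline total."""
    base_total = results[baseline]["prompt"] + results[baseline]["output"]
    rows = {}
    for name, r in results.items():
        total = r["prompt"] + r["output"]
        rows[name] = {
            "total": total,
            "tokens_per_call": round(total / r["calls"]),
            "x_baseline": round(total / base_total, 1),
        }
    return rows


for name, row in derive(RESULTS).items():
    print(f"{name:<11} {row['total']:>7} total  "
          f"{row['tokens_per_call']:>5}/call  {row['x_baseline']:>5}x")
# LangGraph      7024 total   2341/call    1.0x
# Google ADK    15271 total   3054/call    2.2x
# AutoGen       23170 total   3862/call    3.3x
# CrewAI        87274 total   1678/call   12.4x
```

On these single runs CrewAI has the smallest average call but by far the most calls, which is where its 12.4x total comes from.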

The figures move a little run to run, since the model is non-deterministic, so average a few before treating them as definitive. Two honest caveats on the trajectory score: the LangGraph run pauses at its human-approval step, so a single pass stops before applying the patch; and CrewAI's LiteLLM spans report cost cleanly but do not expose individual tool calls in the same shape, so its trajectory is not scored here. The performance numbers above are solid for all four.