The agent-frameworks lab runs the same Kubernetes incident through five frameworks and ends with a table of LLM calls, tokens and latency per framework. This is how that table was produced. agentevals is an open-source tool that scores an agent from its OpenTelemetry traces, not by re-running it, so once each agent emits traces you can compare them all the same way: cost, latency, and whether they called the right tools. Everything here runs against the lab's existing kind cluster, so set that up first and come back.
checkout Deployment, and the gateway). Here we
only add the evaluation layer on top of it.
How agentevals works
agentevals reads a trace file (OTLP JSON or Jaeger JSON), reconstructs the agent's
invocations and tool calls, and scores them. It understands two trace shapes: the
ADK trace format, and standard OpenTelemetry GenAI semantic conventions
(gen_ai.* spans) emitted by LangChain, LiteLLM and others. Two kinds of
metric come out of the box:
- Performance: LLM calls, tokens (prompt and output), and latency, read straight from the spans. No reference needed, no judge model.
- Behaviour: tool_trajectory_avg_score compares the tool calls the agent made against a golden eval set you write, plus LLM-as-judge metrics for response quality.
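To make "read straight from the spans" concrete, here is a minimal sketch that tallies LLM calls and tokens from an OTLP JSON document. It is an illustration only, not agentevals' own code. It assumes the GenAI semantic-convention attributes gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, and it treats any span carrying an input-token count as one LLM call, which is a simplifying assumption. The sample document is made up.

```python
def attr_map(span):
    """Flatten OTLP JSON attributes ([{key, value: {<type>Value: x}}]) into a dict."""
    out = {}
    for attr in span.get("attributes", []):
        value = attr.get("value", {})
        # Each value object holds one typed field, e.g. stringValue or intValue.
        for typed in ("stringValue", "intValue", "doubleValue", "boolValue"):
            if typed in value:
                out[attr["key"]] = value[typed]
                break
    return out


def summarize(otlp_doc):
    """Count LLM calls and sum token usage across every span in the document."""
    calls = prompt = output = 0
    for resource_spans in otlp_doc.get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            for span in scope_spans.get("spans", []):
                attrs = attr_map(span)
                # Assumption: a span with an input-token count is one LLM call.
                if "gen_ai.usage.input_tokens" not in attrs:
                    continue
                calls += 1
                # OTLP JSON may encode 64-bit integers as strings, so coerce.
                prompt += int(attrs["gen_ai.usage.input_tokens"])
                output += int(attrs.get("gen_ai.usage.output_tokens", 0))
    return {"llm_calls": calls, "prompt_tokens": prompt, "output_tokens": output}


# Made-up sample: one LLM span and one tool span without token usage.
sample = {"resourceSpans": [{"scopeSpans": [{"spans": [
    {"name": "chat", "attributes": [
        {"key": "gen_ai.usage.input_tokens", "value": {"intValue": "120"}},
        {"key": "gen_ai.usage.output_tokens", "value": {"intValue": "30"}},
    ]},
    {"name": "execute_tool get_pods", "attributes": []},
]}]}]}

print(summarize(sample))
```

Because nothing here needs a reference answer or a judge model, the same function can be pointed at any framework's trace.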
Install agentevals
It is a Python CLI. A virtualenv keeps it off your system Python.
# on your laptop, not in the cluster
python3 -m venv .aevals && source .aevals/bin/activate
pip install agentevals-cli
agentevals --help            # run, serve, evaluator, ...
agentevals evaluator list    # built-in metrics incl. tool_trajectory_avg_score
Set up trace capture
The agents run in the cluster, so the cleanest path is to collect their traces in the cluster too. Three steps: a collector to receive the traces, kagent's tracing switch, and a little per-framework instrumentation.
1. An OpenTelemetry collector that writes a file
A standard collector with an OTLP receiver and a file exporter. The
collector image is distroless, so a small busybox sidecar shares the
data volume and lets you read the file out.
yaml/eval/otel-collector.yaml (config, abbreviated)
receivers:
  otlp:
    protocols:
      http: { endpoint: 0.0.0.0:4318 }
      grpc: { endpoint: 0.0.0.0:4317 }
exporters:
  file: { path: /data/otlp.json }
  debug: { verbosity: basic }
service:
  pipelines:
    traces: { receivers: [otlp], exporters: [file, debug] }
# Deployment: otel/opentelemetry-collector-contrib + a busybox "reader" sidecar
# mounting the same emptyDir at /data, so you can `kubectl exec ... -c reader -- cat`.
2. Turn on kagent tracing
kagent injects OTEL_TRACING_ENABLED into every agent pod from the
controller's ConfigMap, and its value wins over anything you set per-agent. So flip
it there, then point each agent you care about at the collector. The OTLP HTTP
endpoint must include the /v1/traces path, because kagent passes the
endpoint to the exporter verbatim (it does not append the path for you).
# enable tracing globally, then restart the controller to pick it up
kubectl -n kagent patch cm kagent-controller --type merge \
  -p '{"data":{"OTEL_TRACING_ENABLED":"true"}}'
kubectl -n kagent rollout restart deploy/kagent-controller

# per agent: send traces to the collector (note the /v1/traces path)
kubectl -n kagent patch agent sre-crew-langgraph --type=json -p '[{"op":"add",
  "path":"/spec/byo/deployment/env/-","value":{
  "name":"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
  "value":"http://otel-collector.eval.svc.cluster.local:4318/v1/traces"}}]'
3. Instrument each framework
Turning the switch on is not enough on its own: each framework needs its tracer set up so the model and tool calls become spans. This is the part that differs per framework.
| Framework | What to add |
|---|---|
| LangGraph | Build the app with KAgentApp(graph=graph, ..., tracing=True). kagent-core then instruments the OpenAI client and httpx. |
| Google ADK | The ADK KAgentApp does not start tracing itself: call kagent.core.configure_tracing(name, namespace, fastapi_app=app) on the built app. |
| CrewAI | Same configure_tracing call, plus litellm.callbacks = ["otel"], because CrewAI runs the model through LiteLLM, which the OpenAI instrumentor does not see. |
| AutoGen | No kagent-core, so set up a small TracerProvider with an OTLP exporter and the OpenAI instrumentor yourself (about ten lines). |
For CrewAI, point OTEL_EXPORTER_OTLP_TRACES_ENDPOINT at the collector's gRPC
port (:4317, no path), not the HTTP port; otherwise every export fails with
StatusCode.UNAVAILABLE. The HTTP-exporting frameworks use :4318/v1/traces.
Same collector, two ports.
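The two-port rule is easy to get wrong, so a small helper can keep it in one place. This is a sketch, not code from the lab: the function name is hypothetical, and the host is the collector Service name used in the patch command earlier in this guide.

```python
# Collector Service DNS name, as used in the kubectl patch command above.
COLLECTOR_HOST = "otel-collector.eval.svc.cluster.local"


def traces_endpoint(protocol, host=COLLECTOR_HOST):
    """Return the OTLP traces endpoint for an exporter protocol (hypothetical helper)."""
    if protocol == "http":
        # HTTP exporters get the endpoint verbatim, so the path must be included.
        return f"http://{host}:4318/v1/traces"
    if protocol == "grpc":
        # gRPC exporters take host and port only, with no path.
        return f"http://{host}:4317"
    raise ValueError(f"unknown OTLP protocol: {protocol!r}")


print(traces_endpoint("http"))
print(traces_endpoint("grpc"))
```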
The exact code is in the lab's source under src/sre-crew-* and
yaml/eval/ at
github.com/tjorourke/solo-labs/tree/main/agent-frameworks-kind.
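For orientation, the manual setup described in the AutoGen row might look roughly like the sketch below. It is not the lab's code. It assumes the opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http and opentelemetry-instrumentation-openai-v2 packages; the lab may use a different OpenAI instrumentor. The function name and the service name in the usage note are hypothetical.

```python
# Collector HTTP endpoint; the path is required because the exporter uses it verbatim.
DEFAULT_ENDPOINT = "http://otel-collector.eval.svc.cluster.local:4318/v1/traces"


def setup_tracing(service_name, endpoint=DEFAULT_ENDPOINT):
    """Install a global TracerProvider that exports spans over OTLP/HTTP."""
    # Imported inside the function so this module still loads where the
    # OpenTelemetry packages are not installed.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    # Batch spans in the background and ship them to the collector.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
    trace.set_tracer_provider(provider)
    # Turn OpenAI client calls into gen_ai.* spans.
    OpenAIInstrumentor().instrument()
    return provider
```

Call it once at startup, before the agent creates its model client, for example setup_tracing("sre-crew-autogen") with whatever service name you prefer.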
Write a golden eval set
To score behaviour, agentevals needs to know what good looks like. The golden eval
set is a Google ADK EvalSet: for our incident, the expected tool
trajectory is inspect, then patch the bad image.
yaml/eval/golden-evalset.json (abbreviated)
{
  "eval_set_id": "sre-checkout-incident",
  "eval_cases": [{
    "eval_id": "diagnose-and-fix-checkout",
    "conversation": [{
      "invocation_id": "incident-1",
      "user_content": { "role": "user",
        "parts": [{ "text": "the checkout service is down - investigate, then fix it" }] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "get_pods", "args": { "namespace": "incident" } },
          { "name": "describe_deployment", "args": { "namespace": "incident", "name": "checkout" } },
          { "name": "patch_deployment_image",
            "args": { "namespace": "incident", "name": "checkout",
                      "container": "checkout", "image": "nginx:1.27" } }
        ]
      }
    }]
  }]
}
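To show what comparing against a golden trajectory means, here is a simplified sketch of order-insensitive and order-sensitive matching over tool calls shaped like the tool_uses entries above. It is not the ADK or agentevals implementation: the function names are hypothetical, and allowing extra tool calls in the agent's run is an assumption of this sketch.

```python
def expected_tool_uses(eval_set):
    """Collect the expected tool calls from every invocation in an EvalSet-shaped dict."""
    calls = []
    for case in eval_set.get("eval_cases", []):
        for invocation in case.get("conversation", []):
            calls += invocation.get("intermediate_data", {}).get("tool_uses", [])
    return calls


def same_call(a, b):
    """Two tool calls match when both the tool name and the arguments are equal."""
    return a["name"] == b["name"] and a.get("args", {}) == b.get("args", {})


def matches_any_order(expected, actual):
    """Every expected call appears somewhere in actual; each actual call is used once."""
    remaining = list(actual)
    for want in expected:
        for i, got in enumerate(remaining):
            if same_call(want, got):
                del remaining[i]
                break
        else:
            return False  # no unused actual call matched this expected call
    return True


def matches_in_order(expected, actual):
    """Expected calls appear in actual as a subsequence (same relative order)."""
    it = iter(actual)  # shared iterator, so each search resumes where the last ended
    return all(any(same_call(want, got) for got in it) for want in expected)


golden = {"eval_cases": [{"conversation": [{"intermediate_data": {"tool_uses": [
    {"name": "get_pods", "args": {"namespace": "incident"}},
    {"name": "describe_deployment", "args": {"namespace": "incident", "name": "checkout"}},
]}}]}]}

# Made-up agent run: same calls in a different order, plus one extra call.
actual = [
    {"name": "describe_deployment", "args": {"namespace": "incident", "name": "checkout"}},
    {"name": "get_pods", "args": {"namespace": "incident"}},
    {"name": "get_events", "args": {"namespace": "incident"}},
]

expected = expected_tool_uses(golden)
print(matches_any_order(expected, actual), matches_in_order(expected, actual))
```

Note that arguments are compared exactly here, so a golden set that pins values such as the image tag will only match a run that used the same values.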
Get the stats
Run an incident so the agent emits a trace, read the file out of the collector, and score it. One wrinkle: the collector's file exporter writes one OTLP batch per line (newline-delimited JSON), while agentevals wants a single OTLP document, so merge the lines first.
# 1. drive the incident (as Alice, through kagent) so the agent traces
AGENT=sre-crew-langgraph ./scripts/ask.sh "the checkout service is down - investigate, then fix it"

# 2. read the trace file out of the collector's reader sidecar
COL=$(kubectl -n eval get pod -l app=otel-collector -o jsonpath='{.items[0].metadata.name}')
kubectl -n eval exec "$COL" -c reader -- cat /data/otlp.json > trace.ndjson

# 3. merge the per-line batches into one OTLP doc
python3 -c 'import json,sys; rs=[]
for l in open("trace.ndjson"):
    l=l.strip()
    if l: rs+=json.loads(l).get("resourceSpans",[])
json.dump({"resourceSpans":rs}, open("trace.json","w"))'

# 4. score it
agentevals run trace.json -f otlp-json \
  -e yaml/eval/golden-evalset.json \
  -m tool_trajectory_avg_score --trajectory-match-type ANY_ORDER -o table
The table shows the trajectory score, and under it a performance block with exactly the numbers the lab reports: total tokens (prompt and output), the LLM-call count, and latency percentiles. Run it for each framework's trace and you have the comparison.
agentevals output (LangGraph, trimmed)
Trace ... (1 invocations)
tool_trajectory_avg_score: ...
Performance Metrics:
Model: claude-haiku-4-5
Counts: 3 LLM calls, 1 invocations
Tokens: 7024 total (6597 prompt + 427 output)
Overall Latency: p50=5.1s
What came back
Across the four bring-your-own frameworks, scored from their traces for the same incident on the same model and tools:
| Framework | LLM calls | Tokens (prompt + output) | Latency (p50) |
|---|---|---|---|
| LangGraph | 3 | 7,024 (6,597 + 427) | 5.1s |
| Google ADK | 5 | 15,271 (14,315 + 956) | 5.2s |
| AutoGen | 6 | 23,170 (22,138 + 1,032) | 12.2s |
| CrewAI | 52 | 87,274 (83,982 + 3,292) | 27.9s |
The figures move a little run to run, since the model is non-deterministic, so average a few before treating them as definitive. Two honest caveats on the trajectory score: the LangGraph run pauses at its human-approval step, so a single pass stops before applying the patch; and CrewAI's LiteLLM spans report cost cleanly but do not expose individual tool calls in the same shape, so its trajectory is not scored here. The performance numbers above are solid for all four.
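As a quick sanity check on the table, the sketch below confirms that each total equals prompt plus output and derives two secondary figures: tokens per LLM call and total tokens relative to LangGraph. The numbers are copied from the table above; the derived figures are plain arithmetic on a single run per framework, so they carry the same run-to-run variance.

```python
# (LLM calls, prompt tokens, output tokens, total tokens), copied from the table.
rows = {
    "LangGraph": (3, 6597, 427, 7024),
    "Google ADK": (5, 14315, 956, 15271),
    "AutoGen": (6, 22138, 1032, 23170),
    "CrewAI": (52, 83982, 3292, 87274),
}


def tokens_per_call(name):
    """Average total tokens consumed per LLM call for one framework."""
    calls, _prompt, _output, total = rows[name]
    return total / calls


def relative_to(name, baseline="LangGraph"):
    """Total tokens for a framework as a multiple of the baseline's total."""
    return rows[name][3] / rows[baseline][3]


for name, (calls, prompt, output, total) in rows.items():
    # Each reported total should be the sum of its prompt and output tokens.
    assert prompt + output == total, name
    print(f"{name}: {tokens_per_call(name):.0f} tokens/call, "
          f"{relative_to(name):.1f}x LangGraph total")
```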
See also
- The agent-frameworks lab: the five frameworks, the incident, and the comparison this guide measures (Appendix 2)
- agentevals and the agentevals-dev/agentevals repo
- OpenTelemetry GenAI semantic conventions