MastertheMesh
Solo · agentgateway · kagent · token budgets · JWT · RateLimitConfig · LangGraph · kind
Built · Runs on kind

Per-Team LLM Token Budgets at the Gateway

Tom O'Rourke
EMEA Field CTO · Solo.io

Two teams. One LLM. Different budgets. The agents don't enforce anything: the gateway counts the tokens reported in each response per team, debits the right bucket, and returns 429 when a team blows through its hourly token budget.

Solo Enterprise agentgateway | RateLimitConfig · type:TOKEN | JWT auth · jwt.team → X-Team-ID | kagent BYO LangGraph | Prometheus + Grafana | kind

The premise. When agents start chatting with frontier LLMs at production volume, the bill stops being theoretical. Token spend is the new cloud spend: it bursts unpredictably, it correlates with prompt length more than user count, and a single runaway loop in one team can drain the organisation's monthly budget in an afternoon.

Putting "stop the spend" in the agent code doesn't scale — every team writes its own enforcement, every refactor risks bypassing it, and the audit story becomes "look at N agents' logs". Putting it at the gateway means one declarative budget per team, measured against the LLM's own usage.total_tokens field, with one place to look when a team asks "did we get throttled today?".

DBA team: 5,000 tokens / hour
  Per day: 50,000 tokens
  JWT claims: sub=dba, team=dba
  Burns through: ~4–6 long-essay prompts (400–1500 completion tokens each)

Support team: 20,000 tokens / hour
  Per day: 200,000 tokens
  JWT claims: sub=support, team=support
  Burns through: ~20+ prompts (comfortable headroom for the demo)
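As a rough check on the "~4–6 prompts" figure for the DBA team, the sketch below works out how many essay prompts it takes to reach 5,000 tokens. It assumes each response costs between 400 and 1,500 completion tokens (the mock LLM's range) and ignores the small prompt-token overhead. The function name is illustrative and not part of the lab.

```python
import math


def prompts_to_exhaust(budget: int, tokens_per_prompt: int) -> int:
    """Number of prompts after which the running total reaches the budget."""
    return math.ceil(budget / tokens_per_prompt)


HOURLY_BUDGET = 5_000            # DBA team, tokens per hour
LOW, HIGH = 400, 1_500           # completion tokens per essay (mock LLM range)
MIDPOINT = (LOW + HIGH) // 2     # 950

worst_case = prompts_to_exhaust(HOURLY_BUDGET, HIGH)     # every essay is long
typical = prompts_to_exhaust(HOURLY_BUDGET, MIDPOINT)    # essays average out
best_case = prompts_to_exhaust(HOURLY_BUDGET, LOW)       # every essay is short

print(worst_case, typical, best_case)   # 4 6 13
```

With long essays the budget is gone after 4 prompts and with average ones after about 6, which lines up with the card above. A run of unusually short essays could stretch that to 13.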

What you'll build

Architecture diagram (text summary):

  dba-agent: team=dba · 5k/hr · 50k/day · env LLM_JWT ← jwt-dba · sends Authorization: Bearer ⟨dba⟩
  support-agent: team=support · 20k/hr · 200k/day · env LLM_JWT ← jwt-support · sends Authorization: Bearer ⟨support⟩

  agentgateway (data plane):
    EnterpriseAgentgatewayPolicy jwt-auth (Strict, RS256; transformation: X-Team-ID = jwt.team)
    Request path: 1. validate JWT → 2. stamp X-Team-ID → 3. pre-flight RL → 4. upstream → 5. parse usage → 6. debit
    EnterpriseAgentgatewayPolicy team-budget: entRateLimit → RateLimitConfig "team-token-budget" (type: TOKEN)
    Descriptors: team=dba → 5k/hr · 50k/day | team=support → 20k/hr · 200k/day

  rate-limit-service (gRPC :8083): gcr.io/gloo-mesh/rate-limiter:0.18.5
    Implements the Envoy ratelimit gRPC API
    Pre-flight: "is team^dba over budget?"
    Post-flight: "INCR team^dba by N"

  Redis (ext-cache, :6379): deployment ext-cache-enterprise-agentgateway
    Per-team counters, rolling hour/day windows
    Auto-installed by the enterprise-agentgateway chart (emptyDir storage; counters reset on pod restart)

  jwt-issuer (Go): RSA key + 2 JWTs, serves /.well-known/jwks.json (the gateway fetches the JWKS through a remote backendRef)
  mock-llm (Python): /v1/chat/completions · 10 essay templates · usage.total_tokens = prompt + 400..1500
  Prometheus + Grafana: scrape /metrics every 10s from agentgateway + ratelimit (read-only, not in the request path)

  Edges: agents → gateway (Bearer JWT); gateway → mock-llm (forward, if budget OK); gateway ↔ rate-limit-service (pre-flight, debit); rate-limit-service ↔ Redis (GET, INCR)

Two agent Deployments running the same image — only the mounted JWT Secret differs. On every request the gateway validates the JWT, stamps X-Team-ID from jwt.team, and makes a gRPC pre-flight check to the rate-limit-service (does the team have any budget left?). If yes, it forwards to the LLM. When the response comes back, the gateway parses usage.total_tokens and makes a second gRPC call to debit that team's bucket in Redis. The very next request from a team whose bucket is now over the line gets HTTP 429. Prometheus scrapes both the gateway and the rate-limit-service every 10s; the Grafana dashboard is a read-only view.
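The pre-flight check and post-flight debit described above can be sketched as a small in-memory simulation. This is a stand-in for the gateway, rate-limit-service, and Redis rolled into one class, not the real implementation. The `Gateway` class, the `handle` method, and the "at or over the limit" comparison are assumptions made for illustration.

```python
from dataclasses import dataclass, field

LIMITS = {"dba": 5_000, "support": 20_000}   # tokens per hour, from the lab


@dataclass
class Gateway:
    """In-memory stand-in for gateway + rate-limit-service + Redis."""
    used: dict = field(default_factory=dict)

    def handle(self, team: str, llm_total_tokens: int) -> int:
        # Pre-flight: reject only if the bucket is already at or over the limit
        # (the exact comparison is an assumption).
        if self.used.get(team, 0) >= LIMITS[team]:
            return 429
        # The upstream call happens here; its cost is only known from the response.
        # Post-flight: debit usage.total_tokens from the team's bucket.
        self.used[team] = self.used.get(team, 0) + llm_total_tokens
        return 200


gw = Gateway()
statuses = [gw.handle("dba", 1_400) for _ in range(5)]
print(statuses)                      # [200, 200, 200, 200, 429]
print(gw.used["dba"])                # 5600
print(gw.handle("support", 1_400))   # 200
```

Because the cost of a request is only known after the response arrives, the request that crosses the line still succeeds and the bucket ends up above 5,000. The rejection lands on the following request, and the support bucket is unaffected throughout.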

About — where Redis comes from (you don't deploy it, the chart does)

The rate-limit-service needs somewhere to store per-team counters that survives gateway restarts, holds the rolling-window state, and stays consistent across rate-limit-service replicas. The Solo enterprise-agentgateway helm chart installs a Redis Deployment for that purpose — you don't need to deploy it, configure it, or reference it from your own YAML.

What the chart installs

Persistence: none, by default

The Redis pod uses an emptyDir volume mounted at /data (not a PVC). There is no StatefulSet, no volumeClaimTemplates, and no --appendonly yes arg on the Redis container. So the counters live only as long as the Redis pod: a pod restart resets every bucket.

Peeking at the buckets

The keys the rate-limiter writes look like ratelimit_solo_io:<descriptor>. To see what's in there right now:

kubectl -n agentgateway-system get deploy ext-cache-enterprise-agentgateway
kubectl -n agentgateway-system get svc  ext-cache-enterprise-agentgateway

# Peek at the actual counter keys + values:
POD=$(kubectl -n agentgateway-system get pod -l app=ext-cache -o name | head -1)
kubectl -n agentgateway-system exec "$POD" -- redis-cli --scan --pattern '*team*'
kubectl -n agentgateway-system exec "$POD" -- redis-cli GET '<the-key>'

Why budgets at the gateway

Concern | In-agent token budget | Gateway-enforced budget
Where the policy lives | Each agent's code or config: many copies, easy to drift | One RateLimitConfig on the path every agent goes through
What counts as "tokens" | Whatever the agent decides to count (often just the request) | The provider's usage.total_tokens, the same number the bill is based on
Identity model | Often baked into the agent's config, with no per-team isolation | One JWT per team; gateway projects jwt.team → X-Team-ID; rate-limit reads that header
Audit story | Each agent's stdout: N agents, N log shapes | Single gateway counter + 429 line per rejection, scraped by Prometheus
Bypass risk | Anyone with code access can raise the limit | Limit is in a CRD, so the change-control story is the same as any infrastructure policy

Steps

1. Clone and bring it up

About — what this does & why

quick.sh up runs 01..07 in order, all idempotent. Needs only a Solo Enterprise licence key — the chat agents talk to the in-cluster mock LLM, not OpenAI, so no provider key is required.

Bash · clone, set the license key, bring up the kind cluster
git clone https://github.com/tjorourke/solo-labs.git
cd solo-labs/agentic-budgets-kind

export AGENTGATEWAY_LICENSE_KEY=...

./scripts/quick.sh up
./scripts/port-forward.sh   # leave running

Then open http://localhost:8080 (kagent dashboard) and http://localhost:3000 (Grafana).

2. The mock LLM (one Python file, deterministic usage)

About — why a mock instead of OpenAI

The lab is about the budget mechanism, not LLM quality. Hitting a real provider during rehearsals would cost money and would make "budget exhaustion in five prompts" hard to reproduce. The mock returns one of 10 pre-baked essay templates and inserts a realistic usage block; the gateway sees the same response shape it would from real OpenAI.

Python · src/mock-llm/server.py (handler excerpt: the bit the rate-limit reads)
async def chat_completions(request: Request) -> JSONResponse:
    body = await request.json()
    messages = body.get("messages") or []
    completion_text = random.choice(TEMPLATES)
    prompt_tokens     = _approx_prompt_tokens(messages)         # word_count × 1.3
    completion_tokens = random.randint(400, 1500)               # variable
    total_tokens      = prompt_tokens + completion_tokens

    return JSONResponse({
        "id": f"chatcmpl-{uuid.uuid4().hex[:24]}",
        "model": body.get("model") or "mock-essay-7b",
        "choices": [{"message": {"role": "assistant", "content": completion_text}}],
        "usage": {
            "prompt_tokens":     prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens":      total_tokens,    # ◄── the gateway debits this
        },
    })
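The excerpt calls `_approx_prompt_tokens` without showing it. Going only by the inline comment (word count × 1.3), a helper of that shape might look like the sketch below. The body, the rounding choice, and the sample messages are guesses for illustration, not the lab's actual code.

```python
def approx_prompt_tokens(messages: list[dict]) -> int:
    """Estimate prompt tokens as word count x 1.3 (hypothetical helper)."""
    words = sum(len(str(m.get("content", "")).split()) for m in messages)
    return round(words * 1.3)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},                  # 5 words
    {"role": "user", "content": "Write me a long essay on indexing strategies."},   # 8 words
]

print(approx_prompt_tokens(messages))   # 13 words * 1.3 = 16.9 -> 17
```

A short prompt like this contributes well under 20 tokens, so in the demo the randomised 400–1500 completion tokens dominate each debit.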

3. The JWT issuer — identity for every request

About — why JWTs carry the identity

Per-team token budgets only make sense if the gateway knows which team sent each request. The wire shape for "who" is a signed JWT — a token carrying a team claim that the gateway verifies against the issuer's public key. The gateway then projects the validated jwt.team claim into an X-Team-ID request header (a CEL transformation on the JWT policy — see section 4), and the rate-limit configuration reads that header to pick the per-team bucket.

In production this is your existing IdP — Entra ID, Okta, Auth0, Keycloak. The lab ships a small in-cluster issuer that mints the same standard-shape JWTs (RS256, JWKS at /.well-known/jwks.json) so the whole identity round-trip is visible end-to-end without standing up an external IdP first. Each kagent chat agent mounts its team's token as $LLM_JWT from a Secret and presents it on every LLM call — it does NOT set X-Team-ID itself; the gateway stamps that from the validated claim so a malicious caller can't spoof another team's bucket.

Go · src/jwt-issuer/main.go (claim shape, excerpt)
claims := jwt.MapClaims{
    "iss":  "agentic-budgets-kind",
    "sub":  "dba",
    "team": "dba",
    "iat":  now.Unix(),
    "exp":  now.AddDate(10, 0, 0).Unix(),
    "aud":  "mock-llm",
}
tok := jwt.NewWithClaims(jwt.SigningMethodRS256, claims)
tok.Header["kid"] = "agentic-budgets-key-1"
signed, _ := tok.SignedString(priv)
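To check which claims a minted token carries, the payload segment can be decoded with the standard library alone. The sketch below builds a stand-in token with a dummy signature so it runs offline. It only inspects the payload and does not verify the signature, so it is a debugging aid rather than a substitute for the gateway's validation. The helper names are illustrative.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def decode_claims(token: str) -> dict:
    """Return the JWT payload WITHOUT verifying the signature (inspection only)."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)   # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


# Stand-in token with a dummy signature, built here so the example runs offline.
header = b64url(json.dumps({"alg": "RS256", "kid": "agentic-budgets-key-1"}).encode())
payload = b64url(json.dumps({"iss": "agentic-budgets-kind", "sub": "dba",
                             "team": "dba", "aud": "mock-llm"}).encode())
token = f"{header}.{payload}.signature-goes-here"

claims = decode_claims(token)
print(claims["team"], claims["aud"])   # dba mock-llm
```

Against the lab, the same `decode_claims` call could be pointed at the value an agent receives in `$LLM_JWT`.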

4. The gateway side — JWT auth + claim projection + token-based RateLimitConfig

About — three policies, one identity

Three things happen on every request before the LLM is even called:

  1. JWT validation. The gateway verifies the bearer token's signature against the JWKS served by jwt-issuer. Strict-mode — no token means 401 before reaching any backend.
  2. Claim projection. The same JWT policy includes a transformation that sets X-Team-ID from the validated jwt.team claim. The value: field of each header is a strict CEL expression, so value: jwt.team resolves to the team string. The transformation uses set: not add:, so any client-sent X-Team-ID is overwritten.
  3. Token rate-limit. The RateLimitConfig has one descriptor per team, each with a different requestsPerUnit ceiling. The requestHeaders action reads X-Team-ID (just stamped by step 2) and matches against the right descriptor. type: TOKEN debits the response's usage.total_tokens, not the request count.

The agent does NOT set X-Team-ID itself — the gateway is the source of truth, so a malicious caller can't spoof another team's bucket by sending a different header value.
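The anti-spoofing property comes from the overwrite semantics of `set`. A minimal Python model of that behaviour, assuming the header is replaced regardless of case and that the claims have already been validated, could look like this. The function and variable names are hypothetical.

```python
def project_team_header(request_headers: dict[str, str], claims: dict) -> dict[str, str]:
    """Mimic a `set` transformation: drop any client-sent X-Team-ID, then write the claim."""
    cleaned = {k: v for k, v in request_headers.items() if k.lower() != "x-team-id"}
    cleaned["X-Team-ID"] = claims["team"]
    return cleaned


# A dba caller tries to charge its traffic to the support bucket.
incoming = {"Authorization": "Bearer <dba token>", "x-team-id": "support"}
validated_claims = {"sub": "dba", "team": "dba"}   # what JWT validation produced

outgoing = project_team_header(incoming, validated_claims)
print(outgoing["X-Team-ID"])   # dba
```

With `add` semantics the spoofed value could survive alongside the stamped one, leaving the rate-limit action free to read either. Overwriting leaves exactly one value, and it comes from the signed token.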

YAML · yaml/agentgateway/jwt-policy.yaml (JWT auth + claim → header projection)
apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayPolicy
metadata: { name: jwt-auth, namespace: agentgateway-system }
spec:
  targetRefs:
    - { group: gateway.networking.k8s.io, kind: Gateway, name: budgets-gateway }
  traffic:
    jwtAuthentication:
      mode: Strict
      providers:
        - issuer: "agentic-budgets-kind"
          audiences: ["mock-llm"]
          jwks:
            remote:
              jwksPath: "/.well-known/jwks.json"
              backendRef: { group: "", kind: Service, name: jwt-issuer, namespace: budgets, port: 8080 }
    transformation:
      request:
        set:
          - name: X-Team-ID
            value: jwt.team        # ← strict CEL expression, resolves to the team claim
YAML · yaml/agentgateway/ratelimit-config.yaml (the entire budget)
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata: { name: team-token-budget, namespace: agentgateway-system }
spec:
  raw:
    descriptors:
      - key: team
        value: "dba"
        rateLimit: { requestsPerUnit: 5000,  unit: HOUR }
      - key: team
        value: "support"
        rateLimit: { requestsPerUnit: 20000, unit: HOUR }
    rateLimits:
      - actions:
          - requestHeaders:
              descriptorKey: "team"
              headerName: "X-Team-ID"
        type: TOKEN
YAML · yaml/agentgateway/ratelimit-policy.yaml (attach to the LLM route)
apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayPolicy
metadata: { name: team-budget, namespace: llm }   # same namespace as llm-route
spec:
  targetRefs:                                      # local refs — no namespace field
    - { group: gateway.networking.k8s.io, kind: HTTPRoute, name: llm-route }
  traffic:
    entRateLimit:
      global:
        rateLimitConfigRefs:
          - { name: team-token-budget, namespace: agentgateway-system }

5. The agent side — identical code, different JWT

About — two agents, one image

The two kagent Agent CRs use the same BYO image; only the mounted JWT Secret differs. The agent reads $LLM_JWT once at startup and passes it on every /v1/chat/completions request. There is no langchain LLM client, just a plain httpx.AsyncClient.post, so the 429 reaches the agent code directly instead of being eaten by a retry layer.

Python · src/langgraph-agent/agent.py (_call_llm, excerpt)
async with httpx.AsyncClient(timeout=REQUEST_TIMEOUT) as client:
    resp = await client.post(LLM_URL, json=payload, headers={
        "Authorization": f"Bearer {LLM_JWT}",            # ◄── identity
        "Content-Type":  "application/json",
    })

if resp.status_code == 429:
    return (
        f"Sorry — your team's LLM token budget is exhausted. "
        f"Please try again later. (HTTP 429 from the gateway; "
        f"team={TEAM_LABEL!r}.)"
    )
…
return resp.json()["choices"][0]["message"]["content"]

Walk through the demo

Open http://localhost:8080 in one tab and http://localhost:3000 in another. In Grafana, navigate to the "Per-Team LLM Token Budgets" dashboard. Now run through four scenes.

Scene 1 — Baseline: both teams idle

Open kagent dashboard + Grafana side-by-side.

Agents
Two agents listed: dba-agent and support-agent. Neither has been prompted yet.
Grafana
All four stat panels read 0. The "tokens consumed — live" timeseries is a flatline.
┌─ Per-Team LLM Token Budgets ─────────────────────────────────┐
│ This hour                                                    │
│   dba:     0 / 5,000    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░  green  │
│   support: 0 / 20,000   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░  green  │
│ Today                                                        │
│   dba:     0 / 50,000                                        │
│   support: 0 / 200,000                                       │
│ 429 rejections (last 1h): -                                  │
└──────────────────────────────────────────────────────────────┘
Why this matters: the budgets are CRDs the platform team owns. Nothing in the agent code or the LLM provider knows them. Adding a third team would mean adding one Secret, one Agent CR, and four lines to the RateLimitConfig.

Scene 2 — DBA burns through its hourly budget

Pick dba-agent. Prompt 4–5 times in a row:

You type
Write me a long essay on indexing strategies.
Agent replies
Each prompt comes back with one of the canned essay templates plus an internal usage log showing 400–1500 completion tokens. The agent doesn't surface the count to the user — but Grafana does.
Grafana dashboard 'Per-Team LLM Token Budgets'. Top panel: 'Tokens consumed — per team (5-min rate, tokens/sec)' showing dba's green line climbing as essays are sent. Four stat tiles: 'Tokens this hour — dba (budget 5,000)' showing 5.68k in red (over budget), 'Tokens this hour — support (budget 20,000)' showing 2.14k in green, 'Gen-AI tokens — gateway (output)' 6.52k, 'Gen-AI tokens — gateway (input)' 1.30k. Bottom panel: 'Over-budget tokens — per team (last 1h)'.
Live Grafana view: dba tile turned red after crossing 5k tokens. The gateway is reading usage.total_tokens off each response and debiting the per-team bucket — these numbers are the same ones the cloud bill would be based on.
Why this matters: the gateway is reading the LLM's usage.total_tokens field directly off the response body and debiting the per-team bucket. The number on the dashboard is the same number the cloud bill would be based on — not a request count, not a pre-flight estimate. Real spend.

Scene 3 — DBA blocked; Support unaffected

Still as dba-agent. Send one more prompt now that the bucket is past 5,000.

You type
Now write an essay on query plans.
Agent replies
“Sorry — your team's LLM token budget is exhausted. Please try again later. (HTTP 429 from the gateway; team='dba'.)” That's the raw 429 from the gateway, surfaced through the agent's chat output.
kagent chat with dba-agent. The user prompts 'Write me a long essay on indexing strategies.' and the agent replies: 'Sorry — your team's LLM token budget is exhausted. Please try again later. (HTTP 429 from the gateway; team=dba.)'. The Agent Details sidebar says: 'DBA team chat agent. Calls the gateway-fronted mock LLM with a team=dba JWT; gets 429 when the team's hourly token budget is exhausted.'
The agent surfaces the gateway's 429 verbatim, with the team name. The chat doesn't loop, doesn't degrade — it tells the user the budget is gone.

Now switch to support-agent. Same prompt.

Agent replies
Gets a perfectly normal essay back. Support's budget is separate — the gateway debits a different bucket.
kagent chat with support-agent. The user prompts 'Write me a long essay on indexing strategies.' and the agent returns a full essay-style response about customer-support metrics. The Agent Details sidebar says: 'Customer-support team chat agent. Same code as dba-agent, but with a team=support JWT and a 4x larger token budget.'
Same agent code, different JWT — support's 20k/hr bucket is untouched, so the request goes through normally.
Why this matters: the budget is a contract per team, not per process or per request. One noisy team can hit its ceiling without affecting any other team's ability to keep working. From an SRE perspective, “DBA is over budget” becomes a routine alert, not an outage.

Scene 4 — Reset: the bucket refills

Wait out the hour, or reset the counters by restarting the rate-limit service:

Shell
kubectl -n agentgateway-system delete pod -l app.kubernetes.io/component=ratelimit
Now try
Switch back to dba-agent and prompt again. Normal essay response, dashboard climbing from zero again.
Why this matters: rate-limit windows are time-bucketed, not stateful in the per-request sense. A team is “over budget” for the rest of the current window only — when the next window rolls, they're back in business. For demo speed, restarting the rate-limit service zeros the buckets instantly. In production you'd let the natural hour-rollover take care of it.
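The time-bucketed behaviour can be modelled with a counter keyed by the index of the current window. The sketch below assumes fixed, clock-aligned one-hour windows, which is how the description above reads. The actual rate-limiter may align or expire its windows differently, and the class is illustrative only.

```python
class FixedWindowBucket:
    """Token counter that starts from zero whenever the clock enters a new window."""

    def __init__(self, limit: int, window_seconds: int = 3600):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts: dict[int, int] = {}      # window index -> tokens used

    def _window(self, now: float) -> int:
        return int(now // self.window_seconds)

    def over_budget(self, now: float) -> bool:
        return self.counts.get(self._window(now), 0) >= self.limit

    def debit(self, now: float, tokens: int) -> None:
        w = self._window(now)
        self.counts[w] = self.counts.get(w, 0) + tokens


bucket = FixedWindowBucket(limit=5_000)
t = 10 * 3600 + 55 * 60                  # 10:55, as seconds since midnight
bucket.debit(t, 5_680)

print(bucket.over_budget(t))             # True: still inside the 10:00 window
print(bucket.over_budget(t + 5 * 60))    # False: 11:00 starts a fresh window
```

Nothing is refunded or decayed here. The old window's count simply stops being consulted once the clock moves into the next one.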

Going per-user: token counts in access logs

Everything above keys on the team. The bucket key is just a descriptor, so the same shape works per user: swap the jwt.team claim for a per-user identity like jwt.sub and each distinct user gets their own bucket. The lab mints distinct per-user identities (alice and bob on the dba team, carol on support) so you can see real per-user usage even while the budget stays keyed per team.

Runs on kind. This is wired up and validated end to end. The gateway stamps user_id and the token counts on every access line, Promtail ships them to Loki, and the Grafana dashboard shows a per-user token table, a per-user log panel, and a per-user tokens-over-time chart, sitting next to the per-team budget panels. Bring it up with ./scripts/quick.sh up and open the dashboard.

Grafana dashboard, per-user section. Left panel 'Tokens by user — sum over range (access logs · Loki)' is a table: alice 4.69K, bob 719, carol 6.09K, support 6.37K. Right panel 'Per-user LLM access log (user · team · tokens · model · status)' shows JSON-RPC access lines, each carrying user, team, tokens_total, in, out, model and status — support and carol requests at status=200 with token counts like 1450 and 1361, alice and dba requests at status=429 with tokens_total=n/a once the dba team budget is spent. Bottom panel 'Tokens per user over time (5-min sum)' charts one line per user.
Per-user view alongside the per-team budget panels: a token sum per user (left), the raw access log carrying user_id and token counts (right), and tokens per user over time (bottom).

Enforcement and visibility are two separate jobs, though. The rate-limit counter caps spend, but to actually see per-user token counts you want the numbers attributed to an identity. You could add a per-user label to the Prometheus token metric, but a label per user becomes one live time series per user, which gets expensive at thousands of users. Access logs avoid that: the identity rides in the log payload, not as a live series, and you sum tokens per user at query time in the log store. Keep Prometheus for the aggregate dashboard, use access logs for the per-user breakdown.

Add the attribution with a second policy on the Gateway. The llm.* token variables are available in the logging context, and jwt.sub gives you the identity:

YAML · per-user token attribution on the access log
apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayPolicy
metadata:
  name: token-attribution-logging
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: budgets-gateway
  frontend:
    accessLog:
      attributes:
        add:
          - { name: user_id,      expression: 'default(jwt.sub, "unknown")' }
          - { name: tokens_total, expression: 'default(string(llm.totalTokens), "n/a")' }
          - { name: tokens_in,    expression: 'default(string(llm.inputTokens), "n/a")' }
          - { name: tokens_out,   expression: 'default(string(llm.outputTokens), "n/a")' }
          - { name: model,        expression: 'default(llm.responseModel, "n/a")' }
          - { name: status,       expression: 'string(response.code)' }

Every LLM request now logs a line carrying user_id=… tokens_total=… model=… status=…. In the demo the proxy writes to stdout, so you can sum per user straight off the pod logs:

Bash · sum tokens per user from the gateway access log
kubectl --context kind-budgets -n agentgateway-system logs deploy/budgets-gateway \
  | grep -oE 'user_id=[^ ]+ .*tokens_total=[0-9]+' \
  | awk '{for(i=1;i<=NF;i++){if($i~/^user_id=/)u=substr($i,9);
          if($i~/^tokens_total=/)t=substr($i,14)} sum[u]+=t}
         END{for(u in sum) printf "%-16s %d tokens\n", u, sum[u]}'
# dba      18342 tokens
# support   9551 tokens
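For anything beyond a one-liner, the same aggregation is easier to maintain in Python. The sketch below assumes the access lines carry space-separated, unquoted key=value pairs as described above. The sample lines are made up for illustration, and the exact log format in a given deployment may differ.

```python
import re
from collections import defaultdict

PAIR = re.compile(r"(\w+)=(\S+)")   # key=value pairs, values without spaces


def tokens_per_user(lines):
    """Sum tokens_total per user_id, skipping lines without a numeric count."""
    totals = defaultdict(int)
    for line in lines:
        fields = dict(PAIR.findall(line))
        user, tokens = fields.get("user_id"), fields.get("tokens_total", "")
        if user and tokens.isdigit():      # 429 lines carry tokens_total=n/a
            totals[user] += int(tokens)
    return dict(totals)


# Made-up sample lines in the key=value shape described above.
sample = [
    "user_id=alice tokens_total=1450 model=mock-essay-7b status=200",
    "user_id=carol tokens_total=1361 model=mock-essay-7b status=200",
    "user_id=alice tokens_total=n/a model=n/a status=429",
    "user_id=alice tokens_total=900 model=mock-essay-7b status=200",
]

print(tokens_per_user(sample))   # {'alice': 2350, 'carol': 1361}
```

To run it against the cluster, pipe the `kubectl logs` output into a script that passes `sys.stdin` to `tokens_per_user`.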

In production you ship the logs off-box instead of tailing a pod. The same frontend.accessLog has an otlp sink: point it at an OpenTelemetry collector and forward to Loki, Elastic, BigQuery, or whatever you run, then aggregate there.

YAML · export access logs to a collector, then aggregate in the log store
  frontend:
    accessLog:
      attributes:
        add: [ ... as above ... ]
      otlp:
        backendRef:
          kind: Service
          name: otel-collector
          namespace: monitoring
          port: 4317

# Then, per-user tokens in Loki:
#   sum by (user_id) (
#     sum_over_time({app="budgets-gateway"} | logfmt | tokens_total != "n/a" | unwrap tokens_total [1h])
#   )

Watch out for a few things. Use spec.frontend.accessLog.attributes.add, not spec.config.logging.fields.add; they are different structures. Wrap integer llm.* values in string() with a default() fallback, as above. Do not add a has(llm) filter guard: it compile-panics the proxy, so leave the filter field off and non-LLM requests simply show n/a. The llm.* token variables live in the logging and metrics context only, not in rate-limit or transformation policies, so this attribution is configured separately from the budget enforcement even though both key off the same JWT.

Try it: generate per-user data and view it in Grafana

The lab mints a JWT Secret per user (jwt-alice, jwt-bob, jwt-carol, plus jwt-dba / jwt-support) in the kagent namespace. You do not need the chat agents for this: point curl at the gateway and replay each user's token. alice and bob are on the dba team, so they share the dba 5,000/hour bucket and you will see some 429s once it is spent. carol and support are on the support team and keep flowing.

Bash · replay a few requests per user through the gateway
# forward the gateway so curl can reach it
kubectl --context kind-budgets -n agentgateway-system \
  port-forward svc/budgets-gateway 8090:80 &

# fire 5 requests as each user, using their minted JWT
for u in alice bob carol support; do
  TOK=$(kubectl --context kind-budgets -n kagent \
    get secret jwt-$u -o jsonpath='{.data.token}' | base64 -d)
  for i in $(seq 1 5); do
    curl -s -o /dev/null -X POST http://localhost:8090/v1/chat/completions \
      -H "Authorization: Bearer $TOK" -H "Content-Type: application/json" \
      -d '{"model":"mock-essay-7b","messages":[{"role":"user","content":"write an essay about databases"}]}'
  done
  echo "sent 5 requests as $u"
done
Bash · log in to Grafana and open the per-user panels
# forward Grafana (port-forward.sh also does this)
kubectl --context kind-budgets -n monitoring \
  port-forward svc/monitoring-grafana 3000:80 &

# open http://localhost:3000  →  log in  admin / admin
# →  dashboard "Per-Team LLM Token Budgets"
# →  scroll to the "Tokens by user", "Per-user LLM access log",
#     and "Tokens per user over time" panels at the bottom

Within a few seconds Promtail ships the new access lines to Loki and the per-user panels fill in, one row and one line per user_id, while the per-team budget panels above keep enforcing the bucket.

Talking points

Teardown

./scripts/quick.sh teardown

See also

Versions

Built and verified on:

Enterprise
Solo Enterprise for agentgateway: v2.3.4
Gateway API: v1.4.0