What a virtual key actually is
In LiteLLM and Portkey, a virtual key is just a token string you give to a person, a team, or an app. They paste it into whatever makes the request (a curl command, Claude Code, an internal service) and send it on every call. The gateway checks the string, works out who it belongs to, and applies that owner's rules: which models they can use, how fast they can call, how many tokens they get. Then it swaps in your real OpenAI or Anthropic key on the way out, so the person sending requests never sees the real key. One real key sits behind many of these strings, and you can switch any one of them off without touching the real key or anyone else's virtual key.
So a virtual key is really doing three jobs at once: name who is calling, cap what they can spend, and keep the real provider key hidden. Hold on to those three jobs. They are exactly how agentgateway rebuilds the feature.
The two keys people confuse
Most of the confusion comes from the word "key" doing double duty. When a workshop tells you to "set up a key" for an LLM backend, that is the provider credential. It is not a virtual key. They live in different places and do opposite jobs.
The provider credential (gateway → LLM)
The real Anthropic or OpenAI key. It lives in a Secret and is
referenced from the backend at
spec.policies.auth.secretRef. agentgateway injects it
on the upstream hop on its way out to the provider, and the route
strips any client-supplied Authorization or
x-api-key header so callers never see or set it.
One credential, shared by everything routing through that backend. It carries no caller identity and no budget. It is a secret to be hidden, not a handle to be issued.
The virtual key (client → gateway)
The token string you give to a user, team, or app. They send it
in Authorization: Bearer on every request, the
gateway validates it, maps it to an identity, and enforces that
identity's budget and limits. It resolves to the provider
credential above without ever exposing it.
Many of these, one per consumer. Each carries identity and budget, and each is revocable on its own.
| | Provider credential | Virtual key |
|---|---|---|
| Who it authenticates | Gateway → LLM provider | Client → Gateway |
| Where it lives | backend.policies.auth.secretRef | API-key Secret + apiKeyAuthentication |
| Carries identity? | No | Yes (metadata.user_id) |
| Carries budget / limits? | No | Yes (token-based rate limit, keyed by identity) |
| How many | One, shared by the backend | Many, one per consumer |
| Hides the provider key? | It is the provider key | Yes, the client never sees it |
How agentgateway delivers virtual keys
agentgateway builds virtual keys out of three first-class capabilities it already gives you: API-key authentication, token-based rate limiting, and per-key observability. Each is a standard policy you configure on its own, and together they produce the full virtual-key experience: a per-consumer key that identifies the caller, carries its own budget, and resolves to a hidden provider credential.
Composing it from these three pieces is the strength here. The auth, the budgets, and the metrics are independent, so you tune each one to the policy you actually want, layer them with the rest of agentgateway (JWT, prompt guards, model routing) in the same place, and keep everything in plain Kubernetes resources you already manage with GitOps. You get the virtual-key outcome without inheriting one vendor's fixed, all-in-one shape.
The three capabilities map straight onto the three jobs of a virtual key:
| Job of a virtual key | agentgateway capability | Where it is configured |
|---|---|---|
| Identify the caller | API-key authentication | traffic.apiKeyAuthentication on EnterpriseAgentgatewayPolicy, backed by a Secret of per-user keys + metadata |
| Cap what they spend | Token-based rate limiting | traffic.rateLimit.global, descriptor keyed on the API key's metadata |
| Track per-key usage | Observability metrics | agentgateway_gen_ai_client_token_usage in Prometheus, broken down by user_id |
How a request flows
- A request arrives with a virtual key in Authorization: Bearer.
- agentgateway validates the key against the API-key Secret. An invalid key is rejected before anything else runs, so it never touches a budget.
- The caller's user_id is read from that key's metadata.
- The request is checked against that user's token budget on the rate-limit server.
- If budget remains, the request goes upstream. agentgateway injects the hidden provider credential and strips the client's inbound auth header.
- Token usage from the response is deducted from the user's budget and recorded in metrics.
- If the budget is exhausted, the request is rejected with 429 Too Many Requests.
- Budgets refill on the configured interval: daily, hourly, whatever you set.
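The steps above can be sketched as a small in-memory model. This is only an illustration of the ordering, not agentgateway's implementation: the `API_KEYS` dict and the `Gateway` class are hypothetical stand-ins for the API-key Secret and the rate-limit server, and the 401 returned for an unknown key is an assumed status code.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the API-key Secret: bearer token -> metadata.
API_KEYS = {
    "sk-alice-abc123def456": {"user_id": "alice"},
    "sk-bob-xyz789uvw012": {"user_id": "bob"},
}
DAILY_BUDGET = 100_000  # tokens per user per window


@dataclass
class Gateway:
    # Hypothetical stand-in for the rate-limit server: tokens spent per user_id.
    used: dict = field(default_factory=dict)

    def handle(self, authorization: str, tokens_in_response: int) -> int:
        """Return the HTTP status the caller would see."""
        token = authorization.removeprefix("Bearer ").strip()
        meta = API_KEYS.get(token)
        if meta is None:
            return 401  # assumed status; rejected before any budget is touched
        user = meta["user_id"]
        if self.used.get(user, 0) >= DAILY_BUDGET:
            return 429  # budget exhausted
        # The upstream call would happen here, with the provider credential swapped in.
        self.used[user] = self.used.get(user, 0) + tokens_in_response
        return 200

    def refill(self) -> None:
        """Reset every budget, as the configured interval would."""
        self.used.clear()


gw = Gateway()
print(gw.handle("Bearer sk-alice-abc123def456", 100_000))  # 200
print(gw.handle("Bearer sk-alice-abc123def456", 10))       # 429
print(gw.handle("Bearer sk-bob-xyz789uvw012", 10))         # 200
```

Because usage is only known after the response comes back, the sketch deducts tokens after the call, so the last permitted request can overshoot the budget.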
Mind the evaluation order
Authentication runs before rate limiting, so an unauthenticated
request never consumes quota. But rate limiting runs before
prompt guards. A request that is later blocked by a content guard
with a 403 has already drawn down the user's token
budget. Worth knowing before you reason about why a budget moved on
a request that "failed".
Build it: per-user budgets
This is the smallest complete setup: two virtual keys, for Alice and
Bob, each with an independent daily budget of 100,000 tokens. Every
field below is from the current
EnterpriseAgentgatewayPolicy schema.
1. Issue the virtual keys
Each entry in stringData is one virtual key. The
key is the bearer token the client sends. The
metadata is what the gateway uses downstream to
identify and meter the caller.
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-keys
  namespace: agentgateway-system
type: Opaque
stringData:
  alice: |
    {
      "key": "sk-alice-abc123def456",
      "metadata": { "user_id": "alice" }
    }
  bob: |
    {
      "key": "sk-bob-xyz789uvw012",
      "metadata": { "user_id": "bob" }
    }
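If you would rather generate keys than hand-type them, a helper along these lines can produce one stringData entry per consumer. It is a sketch: `new_virtual_key` is a hypothetical name, and the `sk-<user>-<random>` shape simply copies the examples above rather than any documented requirement.

```python
import json
import secrets


def new_virtual_key(user_id: str, **extra_metadata: str) -> tuple[str, str]:
    """Return (entry_name, json_value) for one stringData entry."""
    # Key shape copied from the examples above; it is an assumption, not a rule.
    key = f"sk-{user_id}-{secrets.token_urlsafe(24)}"
    value = json.dumps(
        {"key": key, "metadata": {"user_id": user_id, **extra_metadata}},
        indent=2,
    )
    return user_id, value


name, value = new_virtual_key("carol")
print(f"{name}: |")
for line in value.splitlines():
    print(f"  {line}")
```

`secrets.token_urlsafe` draws from the operating system's CSPRNG, which is what you want for a bearer token.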
2. Require the virtual key
Attach API-key authentication to the Gateway so every route demands
a valid key. mode: Strict rejects anything without one.
apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayPolicy
metadata:
  name: api-key-auth
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    apiKeyAuthentication:
      mode: Strict
      secretRef:
        name: llm-api-keys
3. Attach the per-key budget
The descriptor entry pulls user_id out of the API key's
metadata with a CEL expression, and unit: Tokens meters
tokens rather than requests. Each user gets their own bucket.
apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayPolicy
metadata:
  name: daily-token-budget
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    rateLimit:
      global:
        domain: token-budgets
        backendRef:
          kind: Service
          name: rate-limit-server
          namespace: agentgateway-system
          port: 8081
        descriptors:
        - entries:
          - name: user_id
            expression: 'apiKey.metadata.user_id'
          unit: Tokens
4. Set the actual limit on the rate-limit server
The policy says "meter per user_id in tokens"; the
rate-limit server holds the number. domain and the
descriptor key must match the policy above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rate-limit-config
  namespace: agentgateway-system
data:
  config.yaml: |
    domain: token-budgets
    descriptors:
      - key: user_id
        rate_limit:
          unit: day
          requests_per_unit: 100000  # 100k tokens/day per user
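Since a mismatch between the two halves fails quietly, it can be worth linting them before you apply. The following is a minimal sketch that takes both documents as already-parsed dicts; `check_wiring` is a hypothetical helper, not part of agentgateway or the rate-limit server.

```python
def check_wiring(policy: dict, server_config: dict) -> list[str]:
    """Return mismatches between the rateLimit policy and the server config."""
    problems = []
    glob = policy["spec"]["traffic"]["rateLimit"]["global"]
    if glob["domain"] != server_config["domain"]:
        problems.append(
            f"domain mismatch: policy={glob['domain']!r} "
            f"server={server_config['domain']!r}"
        )
    for descriptor in glob["descriptors"]:
        # Walk the server's nested descriptors in the same order as the entries.
        level = server_config.get("descriptors", [])
        for entry in descriptor["entries"]:
            matches = [d for d in level if d["key"] == entry["name"]]
            if not matches:
                problems.append(f"no server descriptor for key {entry['name']!r}")
                break
            level = [c for d in matches for c in d.get("descriptors", [])]
    return problems


policy = {"spec": {"traffic": {"rateLimit": {"global": {
    "domain": "token-budgets",
    "descriptors": [{
        "entries": [{"name": "user_id", "expression": "apiKey.metadata.user_id"}],
        "unit": "Tokens",
    }],
}}}}}
server = {
    "domain": "token-budgets",
    "descriptors": [
        {"key": "user_id", "rate_limit": {"unit": "day", "requests_per_unit": 100000}},
    ],
}
print(check_wiring(policy, server))  # []
```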
5. The backend, where the provider credential hides
This is the other key. openai-secret holds the real
provider credential and never leaves the gateway.
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: openai
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: gpt-3.5-turbo
  policies:
    auth:
      secretRef:
        name: openai-secret
An HTTPRoute sending /openai to that backend
finishes the wiring. Then it behaves exactly like a virtual-key
system: Alice's key works until her 100,000 tokens are gone and she
starts getting 429s, while Bob keeps going on his own
untouched budget.
# Alice's virtual key
curl "$INGRESS_GW_ADDRESS/openai" \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello!"}]}'

# Once her budget is spent:
# HTTP/1.1 429 Too Many Requests
# x-ratelimit-limit: 100000
# x-ratelimit-remaining: 0
Tiered budgets by user type
Because identity is just metadata on the key, you scale budgets by
adding a field. Tag each key with a tier, then add it as
a second descriptor entry. Free, standard, and premium users now draw
from different daily limits.
Add the tier to the key, then to the descriptor
# In the Secret
alice: |
  { "key": "sk-alice-abc123def456",
    "metadata": { "user_id": "alice", "tier": "premium" } }

# In the rateLimit descriptor
descriptors:
- entries:
  - name: tier
    expression: 'apiKey.metadata.tier'
  - name: user_id
    expression: 'apiKey.metadata.user_id'
  unit: Tokens
Tier limits on the rate-limit server
domain: token-budgets
descriptors:
  - key: tier
    value: "free"
    descriptors:
      - key: user_id
        rate_limit: { unit: day, requests_per_unit: 10000 }
  - key: tier
    value: "premium"
    descriptors:
      - key: user_id
        rate_limit: { unit: day, requests_per_unit: 500000 }
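To see how a nested config like this picks a limit, here is a rough sketch of the lookup. It assumes matching works level by level in entry order, with an exact key and value preferred over a key-only rule; `resolve_limit` is a hypothetical helper, and the real rate-limit server may differ in detail.

```python
from typing import Optional

TIER_CONFIG = {
    "domain": "token-budgets",
    "descriptors": [
        {"key": "tier", "value": "free", "descriptors": [
            {"key": "user_id", "rate_limit": {"unit": "day", "requests_per_unit": 10_000}},
        ]},
        {"key": "tier", "value": "premium", "descriptors": [
            {"key": "user_id", "rate_limit": {"unit": "day", "requests_per_unit": 500_000}},
        ]},
    ],
}


def resolve_limit(config: dict, entries: list[tuple[str, str]]) -> Optional[dict]:
    """Find the rate_limit for an ordered list of (key, value) entries."""
    level = config["descriptors"]
    node = None
    for key, value in entries:
        # Prefer an exact key+value rule, then fall back to a key-only rule.
        exact = [d for d in level if d["key"] == key and d.get("value") == value]
        key_only = [d for d in level if d["key"] == key and "value" not in d]
        candidates = exact or key_only
        if not candidates:
            return None  # nothing in this sketch's config applies
        node = candidates[0]
        level = node.get("descriptors", [])
    return node.get("rate_limit") if node else None


print(resolve_limit(TIER_CONFIG, [("tier", "premium"), ("user_id", "alice")]))
# {'unit': 'day', 'requests_per_unit': 500000}
```

In this sketch a tier with no matching rule resolves to no limit at all, so it is worth checking that every tier you issue keys for has an entry.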
Multi-tenant virtual keys
The same move scopes keys to a tenant as well as a user. Add a
tenant_id to the key metadata and lead the descriptor
with it, so every user's budget is nested under their tenant.
# In the rateLimit descriptor
descriptors:
- entries:
  - name: tenant_id
    expression: 'apiKey.metadata.tenant_id'
  - name: user_id
    expression: 'apiKey.metadata.user_id'
  unit: Tokens
Shorter budget windows
The refresh interval lives on the rate-limit server, not the policy.
Drop unit from day to hour (or
minute, second) for tighter control, so a
runaway key can only burn its allowance for an hour before it
refills rather than a whole day.
# In the rate-limit-config ConfigMap
domain: token-budgets
descriptors:
  - key: user_id
    rate_limit:
      unit: hour
      requests_per_unit: 10000  # 10k tokens/hour per user
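One thing to check when you shorten the window is the daily ceiling it implies. The arithmetic below is a sketch that assumes fixed windows and a key that drains every one of them; `max_tokens_per_day` is a hypothetical helper.

```python
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3_600, "day": 86_400}


def max_tokens_per_day(unit: str, requests_per_unit: int) -> int:
    """Upper bound on tokens one key can spend in 24h if it drains every window."""
    windows_per_day = 86_400 // SECONDS_PER_UNIT[unit]
    return requests_per_unit * windows_per_day


print(max_tokens_per_day("day", 100_000))  # 100000
print(max_tokens_per_day("hour", 10_000))  # 240000
```

So 10k tokens per hour limits how much a runaway key can burn in any single hour, but allows up to 240k tokens across a full day, more than the 100k daily budget used earlier.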
Track per-key spend
The third job, knowing what each key actually costs, comes from the
token-usage metric, broken down by the same user_id.
agentgateway records input and output tokens per request, so you can
total a user's daily consumption and turn it into money with your
provider's pricing.
# Total tokens per user over the last 24h
sum by (user_id) (
  increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])
)
# Percentage of a 100k daily budget used, per user
sum by (user_id) (
  increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])
) / 100000 * 100
# Cost per user over the last 24h (example: $0.50 / 1M input, $1.50 / 1M output)
# Each side is aggregated to user_id first so the input and output series match on labels.
  sum by (user_id) (increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) / 1e6 * 0.50
+ sum by (user_id) (increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h])) / 1e6 * 1.50
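The same arithmetic is easy to sanity-check outside Prometheus. This sketch uses the example prices from the query comment and the 100k daily budget; the function names are hypothetical.

```python
def cost_usd(
    input_tokens: float,
    output_tokens: float,
    input_price_per_million: float = 0.50,
    output_price_per_million: float = 1.50,
) -> float:
    """Convert token counts into dollars at per-million-token prices."""
    return (
        (input_tokens / 1e6) * input_price_per_million
        + (output_tokens / 1e6) * output_price_per_million
    )


def budget_used_percent(total_tokens: float, daily_budget: float = 100_000) -> float:
    """Share of the daily token budget consumed, as a percentage."""
    return total_tokens / daily_budget * 100


# A user who spent 60k input and 40k output tokens today:
print(round(cost_usd(60_000, 40_000), 4))    # 0.09
print(budget_used_percent(60_000 + 40_000))  # 100.0
```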
One policy area per policy
If two EnterpriseAgentgatewayPolicy resources target the
same Gateway or route with overlapping backend.ai
fields, one silently overwrites the other based on creation order,
and both still report ACCEPTED and ATTACHED.
Keep auth, rate limiting, and guards in separate policies, as the
examples above do, so they compose instead of clobbering each other.
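Because the status conditions will not flag the clash, a pre-apply check is one way to catch it. The sketch below compares two policies as parsed dicts and reports spec fields that both set when they share a target. `overlapping_fields` is a hypothetical helper, and `exampleField` is a placeholder, not a real schema field.

```python
def leaf_paths(node, prefix=""):
    """Yield a dotted path for every non-dict value under a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix


def overlapping_fields(policy_a: dict, policy_b: dict) -> set[str]:
    """Return spec fields set by both policies, if they share a target."""
    def targets(policy):
        return {(ref["kind"], ref["name"]) for ref in policy["spec"]["targetRefs"]}

    def fields(policy):
        body = {k: v for k, v in policy["spec"].items() if k != "targetRefs"}
        return set(leaf_paths(body))

    if not targets(policy_a) & targets(policy_b):
        return set()
    return fields(policy_a) & fields(policy_b)


gateway = [{"group": "gateway.networking.k8s.io", "kind": "Gateway",
            "name": "agentgateway-proxy"}]
# "exampleField" is a placeholder used only to show an overlap.
first = {"spec": {"targetRefs": gateway, "backend": {"ai": {"exampleField": 1}}}}
second = {"spec": {"targetRefs": gateway, "backend": {"ai": {"exampleField": 2}}}}
print(overlapping_fields(first, second))  # {'backend.ai.exampleField'}
```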
Checklist
Standing up virtual keys
- Two different keys: the provider credential hides at backend.policies.auth.secretRef, the virtual key is the client-facing handle in the API-key Secret. Don't conflate them.
- Identity rides on key metadata. metadata.user_id is what the rate limiter and the metrics both key on, pulled out with apiKey.metadata.user_id in CEL.
- Budgets need both halves: the rateLimit policy declares what to meter (unit: Tokens, keyed by user), the rate-limit-server ConfigMap holds how much. The domain and descriptor key must match across the two.
- Use mode: Strict on apiKeyAuthentication so a missing key is rejected, not waved through.
- Rate limiting runs before prompt guards, so a request blocked by a guard has still spent budget. Authentication runs before rate limiting, so a bad key spends nothing.
- Scale by adding metadata, not new machinery: a tier field gives you tiered budgets, a tenant_id field gives you multi-tenant keys, both as extra descriptor entries.
- Split auth, rate limiting, and guards into separate policies. Overlapping backend.ai fields in two policies silently overwrite by creation order while both show healthy status.
- Keep metric label cardinality sane. user_id is fine for spend tracking; resist adding high-cardinality labels that will overwhelm Prometheus.