The short version. Every agentgateway component starts from a very small footprint and scales up with load, so tiny deployments stay tiny. At a realistic LLM load of 1,000 concurrent users each sending a request a second, the proxy sits at about 150 MB of memory and 100m of CPU. The control plane is separate and scales with how much configuration you have, not with traffic. The rest of this page breaks those two stories apart, shows how each one grows, and turns them into Kubernetes requests and limits you can drop into a manifest.
Two components, two scaling stories
Agentgateway ships as a few components with different scale characteristics. For capacity planning, two matter: the control plane and the proxy. They grow on different inputs, so you size them independently.
Control plane
scales with configuration, not traffic
Grows with the number of configurations (services, pods, routes,
policies), the rate of change, and the number of connected proxies.
A reservation of 100m CPU and 128 MB RAM is
enough for most clusters. For large clusters with thousands of routes
and services it can grow toward roughly 1 GB for every
10,000 resources. Run multiple replicas for high availability or
horizontal scaling.
Proxy (data plane)
scales with concurrent requests
Grows with the number of concurrent requests, the amount of configuration applied, the type of traffic (tiny "hello" messages versus million-token context windows), and which policies are in play. This is the component your per-1,000-request number describes. Run multiple replicas for high availability or horizontal scaling.
The proxy at a real LLM load
For a typical LLM consumption use case, here is what the proxy actually consumes at a measured reference load.
Reference load: 1,000 concurrent users, each sending 1 request per second, each request 5,000 tokens spread across 20 messages. That is roughly 1,000 requests per second of real LLM traffic flowing through one proxy.
At that load the proxy sits at about 150 MB of memory and 100m of CPU. That footprint increases proportionally as concurrent users, request volume, or request sizes go up. The chart below takes the measured point at 1,000 users and projects that proportional growth so you can read off a starting estimate for higher loads.
How to read it. The solid segment is the measured point at 1,000 concurrent users. The dashed line is the proportional projection beyond it. CPU tracks the same shape, around 100m per 1,000 users. Use it for a first estimate, then right-size from the real numbers once your own traffic is flowing, since request size and the policies you enable move the line.
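The proportional projection described above can be sketched in a few lines of Python. This is a rough estimator, not a published formula: it simply scales the one measured point (about 150 MB and 100m CPU at 1,000 concurrent users) linearly, and the function name `estimate_proxy` is made up for this illustration. It is only meaningful at or above the measured point, and it assumes request size and enabled policies match the reference load.

```python
# Measured reference point quoted on this page:
# ~150 MB of memory and ~100m of CPU at 1,000 concurrent users.
REFERENCE_USERS = 1_000
REFERENCE_MEMORY_MB = 150.0
REFERENCE_CPU_MILLICORES = 100.0


def estimate_proxy(concurrent_users: int) -> tuple[float, float]:
    """Scale the reference point proportionally (an assumption, not a guarantee).

    Returns (memory in MB, CPU in millicores) for the whole proxy tier.
    """
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    scale = concurrent_users / REFERENCE_USERS
    return REFERENCE_MEMORY_MB * scale, REFERENCE_CPU_MILLICORES * scale


if __name__ == "__main__":
    for users in (1_000, 2_500, 5_000):
        memory_mb, cpu_m = estimate_proxy(users)
        print(f"{users:>6} users -> ~{memory_mb:.0f} MB, ~{cpu_m:.0f}m CPU")
```

Treat the output as a first guess for the dashed part of the line, then replace it with observed numbers once real traffic is flowing.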
The control plane scales with configuration
The control plane does not care about request volume. It grows with how
much you configure: services, pods, routes and policies. It starts at the
128 MB baseline and climbs toward roughly 1 GB
as you approach 10,000 resources.
How to read it. Typical LLM deployments live at the bottom-left of this line because they do not have many routes or configurations. The slope only matters once you are managing thousands of services and routes. Add replicas for availability rather than for capacity.
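For a quick control-plane number, the two points given here (the 128 MB baseline and roughly 1 GB near 10,000 resources) can be joined with a straight line. That linear shape is an assumption made for this sketch, as is reading "1 GB" as 1,000 MB; the page only states the two endpoints, and the function name is hypothetical.

```python
BASELINE_MB = 128.0              # control-plane baseline quoted on this page
LARGE_CLUSTER_MB = 1_000.0       # "roughly 1 GB", taken as 1,000 MB for illustration
LARGE_CLUSTER_RESOURCES = 10_000  # resource count at which ~1 GB is quoted


def estimate_control_plane_mb(resources: int) -> float:
    """Linearly interpolate control-plane memory between the two quoted points.

    The straight line is an assumption; real growth also depends on the rate
    of change and the number of connected proxies.
    """
    if resources < 0:
        raise ValueError("resources must be non-negative")
    slope = (LARGE_CLUSTER_MB - BASELINE_MB) / LARGE_CLUSTER_RESOURCES
    return BASELINE_MB + slope * resources


if __name__ == "__main__":
    for count in (0, 500, 2_000, 10_000):
        print(f"{count:>6} resources -> ~{estimate_control_plane_mb(count):.0f} MB")
```

For a typical LLM deployment with a few hundred resources, the estimate stays close to the baseline, which matches the bottom-left position described above.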
Route and configuration scale
Most LLM use cases do not have a large number of routes, but it is worth
knowing what configuration costs on the proxy side. The cleanest number
here comes from John Howard's public Gateway API benchmark: the
agentgateway proxy completes the 5,000-route test at 40 MB,
up from 4 MB at rest, so roughly 8 MB of memory
per 1,000 routes and services. Comparable data planes in the same test
start at 60 to 90 MB and grow to 1 to 2 GB.
Two things to be clear about. This is a configuration-scale measurement, not a traffic one: the benchmark drives plain HTTP requests at a fixed rate across the route table to measure footprint as routes grow, with no LLM, token or agent traffic involved. That keeps it separate from the 150 MB LLM figure earlier on this page. And the 40 MB is a data-plane number, the proxy itself, which is the component this section is about. The figures come from the benchmark's Part 2 run, the first to include agentgateway.
How to read it. This is the one chart drawn straight from independently reproducible third-party data rather than a projection. The agentgateway proxy runs along the bottom axis, going from 4 MB at rest to 40 MB at 5,000 routes. The shaded band is where the other tested data planes sit, starting at 60 to 90 MB and climbing into the 1 to 2 GB range over the same test.
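The per-route cost can be recomputed from the two benchmark points quoted above. The sketch below assumes memory grows linearly between 4 MB at rest and 40 MB at 5,000 routes, which the benchmark figures suggest but this page does not state outright; the function names are illustrative. Note that the marginal cost above the resting footprint works out to 7.2 MB per 1,000 routes, while dividing the full 40 MB by 5,000 routes gives the rounder 8 MB figure.

```python
REST_MB = 4.0          # proxy memory at rest, from the benchmark figures above
LOADED_MB = 40.0       # proxy memory at the 5,000-route test
LOADED_ROUTES = 5_000


def marginal_mb_per_1000_routes() -> float:
    """Memory added per 1,000 routes, excluding the resting footprint."""
    return (LOADED_MB - REST_MB) / (LOADED_ROUTES / 1_000)


def estimate_route_memory_mb(routes: int) -> float:
    """Linear estimate of proxy memory for a given route count (an assumption)."""
    if routes < 0:
        raise ValueError("routes must be non-negative")
    return REST_MB + marginal_mb_per_1000_routes() * routes / 1_000


if __name__ == "__main__":
    print(f"marginal cost: {marginal_mb_per_1000_routes():.1f} MB per 1,000 routes")
    for routes in (100, 1_000, 5_000):
        print(f"{routes:>5} routes -> ~{estimate_route_memory_mb(routes):.1f} MB")
```

This covers configuration memory only. Traffic memory is a separate budget and is not included in the estimate.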
What this means for planning. Configuration size and traffic size are separate budgets. A deployment with heavy LLM traffic but few routes is dominated by the proxy traffic line. A deployment with light traffic but a huge route table is dominated by the control plane and the per-route memory. Size for whichever one your workload actually stresses.
Recommended requests and limits
Solo's published guidance here is a mechanism plus a starting point, not
a lookup table keyed by request count. You set the proxy's CPU and memory
through the EnterpriseAgentgatewayParameters resource, begin
from the documented example, then right-size from what you actually
observe and let an autoscaler add replicas as concurrency grows. There is
no official per-concurrency sizing table to copy, so treat the values
below as starting points, not a ceiling.
| Component | Requests | Limits | Notes |
|---|---|---|---|
| Proxy (documented starting point) | cpu 100m, memory 128Mi | cpu 500m, memory 512Mi | The values from Solo's EnterpriseAgentgatewayParameters example. At ~1,000 concurrent LLM users John measured ~150 MB / 100m, which already tops the 128Mi request, so raise the memory request above your own observed working set and lean on the autoscaler for concurrency. |
| Control plane | cpu 100m, memory 128Mi | cpu 500m, memory up to 1Gi | John's baseline is 100m / 128 MB. Raise the memory limit toward ~1 GB only as you approach roughly 10,000 configured resources. |
Memory is the dimension to watch, because several agentgateway features hold it for the life of a request: streaming responses, semantic cache size, custom prompt-guard webhooks, and long multi-turn tool calling all push the working set up. That is why the limit sits well above the steady-state number, and why you size from observed metrics rather than a fixed table.
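To see why the measured ~150 MB tops a 128Mi request, it helps to convert units: Kubernetes `Mi` is mebibytes (2^20 bytes), while the measured figure is quoted in MB. The sketch below assumes the 150 MB is decimal megabytes, and the 25% headroom factor is an arbitrary illustration rather than a Solo recommendation; both function names are hypothetical.

```python
import math

MB = 1000 * 1000   # decimal megabyte, in bytes
MIB = 1024 * 1024  # mebibyte (Kubernetes "Mi"), in bytes


def mb_to_mib(megabytes: float) -> float:
    """Convert decimal megabytes to mebibytes."""
    return megabytes * MB / MIB


def suggested_memory_request_mib(observed_mb: float, headroom: float = 1.25) -> int:
    """Round an observed working set up to a whole-Mi request with headroom.

    The default headroom of 1.25 is an illustrative assumption; pick your own
    from observed peaks.
    """
    if observed_mb < 0 or headroom < 1.0:
        raise ValueError("observed_mb must be >= 0 and headroom >= 1.0")
    return math.ceil(mb_to_mib(observed_mb) * headroom)


if __name__ == "__main__":
    observed = 150.0  # measured proxy working set at ~1,000 concurrent users
    print(f"{observed:.0f} MB is about {mb_to_mib(observed):.1f} Mi")
    print(f"suggested request: {suggested_memory_request_mib(observed)}Mi")
```

Under those assumptions, 150 MB is about 143 Mi, already above the 128Mi starting request, and a 25% margin lands near 179Mi.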
Set proxy resources via EnterpriseAgentgatewayParameters
apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayParameters
metadata:
  name: production-config
  namespace: agentgateway-system
spec:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
Scale out for HA: autoscaler + disruption budget
apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayParameters
metadata:
  name: production-ha
  namespace: agentgateway-system
spec:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  horizontalPodAutoscaler:
    spec:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
  podDisruptionBudget:
    spec:
      minAvailable: 1
Even the generous limits here are small on a standard Kubernetes worker
node, so the gateway co-locates comfortably and leaves headroom. As
concurrent users climb, the autoscaler adds proxy replicas rather than
growing one pod, and the PodDisruptionBudget keeps at least
one replica serving through rollouts and node drains.
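How the autoscaler reacts to that 80% CPU target can be approximated with the scaling rule from the Kubernetes HorizontalPodAutoscaler documentation: desired replicas equal the current replica count times the ratio of observed to target utilization, rounded up. The sketch below is a simplification that ignores the controller's tolerance band and stabilization windows; the defaults mirror the manifest above, and the function name is illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 80.0,
    min_replicas: int = 2,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA rule: ceil(current * observed / target), then clamp.

    Utilization is a percentage of the pod's CPU request (100m in the
    manifest above), so 160.0 means pods average 160m each.
    """
    if current_replicas < 1 or target_utilization <= 0:
        raise ValueError("need at least one replica and a positive target")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for replicas, utilization in ((2, 160.0), (2, 40.0), (5, 200.0)):
        result = desired_replicas(replicas, utilization)
        print(f"{replicas} replicas at {utilization:.0f}% -> {result} replicas")
```

Because utilization is measured against the CPU request, a small request such as 100m makes the autoscaler react early. Raising the request shifts the point at which replicas are added.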
Acknowledgment
With thanks
The sizing numbers on this page come from John Howard. The control-plane and LLM-traffic figures are from his agentgateway deployment-sizing guidance; the route-scale comparison draws on his public Gateway API benchmark. Thank you, John.
References
Benchmark, docs and adjacent demos
- Solo docs: customize the agentgateway deployment. The published EnterpriseAgentgatewayParameters examples for setting resource requests and limits, replica count, the horizontal pod autoscaler, and the pod disruption budget.
- gateway-api-bench, route scale (Part 2). John Howard's public methodology and numbers for data-plane memory as route counts climb. This is the run that first includes agentgateway; Part 1 is the earlier baseline.
- Virtual keys on agentgateway. API-key auth plus token-based rate limiting and observability, the per-user budget model that rides on top of this proxy.
- Claude Code on a non-Anthropic model. A runnable kind lab that puts real LLM traffic through an agentgateway proxy.