Sizing an agentgateway deployment — CPU, memory and request/limit guidance, by Tom O'Rourke

The short version. Every agentgateway component starts from a very small footprint and scales up with load, so tiny deployments stay tiny. At a realistic LLM load of 1,000 concurrent users each sending a request a second, the proxy sits at about 150 MB of memory and 100m of CPU. The control plane is separate and scales with how much configuration you have, not with traffic. The rest of this page breaks those two stories apart, shows how each one grows, and turns them into Kubernetes requests and limits you can drop into a manifest.

Two components, two scaling stories

Agentgateway ships as a few components with different scale characteristics. For capacity planning, two matter: the control plane and the proxy. They grow on different inputs, so you size them independently.

Control plane: scales with configuration, not traffic

The control plane grows with the number of configured resources (services, pods, routes, policies), the rate of change, and the number of connected proxies. A reservation of 100m CPU and 128 MB RAM is enough for most clusters. For large clusters with thousands of routes and services, memory can grow toward roughly 1 GB for every 10,000 resources. Run multiple replicas for high availability or horizontal scaling.

Proxy (data plane): scales with concurrent requests

The proxy grows with the number of concurrent requests, the amount of configuration applied, the type of traffic (tiny "hello" messages versus million-token context windows), and which policies are in play. This is the component that the 1,000-concurrent-user figure above describes. Run multiple replicas for high availability or horizontal scaling.

The proxy at a real LLM load

For a typical LLM consumption use case, here is what the proxy actually consumes at a measured reference load.

Reference load: 1,000 concurrent users, each sending 1 request per second, each request 5,000 tokens spread across 20 messages. That is roughly 1,000 requests per second of real LLM traffic flowing through one proxy.

Memory: 150 MB per proxy replica

CPU: 100m (a tenth of a core)

That footprint increases proportionally as concurrent users, request volume, or request sizes go up. The chart below takes the measured point at 1,000 users and projects that proportional growth so you can read off a starting estimate for higher loads.

How to read it. The solid segment is the measured point at 1,000 concurrent users. The dashed line is the proportional projection beyond it. CPU tracks the same shape, around 100m per 1,000 users. Use it for a first estimate, then right-size from the real numbers once your own traffic is flowing, since request size and the policies you enable move the line.
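The proportional projection described above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: it assumes growth is strictly proportional to concurrent users from the one measured point (1,000 users, 150 MB, 100m), and the names `estimate_proxy` and `ProxyEstimate` are hypothetical helpers introduced here.

```python
from dataclasses import dataclass

# Measured reference point from this page: 1,000 concurrent users at
# 1 request/second each -> about 150 MB of memory and 100m of CPU.
REFERENCE_USERS = 1_000
REFERENCE_MEMORY_MB = 150.0
REFERENCE_CPU_MILLICORES = 100.0


@dataclass(frozen=True)
class ProxyEstimate:
    memory_mb: float
    cpu_millicores: float


def estimate_proxy(concurrent_users: int) -> ProxyEstimate:
    """Scale the reference point proportionally (assumed straight-line growth)."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    scale = concurrent_users / REFERENCE_USERS
    return ProxyEstimate(
        memory_mb=REFERENCE_MEMORY_MB * scale,
        cpu_millicores=REFERENCE_CPU_MILLICORES * scale,
    )


if __name__ == "__main__":
    for users in (1_000, 2_500, 5_000):
        est = estimate_proxy(users)
        print(f"{users:>6} users: ~{est.memory_mb:.0f} MB, ~{est.cpu_millicores:.0f}m CPU")
```

The result is the total footprint the load implies. Larger request sizes or heavier policies shift the line, so treat the output as a first guess to be replaced by observed metrics.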

The control plane scales with configuration

The control plane does not care about request volume. It grows with how much you configure: services, pods, routes and policies. It starts at the 128 MB baseline and climbs toward roughly 1 GB as you approach 10,000 resources.

How to read it. Typical LLM deployments live at the bottom-left of this line because they do not have many routes or configurations. The slope only matters once you are managing thousands of services and routes. Add replicas for availability rather than for capacity.
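As a rough illustration of that slope, the sketch below interpolates between the two points this page gives: 128 MB at the baseline and roughly 1 GB at 10,000 resources. The straight line between them is an assumption made for the example, "1 GB" is taken as 1,000 MB, and `estimate_control_plane_mb` is a hypothetical helper.

```python
BASELINE_MB = 128.0            # control-plane baseline from this page
LARGE_CLUSTER_MB = 1_000.0     # "roughly 1 GB", taken here as 1,000 MB
LARGE_CLUSTER_RESOURCES = 10_000


def estimate_control_plane_mb(resources: int) -> float:
    """Linear interpolation between the two stated points (an assumption)."""
    if resources < 0:
        raise ValueError("resources must be non-negative")
    growth_mb = LARGE_CLUSTER_MB - BASELINE_MB
    return BASELINE_MB + growth_mb * resources / LARGE_CLUSTER_RESOURCES


if __name__ == "__main__":
    for count in (0, 500, 2_000, 10_000):
        print(f"{count:>6} resources: ~{estimate_control_plane_mb(count):.0f} MB")
```

At a few hundred resources the estimate barely moves off the baseline, which matches the point that typical LLM deployments sit at the bottom-left of the line.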

Route and configuration scale

Most LLM use cases do not have a large number of routes, but it is worth knowing what configuration costs on the proxy side. The cleanest number here comes from John Howard's public Gateway API benchmark: the agentgateway proxy completes the 5,000-route test at 40 MB, up from 4 MB at rest. That is 36 MB of growth across 5,000 routes, or roughly 7 to 8 MB of memory per 1,000 routes and services. Comparable data planes in the same test start at 60 to 90 MB and grow to 1 to 2 GB.

Two things to be clear about. This is a configuration-scale measurement, not a traffic one: the benchmark drives plain HTTP requests at a fixed rate across the route table to measure footprint as routes grow, with no LLM, token or agent traffic involved. That keeps it separate from the 150 MB LLM figure earlier on this page. And the 40 MB is a data-plane number, the proxy itself, which is the component this section is about. The figures come from the benchmark's Part 2 run, the first to include agentgateway.

How to read it. This is the one chart drawn straight from independently reproducible third-party data rather than a projection. The agentgateway proxy runs along the bottom axis, going from 4 MB at rest to 40 MB at 5,000 routes. The shaded band is where the other tested data planes sit, starting at 60 to 90 MB and climbing into the 1 to 2 GB range over the same test.
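The per-route arithmetic behind those two benchmark points can be worked through as follows. This is a sketch that assumes memory grows linearly between 4 MB at rest and 40 MB at 5,000 routes; the benchmark reports the endpoints, not the shape in between, and the function names are hypothetical.

```python
AT_REST_MB = 4.0           # agentgateway proxy with no routes (benchmark figure)
AT_5000_ROUTES_MB = 40.0   # same proxy at the end of the 5,000-route test
BENCHMARK_ROUTES = 5_000


def marginal_mb_per_1000_routes() -> float:
    """Growth above the resting footprint, per 1,000 routes."""
    return (AT_5000_ROUTES_MB - AT_REST_MB) * 1_000 / BENCHMARK_ROUTES


def average_mb_per_1000_routes() -> float:
    """Total footprint divided by route count, per 1,000 routes."""
    return AT_5000_ROUTES_MB * 1_000 / BENCHMARK_ROUTES


def estimate_proxy_config_mb(routes: int) -> float:
    """Assumed linear interpolation from the resting footprint."""
    if routes < 0:
        raise ValueError("routes must be non-negative")
    return AT_REST_MB + (AT_5000_ROUTES_MB - AT_REST_MB) * routes / BENCHMARK_ROUTES


if __name__ == "__main__":
    print(f"marginal: {marginal_mb_per_1000_routes():.1f} MB per 1,000 routes")
    print(f"average:  {average_mb_per_1000_routes():.1f} MB per 1,000 routes")
    print(f"1,000 routes: ~{estimate_proxy_config_mb(1_000):.1f} MB")
```

The "roughly 8 MB per 1,000 routes" figure is the average form (40 MB over 5,000 routes); the growth above the 4 MB resting footprint works out slightly lower.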

What this means for planning. Configuration size and traffic size are separate budgets. A deployment with heavy LLM traffic but few routes is dominated by the proxy traffic line. A deployment with light traffic but a huge route table is dominated by the control plane and the per-route memory. Size for whichever one your workload actually stresses.

Recommended requests and limits

Solo's published guidance here is a mechanism plus a starting point, not a lookup table keyed by request count. You set the proxy's CPU and memory through the EnterpriseAgentgatewayParameters resource, begin from the documented example, then right-size from what you actually observe and let an autoscaler add replicas as concurrency grows. There is no official per-concurrency sizing table to copy, so treat the values below as starting points, not a ceiling.

Component	Requests	Limits	Notes
Proxy (documented starting point)	cpu `100m`, memory `128Mi`	cpu `500m`, memory `512Mi`	The values from Solo's `EnterpriseAgentgatewayParameters` example. At ~1,000 concurrent LLM users John Howard measured ~150 MB and 100m, which already exceeds the 128Mi memory request, so raise the memory request above your own observed working set and lean on the autoscaler for concurrency.
Control plane	cpu `100m`, memory `128Mi`	cpu `500m`, memory up to `1Gi`	John Howard's baseline is 100m and 128 MB. Raise the memory limit toward ~1 GB only as you approach roughly 10,000 configured resources.

Memory is the dimension to watch, because several agentgateway features hold it for the life of a request: streaming responses, semantic cache size, custom prompt-guard webhooks, and long multi-turn tool calling all push the working set up. That is why the limit sits well above the steady-state number, and why you size from observed metrics rather than a fixed table.
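One detail worth checking when comparing observed numbers against a manifest is units: Kubernetes `Mi` is mebibytes (1,048,576 bytes). The sketch below assumes the page's "MB" figures are decimal megabytes; if they are actually MiB, the comparison against the 128Mi request comes out the same way. `mb_to_mib` and `exceeds_request` are hypothetical helpers.

```python
BYTES_PER_MB = 1_000_000      # decimal megabyte (assumed meaning of "MB" here)
BYTES_PER_MIB = 1_048_576     # Kubernetes "Mi" suffix


def mb_to_mib(megabytes: float) -> float:
    """Convert decimal megabytes to mebibytes."""
    return megabytes * BYTES_PER_MB / BYTES_PER_MIB


def exceeds_request(observed_mb: float, request_mib: float) -> bool:
    """True when the observed working set is larger than the memory request."""
    return mb_to_mib(observed_mb) > request_mib


if __name__ == "__main__":
    observed_mb = 150.0  # measured proxy footprint at the reference load
    print(f"{observed_mb:.0f} MB is about {mb_to_mib(observed_mb):.2f} Mi")
    print("exceeds 128Mi request:", exceeds_request(observed_mb, 128))
    print("exceeds 512Mi limit:  ", exceeds_request(observed_mb, 512))
```

At the reference load the proxy sits above the documented 128Mi request but well inside the 512Mi limit, which is the gap the paragraph above describes.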

Set proxy resources via EnterpriseAgentgatewayParameters

apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayParameters
metadata:
  name: production-config
  namespace: agentgateway-system
spec:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

Scale out for HA: autoscaler + disruption budget

apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayParameters
metadata:
  name: production-ha
  namespace: agentgateway-system
spec:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  horizontalPodAutoscaler:
    spec:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
  podDisruptionBudget:
    spec:
      minAvailable: 1

Even the generous limits here are small on a standard Kubernetes worker node, so the gateway co-locates comfortably and leaves headroom. As concurrent users climb, the autoscaler adds proxy replicas rather than growing one pod, and the PodDisruptionBudget keeps at least one replica serving through voluntary disruptions such as node drains and cluster upgrades.
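To see how the autoscaler settings in the manifest above behave, here is a simplified sketch of the replica calculation the Kubernetes HorizontalPodAutoscaler documents: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch deliberately leaves out the controller's tolerance band and stabilization behavior, and `desired_replicas` is a hypothetical helper. Utilization is a percentage of the CPU request, so with a 100m request the 80% target corresponds to 80m of average usage per pod.

```python
def desired_replicas(
    current_replicas: int,
    current_utilization_pct: int,
    target_utilization_pct: int = 80,   # averageUtilization in the manifest
    min_replicas: int = 2,              # minReplicas in the manifest
    max_replicas: int = 10,             # maxReplicas in the manifest
) -> int:
    """Simplified HPA rule: ceil(current * observed / target), then clamp."""
    if current_replicas < 1 or target_utilization_pct <= 0:
        raise ValueError("need at least one replica and a positive target")
    numerator = current_replicas * current_utilization_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    raw = -(-numerator // target_utilization_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two replicas averaging 120% of their CPU request (120m against 100m).
    print(desired_replicas(2, 120))
    # Light load never drops below minReplicas.
    print(desired_replicas(2, 40))
```

Because the target is relative to the request, raising the CPU request without changing the target also raises the absolute usage at which new replicas appear.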

Acknowledgment

With thanks

The sizing numbers on this page come from John Howard. The control-plane and LLM-traffic figures are from his agentgateway deployment-sizing guidance; the route-scale comparison draws on his public Gateway API benchmark. Thank you, John.

References

Benchmark, docs and adjacent demos

Solo docs: customize the agentgateway deployment — the published EnterpriseAgentgatewayParameters examples for setting resource requests and limits, replica count, the horizontal pod autoscaler, and the pod disruption budget.
gateway-api-bench, route scale (Part 2) — John Howard's public methodology and numbers for data-plane memory as route counts climb. This is the run that first includes agentgateway; Part 1 is the earlier baseline.
Virtual keys on agentgateway — API-key auth plus token-based rate limiting and observability, the per-user budget model that rides on top of this proxy.
Claude Code on a non-Anthropic model — a runnable kind lab that puts real LLM traffic through an agentgateway proxy.