MastertheMesh

Sizing an agentgateway deployment

Tom O'Rourke
EMEA Field CTO · Solo.io

Two questions come up on every capacity plan: how much CPU and memory does agentgateway need per 1,000 concurrent requests, and what should I put in the pod's requests and limits? The honest answer starts with how agentgateway is built. It has two components that scale on completely different inputs, and once you separate them the numbers are small and easy to reason about.

Tags: agentgateway, capacity planning, CPU / memory, requests / limits, LLM traffic

The short version. Every agentgateway component starts from a very small footprint and scales up with load, so tiny deployments stay tiny. At a realistic LLM load of 1,000 concurrent users each sending a request a second, the proxy sits at about 150 MB of memory and 100m of CPU. The control plane is separate and scales with how much configuration you have, not with traffic. The rest of this page breaks those two stories apart, shows how each one grows, and turns them into Kubernetes requests and limits you can drop into a manifest.

Two components, two scaling stories

Agentgateway ships as a few components with different scale characteristics. For capacity planning, two matter: the control plane and the proxy. They grow on different inputs, so you size them independently.

Control plane

scales with configuration, not traffic

Grows with the number of configurations (services, pods, routes, policies), the rate of change, and the number of connected proxies. A reservation of 100m CPU and 128 MB RAM is enough for most clusters. For large clusters with thousands of routes and services it can grow toward roughly 1 GB for every 10,000 resources. Run multiple replicas for high availability or horizontal scaling.

Proxy (data plane)

scales with concurrent requests

Grows with the number of concurrent requests, the amount of configuration applied, the type of traffic (tiny "hello" messages versus million-token context windows), and which policies are in play. This is the component your per-1,000-request number describes. Run multiple replicas for high availability or horizontal scaling.

The proxy at a real LLM load

For a typical LLM consumption use case, here is what the proxy actually consumes at a measured reference load.

Reference load: 1,000 concurrent users, each sending 1 request per second, each request 5,000 tokens spread across 20 messages. That is roughly 1,000 requests per second of real LLM traffic flowing through one proxy.

Memory: 150 MB per proxy replica
CPU: 100m (a tenth of a core)

That footprint increases proportionally as concurrent users, request volume, or request sizes go up. The chart below takes the measured point at 1,000 users and projects that proportional growth so you can read off a starting estimate for higher loads.

Figure: Proxy memory vs concurrent LLM users. X axis: concurrent users (1 request/sec each), 0 to 4,000. Y axis: memory (MB), 0 to 600. Measured point: 1,000 users at 150 MB and 100m CPU. Projection: about 600 MB at 4,000 users.

How to read it. The solid segment is the measured point at 1,000 concurrent users. The dashed line is the proportional projection beyond it. CPU tracks the same shape, around 100m per 1,000 users. Use it for a first estimate, then right-size from the real numbers once your own traffic is flowing, since request size and the policies you enable move the line.
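The proportional projection can be sketched in a few lines of Python. This is a minimal illustration, not a Solo tool: the function and constant names are hypothetical, and the only real inputs are the measured point from this page (1,000 users, 150 MB, 100m CPU). It assumes strictly proportional growth, the same assumption the chart makes, and ignores the idle footprint.

```python
from dataclasses import dataclass

# Measured reference point from this page:
# 1,000 concurrent users at 1 request/sec each.
REF_USERS = 1_000
REF_MEMORY_MB = 150.0
REF_CPU_MILLICORES = 100.0


@dataclass(frozen=True)
class ProxyEstimate:
    memory_mb: float
    cpu_millicores: float


def estimate_proxy(concurrent_users: int) -> ProxyEstimate:
    """Scale the measured point proportionally (an assumption, not a measurement)."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    scale = concurrent_users / REF_USERS
    return ProxyEstimate(REF_MEMORY_MB * scale, REF_CPU_MILLICORES * scale)


if __name__ == "__main__":
    for users in (1_000, 2_000, 4_000):
        est = estimate_proxy(users)
        print(f"{users:>5} users: ~{est.memory_mb:.0f} MB, ~{est.cpu_millicores:.0f}m CPU")
```

Treat the output as a first guess only. Larger requests or heavier policies move the real line, so replace the reference constants with your own measurements once you have them.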

The control plane scales with configuration

The control plane does not care about request volume. It grows with how much you configure: services, pods, routes and policies. It starts at the 128 MB baseline and climbs toward roughly 1 GB as you approach 10,000 resources.

Figure: Control-plane memory vs configured resources. X axis: configured resources (services, routes, policies), 0 to 10,000. Y axis: memory (MB), 0 to 1,024. The line starts at the 128 MB baseline (most clusters, 100m CPU) and reaches about 1 GB at 10,000 resources (large clusters).

How to read it. Typical LLM deployments live at the bottom-left of this line because they do not have many routes or configurations. The slope only matters once you are managing thousands of services and routes. Add replicas for availability rather than for capacity.
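As a rough sketch, the control-plane line can be written as a straight line between the two figures quoted on this page: the 128 MB baseline and roughly 1 GB at 10,000 resources. The linear shape and the function name are assumptions made for illustration. The page gives only the two endpoints, and nothing here says how memory behaves beyond 10,000 resources.

```python
# Endpoints quoted on this page; the straight line between them is an assumption.
BASELINE_MB = 128.0
LARGE_CLUSTER_MB = 1024.0
LARGE_CLUSTER_RESOURCES = 10_000


def estimate_control_plane_memory_mb(resources: int) -> float:
    """Linear interpolation between the baseline and the ~1 GB large-cluster figure."""
    if resources < 0:
        raise ValueError("resources must be non-negative")
    growth_mb = LARGE_CLUSTER_MB - BASELINE_MB
    return BASELINE_MB + growth_mb * resources / LARGE_CLUSTER_RESOURCES


if __name__ == "__main__":
    for count in (0, 500, 2_500, 10_000):
        print(f"{count:>6} resources: ~{estimate_control_plane_memory_mb(count):.0f} MB")
```

For a typical LLM deployment with a few hundred resources, the estimate stays close to the baseline, which is why the default reservation is usually enough.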

Route and configuration scale

Most LLM use cases do not have a large number of routes, but it is worth knowing what configuration costs on the proxy side. The cleanest number here comes from John Howard's public Gateway API benchmark: the agentgateway proxy completes the 5,000-route test at 40 MB, up from 4 MB at rest. That is 36 MB of growth across 5,000 routes, so roughly 7 MB of additional memory per 1,000 routes and services, or 8 MB per 1,000 if you divide the full 40 MB. Comparable data planes in the same test start at 60 to 90 MB and grow to 1 to 2 GB.

Two things to be clear about. This is a configuration-scale measurement, not a traffic one: the benchmark drives plain HTTP requests at a fixed rate across the route table to measure footprint as routes grow, with no LLM, token or agent traffic involved. That keeps it separate from the 150 MB LLM figure earlier on this page. And the 40 MB is a data-plane number, the proxy itself, which is the component this section is about. The figures come from the benchmark's Part 2 run, the first to include agentgateway.
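The per-route cost can be derived two ways from the benchmark figures quoted above (4 MB at rest, 40 MB at 5,000 routes). The sketch below shows both. The names are hypothetical, and the straight line between the two measured points is an assumption, since only the endpoints are quoted here.

```python
# Figures quoted from the benchmark discussion on this page.
REST_MB = 4.0
LOADED_MB = 40.0
LOADED_ROUTES = 5_000

# Incremental cost: growth only, excluding the resting footprint.
INCREMENTAL_MB_PER_1000 = (LOADED_MB - REST_MB) / (LOADED_ROUTES / 1000)

# Gross cost: the whole footprint divided across the routes.
GROSS_MB_PER_1000 = LOADED_MB / (LOADED_ROUTES / 1000)


def estimate_route_memory_mb(routes: int) -> float:
    """Resting footprint plus linear per-route growth (linearity is an assumption)."""
    if routes < 0:
        raise ValueError("routes must be non-negative")
    return REST_MB + INCREMENTAL_MB_PER_1000 * routes / 1000


if __name__ == "__main__":
    print(f"incremental: {INCREMENTAL_MB_PER_1000:.1f} MB per 1,000 routes")
    print(f"gross:       {GROSS_MB_PER_1000:.1f} MB per 1,000 routes")
    print(f"1,000 routes: ~{estimate_route_memory_mb(1_000):.1f} MB")
```

The incremental figure works out to 7.2 MB per 1,000 routes and the gross figure to 8 MB. Either way the route table is a small line item next to LLM traffic memory.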

Figure: Data-plane memory vs route count (measured, gateway-api-bench Part 2). X axis: routes and services configured, 0 to 5,000. Y axis: memory (MB), 0 to 2,000. agentgateway: 4 MB at rest to 40 MB at 5,000 routes. Other data planes: start at 60 to 90 MB and grow to 1 to 2 GB.

How to read it. This is the one chart drawn straight from independently reproducible third-party data rather than a projection. The agentgateway proxy runs along the bottom axis, going from 4 MB at rest to 40 MB at 5,000 routes. The shaded band is where the other tested data planes sit, starting at 60 to 90 MB and climbing into the 1 to 2 GB range over the same test.

What this means for planning. Configuration size and traffic size are separate budgets. A deployment with heavy LLM traffic but few routes is dominated by the proxy traffic line. A deployment with light traffic but a huge route table is dominated by the control plane and the per-route memory. Size for whichever one your workload actually stresses.

Recommended requests and limits

Solo's published guidance here is a mechanism plus a starting point, not a lookup table keyed by request count. You set the proxy's CPU and memory through the EnterpriseAgentgatewayParameters resource, begin from the documented example, then right-size from what you actually observe and let an autoscaler add replicas as concurrency grows. There is no official per-concurrency sizing table to copy, so treat the values below as starting points, not a ceiling.

Component | Requests | Limits | Notes
Proxy (documented starting point) | cpu 100m, memory 128Mi | cpu 500m, memory 512Mi | The values from Solo's EnterpriseAgentgatewayParameters example. At ~1,000 concurrent LLM users John measured ~150 MB / 100m, which already tops the 128Mi request, so raise the memory request above your own observed working set and lean on the autoscaler for concurrency.
Control plane | cpu 100m, memory 128Mi | cpu 500m, memory up to 1Gi | John's baseline is 100m / 128 MB. Raise the memory limit toward ~1 GB only as you approach roughly 10,000 configured resources.

Memory is the dimension to watch, because several agentgateway features hold it for the life of a request: streaming responses, semantic cache size, custom prompt-guard webhooks, and long multi-turn tool calling all push the working set up. That is why the limit sits well above the steady-state number, and why you size from observed metrics rather than a fixed table.
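One way to turn observed metrics into a request and a limit is sketched below. It is illustrative only: the function names, the 20% request headroom, and the 2x limit factor are hypothetical choices, not Solo guidance. It also assumes the page's MB figures are decimal megabytes, while Kubernetes `Mi` quantities are binary mebibytes.

```python
import math

MB = 1_000_000      # decimal megabyte (assumed meaning of "150 MB" on this page)
MIB = 1024 * 1024   # binary mebibyte, the unit behind Kubernetes "128Mi"


def mb_to_mib(mb: float) -> float:
    """Convert decimal megabytes to mebibytes."""
    return mb * MB / MIB


def suggest_memory(observed_mb: list[float],
                   request_headroom: float = 0.2,
                   limit_factor: float = 2.0) -> tuple[str, str]:
    """Suggest (request, limit) strings from observed working-set samples.

    Headroom and limit factor are hypothetical defaults; tune them to your
    own traffic (streaming, caching, long tool-calling sessions).
    """
    if not observed_mb:
        raise ValueError("need at least one observed sample")
    peak_mib = mb_to_mib(max(observed_mb))
    request_mib = math.ceil(peak_mib * (1 + request_headroom))
    limit_mib = math.ceil(request_mib * limit_factor)
    return f"{request_mib}Mi", f"{limit_mib}Mi"


if __name__ == "__main__":
    print(f"150 MB is about {mb_to_mib(150):.0f} Mi")
    print(suggest_memory([120.0, 135.0, 150.0]))
```

Under these assumptions, a 150 MB working set is about 143 Mi, which is already above a 128Mi request, and the sketch suggests a 172Mi request with a 344Mi limit.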

Set proxy resources via EnterpriseAgentgatewayParameters:

apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayParameters
metadata:
  name: production-config
  namespace: agentgateway-system
spec:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

Scale out for HA: autoscaler + disruption budget

apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayParameters
metadata:
  name: production-ha
  namespace: agentgateway-system
spec:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  horizontalPodAutoscaler:
    spec:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
  podDisruptionBudget:
    spec:
      minAvailable: 1

Even the generous limits here are small on a standard Kubernetes worker node, so the gateway co-locates comfortably and leaves headroom. As concurrent users climb, the autoscaler adds proxy replicas rather than growing one pod. The PodDisruptionBudget keeps at least one replica serving through voluntary disruptions such as node drains; rolling updates are paced by the workload's own update strategy rather than by the PodDisruptionBudget.
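To see when the autoscaler in the manifest above would add replicas, here is a minimal sketch of the standard Kubernetes HorizontalPodAutoscaler calculation: desired replicas = ceil(current replicas × current metric / target metric). The function name is hypothetical, the defaults mirror the manifest (100m request, 80% target, 2 to 10 replicas), and the sketch leaves out the HPA's tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     avg_cpu_millicores: float,
                     cpu_request_millicores: float = 100.0,
                     target_utilization_pct: float = 80.0,
                     min_replicas: int = 2,
                     max_replicas: int = 10) -> int:
    """Simplified HPA math for a CPU Utilization target.

    Utilization is average per-pod usage as a percentage of the pod's CPU
    request, so the request value directly sets the scale-out threshold.
    """
    utilization_pct = 100 * avg_cpu_millicores / cpu_request_millicores
    raw = math.ceil(current_replicas * utilization_pct / target_utilization_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two replicas each averaging 160m against a 100m request.
    print(desired_replicas(current_replicas=2, avg_cpu_millicores=160))
```

With a 100m request and an 80% target, scale-out begins once average per-pod CPU passes about 80m. If you raise the CPU request, the threshold moves with it, so revisit the target whenever you right-size requests.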

Acknowledgment

With thanks

The sizing numbers on this page come from John Howard. The control-plane and LLM-traffic figures are from his agentgateway deployment-sizing guidance; the route-scale comparison draws on his public Gateway API benchmark. Thank you, John.
