What the Semantic Router is, and where agentgateway fits
The vLLM Semantic Router is an open-source routing layer for LLM traffic, from the vLLM project. Instead of every request hitting one fixed model, it inspects each prompt and decides, in real time, which model should answer it: a cheap local model for routine questions, a specialised model or LoRA adapter for a domain like maths or law, a frontier model for genuinely hard reasoning. Along the way it can detect personally identifiable information, intercept jailbreak attempts, and answer semantically similar prompts from a cache.
agentgateway is
the Kubernetes Gateway API proxy that sits in front of your models. It
already terminates the OpenAI-compatible API your clients call,
authenticates them, and applies policy. The integration wires the
Semantic Router into that data path as an External Processing
(ExtProc) service: agentgateway streams each request out to the
router, the router classifies it and rewrites the request body to name
the chosen model, and agentgateway forwards the mutated request to the
backend. The router decides; the gateway enforces. Your application code
never changes — it keeps sending "model": "auto" to a
single endpoint.
Why this matters
Modern deployments don't have a model — they have a fleet. Models now differ on quality, cost, latency, privacy, and modality, and the moment you run more than one, the routing decision is the product. Hard-coding a model name in the client is the thing you regret six months later: you can't shift traffic to a cheaper model, you can't send sensitive prompts to an on-prem model, and every change is a client redeploy. Pushing the decision into a routing layer at the gateway turns all of that into config. Four reasons it's worth doing:
Cost
Most traffic is routine and doesn't need your most expensive model. Route the easy 80% to an efficient local or small model and reserve frontier models for the prompts that actually need them. The project cites research on order-of-magnitude reductions in effective inference cost from exactly this kind of signal-driven routing.
Accuracy
A prompt classified as “maths” can be sent to a maths-tuned model or LoRA adapter; a legal question to a legal one. Matching the request to a specialist beats forcing one generalist to cover every domain — better answers without a bigger, costlier model.
Safety & privacy
The router detects PII and jailbreak / prompt-injection attempts before the request reaches a model or leaves your boundary. Sensitive prompts can be kept on an in-cluster model; obvious attacks can be blocked at the edge rather than relying on the model to refuse them.
Zero client change
Because the routing runs as ExtProc at the gateway, clients keep calling one OpenAI-compatible endpoint with "model": "auto". Teams don't hard-code model names, and you can change routing strategy centrally without touching a single application.
Put differently: the gateway is already the one place every LLM request passes through, already authenticated and observable. That's the natural home for a routing brain. The signals the router extracts on the way through — domain, safety, similarity — are the same signals you'd want for cost control, governance, and audit anyway.
How it works: ExtProc in the request path
ExtProc is the agentgateway mechanism for handing a request out to an external service that can mutate it mid-flight — not just allow or deny it, but rewrite the headers and body before the gateway forwards it on. (It's the same family of extension point covered in the CEL vs OPA vs ext-authz vs ext-proc post; the Semantic Router is a textbook use of it.) Here the external service is the router, reachable over gRPC, and agentgateway is told to buffer the request and response bodies so the router can see and rewrite the prompt.
Figure: The router decides, the gateway enforces. Clients send "model": "auto" to one endpoint and never change.
The key design choice is on the backend object: the
openai.model field is deliberately left unset. That tells
agentgateway to use whatever model name is in the request body —
which, by the time the backend sees it, is the model the router chose.
The router holds the intelligence; the gateway and backend just honour
its decision.
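To make the body rewrite concrete, here is a minimal Python sketch of the mutation step. It is not the router's implementation: the keyword table stands in for the real classifiers, and the names classify, rewrite_body, KEYWORDS, and FALLBACK_MODEL are hypothetical. The adapter names are borrowed from the demo backend used in the deployment steps, and the sketch assumes each message's content is a plain string.

```python
import json

# Hypothetical keyword table standing in for the router's real classifiers.
KEYWORDS = {
    "math-expert": ("derivative", "integral", "equation"),
    "law-expert": ("contract", "liability", "statute"),
}
FALLBACK_MODEL = "base-model"  # assumed fallback for unclassified prompts


def classify(prompt: str) -> str:
    """Return a model name for the prompt using a toy keyword match."""
    text = prompt.lower()
    for model, words in KEYWORDS.items():
        if any(word in text for word in words):
            return model
    return FALLBACK_MODEL


def rewrite_body(raw_body: bytes) -> bytes:
    """Replace "model": "auto" with the chosen model; leave explicit names alone."""
    body = json.loads(raw_body)
    if body.get("model") == "auto":
        # Classify on the most recent user message (an assumption of this sketch).
        last_user = next(
            (m["content"] for m in reversed(body.get("messages", []))
             if m.get("role") == "user"),
            "",
        )
        body["model"] = classify(last_user)
    return json.dumps(body).encode()


request = json.dumps({
    "model": "auto",
    "messages": [{"role": "user", "content": "What is the derivative of f(x) = x^3?"}],
}).encode()
print(json.loads(rewrite_body(request))["model"])  # math-expert
```

Because the backend leaves openai.model unset, the rewritten "model" value is what the serving layer receives.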
Where it earns its keep: enterprise use cases
The pattern is general, but a few scenarios show why teams reach for it. These are illustrative, framed by capability rather than tied to any one organisation.
Specialist routing with data kept in-boundary
An internal assistant fields everything from “reset my password” to regulatory-reporting questions. Routine queries go to a small in-cluster model; compliance and risk questions go to a domain-tuned adapter that answers them well. The router's PII detection flags account numbers and client identifiers, so prompts carrying them are routed to an on-prem model instead of an external frontier API.
Win: better domain answers, and sensitive data never leaves the cluster — both enforced at the gateway, not trusted to each app.
Cutting the bill on a high-volume chat feature
A product's in-app assistant serves millions of messages a month, and the bulk are FAQ-style or short follow-ups that a small model handles perfectly. Before, everything hit a frontier model “to be safe.” With semantic routing, the easy majority drop to a cheap local model and only genuinely complex prompts escalate — with a semantic cache absorbing near-duplicate questions on top.
Win: a large cut in inference spend with no measurable drop in answer quality, and no client release to ship it.
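The arithmetic behind that saving is a weighted average, sketched below in Python. The prices are hypothetical placeholders, not quotes from any provider, and the sketch ignores the semantic cache and the router's own overhead; plug in your own traffic split and prices.

```python
def blended_cost(share_small: float, price_small: float, price_large: float) -> float:
    """Average price per million tokens when share_small of traffic uses the small model."""
    return share_small * price_small + (1.0 - share_small) * price_large


def saving(share_small: float, price_small: float, price_large: float) -> float:
    """Fractional saving versus sending everything to the large model."""
    return 1.0 - blended_cost(share_small, price_small, price_large) / price_large


# Hypothetical prices per million tokens; substitute your own.
PRICE_SMALL = 0.20
PRICE_LARGE = 10.00

print(f"{blended_cost(0.8, PRICE_SMALL, PRICE_LARGE):.2f}")  # 2.16
print(f"{saving(0.8, PRICE_SMALL, PRICE_LARGE):.0%}")        # 78%
```

With these made-up numbers, routing 80% of traffic to the small model cuts the blended price by roughly four fifths; the result is only as good as the assumed split and prices.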
PII / PHI containment as an infrastructure control
Clinicians use a general assistant, but patient identifiers must never reach an external model. The router's token-level PII detection runs on every prompt at the gateway; anything carrying identifiers is kept on an in-cluster model, while de-identified general questions can use a larger external one. The control lives in one place and applies to every client uniformly.
Win: data-residency and PHI rules enforced centrally and auditable, instead of depending on every team's prompt hygiene.
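The containment decision itself is simple once detection has run. The Python sketch below shows its shape with regular expressions standing in for the router's token-level classifier; the patterns, the model names in-cluster-model and external-model, and the function names are all hypothetical.

```python
import re

# Hypothetical patterns; the real router uses a token-level classifier, not regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

IN_CLUSTER_MODEL = "in-cluster-model"  # hypothetical model names
EXTERNAL_MODEL = "external-model"


def detect_pii(prompt: str) -> list[str]:
    """Return the sorted names of the patterns that match the prompt."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(prompt))


def choose_target(prompt: str) -> tuple[str, list[str]]:
    """Keep prompts with identifiers in-cluster; let the rest go external."""
    found = detect_pii(prompt)
    return (IN_CLUSTER_MODEL if found else EXTERNAL_MODEL), found


print(choose_target("Patient MRN: 12345678 has a fever"))
# ('in-cluster-model', ['mrn'])
print(choose_target("What are common causes of fever?"))
# ('external-model', [])
```

Returning the matched categories alongside the target is deliberate: it gives the gateway something to log for audit.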
One endpoint, many models, governed centrally
A central team offers “the LLM endpoint” to dozens of
internal apps. They don't want every team pinning model names or
re-implementing jailbreak filters. Apps call one endpoint with
"model": "auto"; the platform team owns the routing table,
the safety classifiers, and the cost policy in one place — and can
swap models or shift traffic without a single downstream change.
Win: model choice, safety, and cost become platform policy, not scattered application code.
Prerequisites
You need a Kubernetes cluster and the usual tooling. The one version
constraint to note: agentgateway must be recent enough to support the
ExtProc processingOptions and allowModeOverride
fields the router relies on.
- kubectl and Helm
- kind — optional, only if you want a local throwaway cluster
- agentgateway v1.3.0-alpha.1 or newer, required for the ExtProc processingOptions and allowModeOverride fields
Two distributions of agentgateway are in play here, and they use different API groups and kinds:
- Open-source agentgateway: API group agentgateway.dev/v1alpha1, kinds AgentgatewayBackend, AgentgatewayPolicy, AgentgatewayParameters. This is what the vLLM docs and every manifest below use.
- Solo Enterprise for agentgateway: API group enterpriseagentgateway.solo.io/v1alpha1, kinds EnterpriseAgentgatewayBackend, EnterpriseAgentgatewayPolicy, EnterpriseAgentgatewayParameters (the family used in Eight enterprise controls for MCP traffic).
The apiVersion and kind differ between the two, and field availability can vary by version: the processingOptions and allowModeOverride fields used here landed in the v1.3.0-alpha.1 line. If you're on enterprise, translate the kinds and confirm the fields against your installed CRDs before applying.
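If you have many manifests to translate, the rename can be scripted. The Python sketch below walks a parsed manifest (for example the output of a YAML loader) and rewrites the API group and kinds listed above, including the group and kind inside backendRefs. It assumes the enterprise kinds are a drop-in rename, which is exactly what you still need to confirm against your installed CRDs; to_enterprise is a hypothetical helper, not part of either distribution.

```python
import copy

OSS_GROUP = "agentgateway.dev"
ENTERPRISE_GROUP = "enterpriseagentgateway.solo.io"
OSS_KINDS = {"AgentgatewayBackend", "AgentgatewayPolicy", "AgentgatewayParameters"}


def to_enterprise(node):
    """Return a translated deep copy of a parsed manifest (dicts, lists, scalars)."""
    if isinstance(node, list):
        return [to_enterprise(item) for item in node]
    if not isinstance(node, dict):
        return copy.deepcopy(node)
    out = {key: to_enterprise(value) for key, value in node.items()}
    api_version = out.get("apiVersion")
    if isinstance(api_version, str) and api_version.startswith(OSS_GROUP + "/"):
        # Keep the version suffix, swap only the group.
        out["apiVersion"] = ENTERPRISE_GROUP + api_version[len(OSS_GROUP):]
    if out.get("group") == OSS_GROUP:
        out["group"] = ENTERPRISE_GROUP
    kind = out.get("kind")
    if isinstance(kind, str) and kind in OSS_KINDS:
        out["kind"] = "Enterprise" + kind
    return out


backend = {
    "apiVersion": "agentgateway.dev/v1alpha1",
    "kind": "AgentgatewayBackend",
    "metadata": {"name": "semantic-router-vllm"},
}
translated = to_enterprise(backend)
print(translated["apiVersion"])  # enterpriseagentgateway.solo.io/v1alpha1
print(translated["kind"])        # EnterpriseAgentgatewayBackend
```

Standard Gateway API objects such as Gateway and HTTPRoute pass through untouched; only references into the agentgateway group are rewritten.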
Deploy it, step by step
Seven steps: a cluster, agentgateway, a gateway proxy, a demo model, the router, the routing resources, and the ExtProc attachment. Commands and manifests below mirror the official docs — reach for those if a version has moved on.
Create a local cluster (optional)
Skip if you already have a cluster.
kind create cluster --name semantic-router-agentgateway
kubectl wait --for=condition=Ready nodes --all --timeout=300s
Install agentgateway
Install the Gateway API CRDs, then the agentgateway CRDs and controller. Experimental Gateway API features are enabled because ExtProc rides on them.
export AGENTGATEWAY_VERSION=v1.3.0-alpha.1
kubectl apply --server-side --force-conflicts \
-f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.0/standard-install.yaml
helm upgrade -i agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds \
--create-namespace \
--namespace agentgateway-system \
--version "${AGENTGATEWAY_VERSION}" \
--set controller.image.pullPolicy=Always
helm upgrade -i agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
--namespace agentgateway-system \
--version "${AGENTGATEWAY_VERSION}" \
--set controller.image.pullPolicy=Always \
--set controller.extraEnv.KGW_ENABLE_GATEWAY_API_EXPERIMENTAL_FEATURES=true \
--wait
kubectl get pods -n agentgateway-system
Create the gateway proxy
A standard Gateway API Gateway on the agentgateway class, listening on port 80.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: agentgateway-proxy
  namespace: agentgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
  - protocol: HTTP
    port: 80
    name: http
    allowedRoutes:
      namespaces:
        from: All
kubectl apply -f gateway.yaml
kubectl wait --for=condition=Available deployment/agentgateway-proxy \
-n agentgateway-system --timeout=300s
Deploy a demo vLLM-compatible backend
An OpenAI-compatible simulator serving a base model plus six LoRA adapters (math, science, social, humanities, law, general) — the specialists the router will route to. The full Deployment + Service is in the docs; the shape is:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llama3-8b-instruct }
  template:
    metadata:
      labels: { app: vllm-llama3-8b-instruct }
    spec:
      containers:
      - name: vllm-sim
        image: ghcr.io/llm-d/llm-d-inference-sim:v0.5.0
        args:
        - --model
        - base-model
        - --port
        - "8000"
        - --max-loras
        - "6"
        - --lora-modules
        - '{"name": "math-expert"}'
        - '{"name": "science-expert"}'
        - '{"name": "social-expert"}'
        - '{"name": "humanities-expert"}'
        - '{"name": "law-expert"}'
        - '{"name": "general-expert"}'
        ports:
        - { containerPort: 8000, name: http }
        readinessProbe:
          httpGet: { path: /health, port: http }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
spec:
  selector: { app: vllm-llama3-8b-instruct }
  ports:
  - { port: 8000, targetPort: 8000 }
kubectl wait --for=condition=Available deployment/vllm-llama3-8b-instruct \
-n default --timeout=300s
Deploy the Semantic Router
Install the router via Helm with the agentgateway-specific values file, which points it at the demo backend and configures the LoRA adapter selection.
helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
--version v0.0.0-latest \
--namespace agentgateway-system \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/agentgateway/semantic-router-values/values.yaml
kubectl wait --for=condition=Available deployment/semantic-router \
-n agentgateway-system --timeout=600s
Alternatively, deploy the router with the Semantic Router operator, which manages it through a SemanticRouter custom resource (vllm.ai/v1alpha1). It reconciles the Deployment, the Service (gRPC 50051, HTTP 8080, metrics 9190), the ConfigMap, a PVC for model storage, an optional HorizontalPodAutoscaler, and RBAC, and it auto-detects OpenShift vs vanilla Kubernetes to set security contexts. Reach for it when you're on OpenShift or want CRD-driven lifecycle management rather than Helm. Install it and apply a router instance:
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator
make install
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: semantic-router
  namespace: agentgateway-system
spec:
  replicas: 2
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
  vllmEndpoints:
  - name: llama3-8b-endpoint
    model: llama3-8b
    backend:
      type: kserve
      inferenceServiceName: llama-3-8b
    weight: 1
  resources:
    requests: { memory: "3Gi", cpu: "1" }
    limits: { memory: "7Gi", cpu: "2" }
This swaps out only the router deployment: Steps 1–4 and 6–7 are unchanged. Two things to reconcile: point vllmEndpoints at the backend you're actually serving (the CR's routing config differs from the agentgateway-specific Helm values used above), and make sure the ExtProc backendRef in Step 7 (the AgentgatewayPolicy) targets the Service name, namespace, and gRPC port (50051) that the operator creates from this CR.
Create the routing resources
An AgentgatewayBackend pointing at the vLLM service, and an HTTPRoute binding it to the gateway. Note the backend omits openai.model on purpose — so the model name the router writes into the request body is the one that's used.
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai: {} # model intentionally omitted
      host: vllm-llama3-8b-instruct.default.svc.cluster.local
      port: 8000
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: semantic-router-vllm
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
Attach the Semantic Router as ExtProc
An AgentgatewayPolicy targeting the gateway, sending request and response bodies to the router (buffered) and allowing it to override the processing mode. The processingOptions and allowModeOverride fields are the reason this integration needs agentgateway v1.3.0-alpha.1 or newer. For large prompts, switch requestBodyMode to FullDuplexStreamed with the matching router config.
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: semantic-router-extproc
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    extProc:
      backendRef:
        name: semantic-router
        namespace: agentgateway-system
        port: 50051
      processingOptions:
        requestHeaderMode: Send
        requestBodyMode: Buffered
        responseHeaderMode: Send
        responseBodyMode: Buffered
        # Nesting of allowModeOverride is assumed here; confirm with
        # kubectl explain agentgatewaypolicy.spec.traffic.extProc
        allowModeOverride: true
Send a request through it
Port-forward the gateway and send an OpenAI-style request with
"model": "auto". The router classifies the maths prompt,
selects the maths route, and rewrites the request before agentgateway
forwards it — the client never names a concrete model.
# Terminal 1: keep the port-forward running (it stays in the foreground)
kubectl port-forward -n agentgateway-system svc/agentgateway-proxy 8080:80
# Terminal 2: send the request
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
    ],
    "max_tokens": 64,
    "temperature": 0
  }'
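The same request from application code looks like this. The sketch uses only the Python standard library and the port-forwarded URL from the command above; build_request and served_model are hypothetical helper names. It also assumes the backend reports the model it served in the response's "model" field, as OpenAI-compatible servers typically do.

```python
import json
import urllib.request

# Matches the port-forward above; adjust for a real gateway address.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt: str, url: str = GATEWAY_URL) -> urllib.request.Request:
    """Build the chat-completions request; the client only ever names "auto"."""
    payload = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def served_model(prompt: str) -> str:
    """Send the prompt and return the "model" field of the JSON response."""
    with urllib.request.urlopen(build_request(prompt), timeout=30) as resp:
        return json.load(resp)["model"]
```

With the port-forward running, calling served_model("What is the derivative of f(x) = x^3?") is a quick way to see which model name came back, without the application ever naming one.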
Troubleshooting
If a request doesn't route as expected, walk the path from the gateway to the router to the backend.
# Gateway accepted and programmed?
kubectl get gateway agentgateway-proxy -n agentgateway-system
kubectl logs -n agentgateway-system deployment/agentgateway
# Routing resources accepted?
kubectl describe httproute semantic-router-vllm -n agentgateway-system
kubectl describe agentgatewaybackend semantic-router-vllm -n agentgateway-system
# ExtProc wired and the router healthy?
kubectl get svc semantic-router -n agentgateway-system
kubectl logs -n agentgateway-system deployment/semantic-router
kubectl describe agentgatewaypolicy semantic-router-extproc -n agentgateway-system
# Demo backend serving?
kubectl get pods -n default -l app=vllm-llama3-8b-instruct
kubectl logs -n default deployment/vllm-llama3-8b-instruct
Takeaways for production
- The router is the brain, the gateway is the enforcement point. Keep routing logic, classifiers, and cost policy in the router; let agentgateway own auth, observability, and forwarding. Clean separation, one endpoint for clients.
- Leave openai.model unset on the backend so the router's choice wins. Setting it pins every request to one model and defeats the point.
- Mind the buffering. Buffered body mode is simplest, but for large prompts use FullDuplexStreamed so you don't hold whole requests in memory or add latency.
- Treat PII and jailbreak detection as policy, not advice. The value is that these run at the gateway for every client uniformly, so wire the actions (block, reroute to an in-cluster model) deliberately, and log them for audit.
- Pin versions. ExtProc processingOptions / allowModeOverride need a recent agentgateway; confirm your distribution's CRDs before promoting beyond a test cluster.
Working with LoRA adapters
This guide leans on LoRA adapters throughout, so here is the practical
detail in one place: what they are, where to get them, how to load them,
and how the router consumes them. One framing first — adapters
are a backend (vLLM) concern. agentgateway and the Semantic Router
only ever pass a model-name string; the adapters themselves live entirely
in the serving layer behind the AgentgatewayBackend.
The demo backend used in this guide (ghcr.io/llm-d/llm-d-inference-sim) is a simulator: the --lora-modules args just declare names it pretends to serve, with no weights and no training. For a real deployment you replace the simulator with actual vLLM serving real adapter files. Everything below assumes that.
What a LoRA adapter is
LoRA (Low-Rank Adaptation) is a lightweight fine-tuning technique. Rather than train a whole separate model per domain, you train a small set of extra weights — the adapter — that layers on top of one shared base model. The base model is loaded into GPU memory once; many adapters attach to it and can be swapped per request. That's how a single vLLM backend serves a base model plus several domain “experts” without the cost of several full models. Adapters are small — megabytes, not gigabytes.
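A quick count shows why adapters are so small. For one weight matrix of shape d_out × d_in, a full fine-tune updates d_out × d_in values, while a rank-r LoRA update stores two thin matrices with r × (d_out + d_in) values in total. The Python sketch below runs the numbers for an illustrative 4096 × 4096 projection at rank 16; both figures are assumptions chosen for the example, not properties of any particular adapter.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters touched when fine-tuning the whole matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors (d_out x rank and rank x d_in)."""
    return rank * (d_out + d_in)


d = 4096   # illustrative hidden size
rank = 16  # illustrative LoRA rank
full = full_update_params(d, d)
lora = lora_params(d, d, rank)
print(full, lora, full // lora)  # 16777216 131072 128
```

For this one matrix the adapter is 128 times smaller than a full update, and the same ratio holds for every matrix of that shape the adapter targets.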
Where to get the adapters
- Train your own. LoRA fine-tune the base model on your domain data (maths, legal, support transcripts, …). The output is a small directory containing adapter_config.json and adapter_model.safetensors.
- Download pre-trained ones. Hubs like Hugging Face host ready-made LoRA adapters.
The hard requirement either way: an adapter must be trained against the same base model you're serving. Otherwise vLLM will typically refuse to load it, or load it and give degraded answers.
How to load them in vLLM
Serve the base model with LoRA enabled, then register each adapter either statically at startup or dynamically at runtime.
Static — declared on the serve command. Each entry is name=path, where the path is a local directory or a model-hub repo:
vllm serve base-model \
--enable-lora \
--max-loras 6 --max-lora-rank 16 \
--lora-modules \
math-expert=/models/math-lora \
law-expert=/models/law-lora \
science-expert=/models/science-lora
Dynamic — load and unload without a restart. Set VLLM_ALLOW_RUNTIME_LORA_UPDATING=true on the server, then call the admin endpoints:
curl -X POST http://$BACKEND:8000/v1/load_lora_adapter \
-H 'Content-Type: application/json' \
-d '{"lora_name": "math-expert", "lora_path": "/models/math-lora"}'
curl -X POST http://$BACKEND:8000/v1/unload_lora_adapter \
-H 'Content-Type: application/json' \
-d '{"lora_name": "math-expert"}'
How to discover what's loaded
vLLM is OpenAI-compatible, so GET /v1/models lists the base
model and every loaded adapter as its own entry (each adapter
references the base model as its parent). This is the canonical answer to
“what can I route to right now” — for an operator, a
client, or for sanity-checking the router's config:
curl http://$BACKEND:8000/v1/models
# → base-model, math-expert, law-expert, science-expert, …
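If you want that answer in a script rather than by eye, the response is easy to split into base models and adapters. The Python sketch below assumes the usual OpenAI-style list shape with a "parent" field on adapter entries, as described above; the sample payload is a trimmed, hypothetical response, and split_models is an illustrative helper name.

```python
import json


def split_models(models_json: str) -> tuple[list[str], dict[str, str]]:
    """Return (base model ids, {adapter id: parent id}) from a /v1/models body."""
    data = json.loads(models_json).get("data", [])
    bases = [m["id"] for m in data if not m.get("parent")]
    adapters = {m["id"]: m["parent"] for m in data if m.get("parent")}
    return bases, adapters


# Trimmed, hypothetical response body for illustration.
SAMPLE = """
{"object": "list", "data": [
  {"id": "base-model", "object": "model", "parent": null},
  {"id": "math-expert", "object": "model", "parent": "base-model"},
  {"id": "law-expert", "object": "model", "parent": "base-model"}
]}
"""

bases, adapters = split_models(SAMPLE)
print(bases)     # ['base-model']
print(adapters)  # {'math-expert': 'base-model', 'law-expert': 'base-model'}
```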
How they're consumed here
Once loaded, an adapter is addressed exactly like a model: put
"model": "math-expert" in the request body and vLLM applies
that adapter on top of the base model for the request. In this
architecture the client never does that — it sends
"model": "auto", and the Semantic Router writes the chosen
adapter name into the body before agentgateway forwards it. So
consuming the adapter is the router's job; your job is to
load the adapters in vLLM and map each domain category to its adapter name
in the router's configuration.
The one thing that must line up is the adapter name, which has to be identical in three places:
- how vLLM loaded it (--lora-modules or the load call),
- what the Semantic Router config emits for that category,
- what GET /v1/models reports (which is just the first item surfaced).
If the router emits math-expert but vLLM loaded it as math, the backend returns model not found. The base model name is part of this too: it's the fallback target for traffic the router doesn't classify into a specialist.
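A name mismatch like this is cheap to catch before traffic hits it. The Python sketch below compares the names the router would emit against the names the backend serves. The router_targets mapping is a hypothetical flattened view of the router's configuration, not its real schema, and served would come from GET /v1/models.

```python
def missing_targets(router_targets: dict[str, str], served: set[str]) -> dict[str, str]:
    """Return {category: model name} for names the backend does not serve."""
    return {
        category: name
        for category, name in router_targets.items()
        if name not in served
    }


# Hypothetical view of what the router emits per category.
router_targets = {
    "math": "math-expert",
    "law": "law-expert",
    "default": "base-model",  # fallback for unclassified traffic
}
# Names the backend reports; note the adapter loaded as "math" by mistake.
served = {"base-model", "math", "law-expert"}

print(missing_targets(router_targets, served))  # {'math': 'math-expert'}
```

An empty result means every category, including the fallback, points at something the backend can actually serve.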
The full, always-current manifests live in the official guide: vllm-semantic-router.com/docs/installation/k8s/agentgateway. For where this sits alongside identity, authorization, and governance for agentic traffic, see Eight enterprise controls for MCP traffic in agentgateway policy and Securing MCP and agentic systems.