Intelligent LLM routing: vLLM Semantic Router on agentgateway, by Tom O'Rourke

What the Semantic Router is, and where agentgateway fits

The vLLM Semantic Router is an open-source routing layer for LLM traffic, from the vLLM project. Instead of every request hitting one fixed model, it inspects each prompt and decides, in real time, which model should answer it: a cheap local model for routine questions, a specialised model or LoRA adapter for a domain like maths or law, a frontier model for genuinely hard reasoning. Along the way it can detect personally identifiable information, intercept jailbreak attempts, and serve cached answers to semantically similar prompts.

What's a LoRA adapter? LoRA (Low-Rank Adaptation) is a lightweight way to specialise a model. Rather than fine-tuning a whole separate model for each domain — which would mean loading several large models — you train a small set of extra weights, the adapter, that layers on top of one shared base model. The base model is loaded once, and many adapters (maths, law, science…) can be attached cheaply and switched per request. That's how a single vLLM backend in this guide serves a base model plus six domain “experts” without running six full models — and why routing to the right adapter is so much cheaper than routing to the right standalone model. For how to obtain, load, and consume them, see Working with LoRA adapters at the foot of this guide.

agentgateway is the Kubernetes Gateway API proxy that sits in front of your models. It already terminates the OpenAI-compatible API your clients call, authenticates them, and applies policy. The integration wires the Semantic Router into that data path as an External Processing (ExtProc) service: agentgateway streams each request out to the router, the router classifies it and rewrites the request body to name the chosen model, and agentgateway forwards the mutated request to the backend. The router decides; the gateway enforces. Your application code never changes — it keeps sending "model": "auto" to a single endpoint.

Source of truth. This guide follows the official installation walkthrough (vllm-semantic-router.com/docs/installation/k8s/agentgateway), condensing it and adding the “why” and the enterprise framing around it. Treat the upstream doc as canonical for exact versions and manifests.

Why this matters

Modern deployments don't have a model — they have a fleet. Models now differ on quality, cost, latency, privacy, and modality, and the moment you run more than one, the routing decision is the product. Hard-coding a model name in the client is the thing you regret six months later: you can't shift traffic to a cheaper model, you can't send sensitive prompts to an on-prem model, and every change is a client redeploy. Pushing the decision into a routing layer at the gateway turns all of that into config. Four reasons it's worth doing:

Cost

Most traffic is routine and doesn't need your most expensive model. Route the easy 80% to an efficient local or small model and reserve frontier models for the prompts that actually need them. The project cites research on order-of-magnitude reductions in effective inference cost from exactly this kind of signal-driven routing.
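To make the arithmetic concrete, here is a minimal sketch of blended per-request cost. The prices and the 80/20 split are made-up illustration values, not figures from the project or from any provider.

```python
def blended_cost(cheap_share: float, cheap_cost: float, frontier_cost: float) -> float:
    """Average cost per request when cheap_share of traffic goes to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_cost + (1.0 - cheap_share) * frontier_cost


# Hypothetical prices in arbitrary units per request (assumptions, not quotes).
FRONTIER_COST = 10.0
CHEAP_COST = 0.5

baseline = blended_cost(0.0, CHEAP_COST, FRONTIER_COST)  # everything on the frontier model
routed = blended_cost(0.8, CHEAP_COST, FRONTIER_COST)    # easy 80% on the cheap model
savings = 1.0 - routed / baseline                        # fraction of spend removed
```

With these assumed numbers the blended cost falls from 10.0 to about 2.4 units per request, a cut of roughly 76%. The frontier share dominates what is left, so the achievable saving depends mostly on how much traffic can safely stay on the cheap model.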

Accuracy

A prompt classified as “maths” can be sent to a maths-tuned model or LoRA adapter; a legal question to a legal one. Matching the request to a specialist beats forcing one generalist to cover every domain — better answers without a bigger, costlier model.
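As a rough sketch, the specialist lookup amounts to a table from category to model name with a generalist fallback. The adapter names below match the demo backend used later in this guide; the category labels and the function name are illustrative assumptions, not the router's actual configuration schema.

```python
from typing import Optional

# Adapter names follow the demo backend in this guide; the category labels
# and the mapping itself are illustrative assumptions.
BASE_MODEL = "base-model"
CATEGORY_TO_MODEL = {
    "math": "math-expert",
    "law": "law-expert",
    "science": "science-expert",
}


def select_model(category: Optional[str]) -> str:
    """Return the specialist for a category, or the base model as the fallback."""
    if category is None:
        return BASE_MODEL
    return CATEGORY_TO_MODEL.get(category.lower(), BASE_MODEL)
```

A prompt the classifier cannot place in a known category falls through to the base model rather than failing, which keeps unclassified traffic flowing.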

Safety & privacy

The router detects PII and jailbreak / prompt-injection attempts before the request reaches a model or leaves your boundary. Sensitive prompts can be kept on an in-cluster model; obvious attacks can be blocked at the edge rather than relying on the model to refuse them.

Zero client change

Because the routing runs as ExtProc at the gateway, clients keep calling one OpenAI-compatible endpoint with "model": "auto". Teams don't hard-code model names, and you can change routing strategy centrally without touching a single application.

Put differently: the gateway is already the one place every LLM request passes through, already authenticated and observable. That's the natural home for a routing brain. The signals the router extracts on the way through — domain, safety, similarity — are the same signals you'd want for cost control, governance, and audit anyway.

Domain classification · Jailbreak detection · PII detection · Semantic cache · Model / LoRA selection · Reasoning-mode control

How it works: ExtProc in the request path

ExtProc is the agentgateway mechanism for handing a request out to an external service that can mutate it mid-flight — not just allow or deny it, but rewrite the headers and body before the gateway forwards it on. (It's the same family of extension point covered in the CEL vs OPA vs ext-authz vs ext-proc post; the Semantic Router is a textbook use of it.) Here the external service is the router, reachable over gRPC, and agentgateway is told to buffer the request and response bodies so the router can see and rewrite the prompt.

The router decides, the gateway enforces — clients send model: auto to one endpoint and never change.

The key design choice is on the backend object: the openai.model field is deliberately left unset. That tells agentgateway to use whatever model name is in the request body — which, by the time the backend sees it, is the model the router chose. The router holds the intelligence; the gateway and backend just honour its decision.
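The two halves of that contract can be sketched in a few lines. This is a simplified model of the behaviour described above, not the router's or the gateway's code: the function names are hypothetical, the real exchange happens over gRPC ExtProc messages that are not shown, and rewriting only when the client sent "auto" is an assumption.

```python
import json
from typing import Optional


def rewrite_model(body: bytes, chosen_model: str) -> bytes:
    """Router side: replace the "auto" placeholder with the chosen model name."""
    payload = json.loads(body)
    if payload.get("model") == "auto":  # assumption: explicit names pass through
        payload["model"] = chosen_model
    return json.dumps(payload).encode("utf-8")


def effective_model(backend_model: Optional[str], body: bytes) -> str:
    """Gateway side: a model pinned on the backend wins; otherwise use the body."""
    if backend_model:
        return backend_model
    return json.loads(body)["model"]


original = json.dumps(
    {"model": "auto", "messages": [{"role": "user", "content": "hi"}]}
).encode("utf-8")
mutated = rewrite_model(original, "math-expert")
```

With the backend model unset, `effective_model(None, mutated)` yields the router's choice; with a pinned value such as `"base-model"`, the pin wins and the router's decision is discarded.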

Where it earns its keep: enterprise use cases

The pattern is general, but a few scenarios show why teams reach for it. These are illustrative, framed by capability rather than tied to any one organisation.

Financial services

Specialist routing with data kept in-boundary

An internal assistant fields everything from “reset my password” to regulatory-reporting questions. Routine queries go to a small in-cluster model; compliance and risk questions go to a domain-tuned adapter that answers them well. The router's PII detection flags account numbers and client identifiers, so prompts carrying them are routed to an on-prem model instead of an external frontier API.

Win: better domain answers, and sensitive data never leaves the cluster — both enforced at the gateway, not trusted to each app.
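The routing rule in this scenario can be sketched as a single decision function. The regex here is a deliberately crude stand-in for the router's own PII classifier, and the destination labels are hypothetical names chosen for illustration.

```python
import re

# Stand-in detector: 8 to 12 consecutive digits, as a crude proxy for an
# account number. The real router uses its own PII classifier, not this regex.
ACCOUNT_LIKE = re.compile(r"\b\d{8,12}\b")


def choose_destination(prompt: str) -> str:
    """Keep prompts with account-like numbers in-cluster; others may go external."""
    return "in-cluster" if ACCOUNT_LIKE.search(prompt) else "external"
```

The point of the sketch is where the decision is made: once, in the request path, rather than in each application.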

SaaS platform

Cutting the bill on a high-volume chat feature

A product's in-app assistant serves millions of messages a month, and the bulk are FAQ-style or short follow-ups that a small model handles perfectly. Before, everything hit a frontier model “to be safe.” With semantic routing, the easy majority drop to a cheap local model and only genuinely complex prompts escalate — with a semantic cache absorbing near-duplicate questions on top.

Win: a large cut in inference spend with no measurable drop in answer quality, and no client release to ship it.
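To show what "absorbing near-duplicate questions" means mechanically, here is a toy semantic cache. It uses word counts as a stand-in embedding and cosine similarity with an arbitrary 0.8 threshold; the real router would use a proper embedding model, and its cache internals are not described in this guide, so treat every name below as hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real cache would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, prompt: str, answer: str) -> None:
        self.entries.append((embed(prompt), answer))

    def get(self, prompt: str) -> Optional[str]:
        """Return the answer of the most similar stored prompt, if similar enough."""
        query = embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("How do I reset my password?", "Use the reset link on the sign-in page.")
hit = cache.get("How can I reset my password?")   # near-duplicate wording
miss = cache.get("What is the refund policy?")    # unrelated question
```

The near-duplicate question is served from the cache while the unrelated one falls through to a model. The threshold is the knob that trades hit rate against the risk of returning an answer to a question that only looks similar.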

Healthcare

PII / PHI containment as an infrastructure control

Clinicians use a general assistant, but patient identifiers must never reach an external model. The router's token-level PII detection runs on every prompt at the gateway; anything carrying identifiers is kept on an in-cluster model, while de-identified general questions can use a larger external one. The control lives in one place and applies to every client uniformly.

Win: data-residency and PHI rules enforced centrally and auditable, instead of depending on every team's prompt hygiene.

Platform / AI gateway team

One endpoint, many models, governed centrally

A central team offers “the LLM endpoint” to dozens of internal apps. They don't want every team pinning model names or re-implementing jailbreak filters. Apps call one endpoint with "model": "auto"; the platform team owns the routing table, the safety classifiers, and the cost policy in one place — and can swap models or shift traffic without a single downstream change.

Win: model choice, safety, and cost become platform policy, not scattered application code.

Prerequisites

You need a Kubernetes cluster and the usual tooling. The one version constraint to note: agentgateway must be recent enough to support the ExtProc processingOptions and allowModeOverride fields the router relies on.

kubectl and Helm
kind — optional, only if you want a local throwaway cluster
agentgateway v1.3.0-alpha.1 or newer — required for ExtProc processingOptions + allowModeOverride
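A quick way to reason about "v1.3.0-alpha.1 or newer" is Semantic Versioning precedence, where a pre-release sorts before its final release. The sketch below assumes agentgateway tags follow that scheme; the helper names are hypothetical and it is not part of any agentgateway tooling.

```python
MINIMUM = "v1.3.0-alpha.1"


def version_key(version: str):
    """Build a sortable key following Semantic Versioning precedence rules."""
    version = version.split("+")[0]  # build metadata does not affect ordering
    core, _, prerelease = version.lstrip("v").partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    # Numeric identifiers sort numerically and before alphanumeric ones.
    identifiers = tuple(
        (0, int(item), "") if item.isdigit() else (1, 0, item)
        for item in prerelease.split(".")
    ) if prerelease else ()
    # A final release (flag 1) outranks any pre-release (flag 0) of the same core.
    return (major, minor, patch, 0 if prerelease else 1, identifiers)


def supports_extproc_options(version: str) -> bool:
    """True if the version is at or above the minimum this guide requires."""
    return version_key(version) >= version_key(MINIMUM)
```

For example, `supports_extproc_options("v1.3.0")` is true and `supports_extproc_options("v1.2.9")` is false. The installed CRDs remain the real authority on which fields exist.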

Upstream vs enterprise CRDs. agentgateway ships two CRD families, and they are not interchangeable by name:

Open-source agentgateway — API group agentgateway.dev/v1alpha1, kinds AgentgatewayBackend, AgentgatewayPolicy, AgentgatewayParameters. This is what the vLLM docs and every manifest below use.
Solo Enterprise for agentgateway — API group enterpriseagentgateway.solo.io/v1alpha1, kinds EnterpriseAgentgatewayBackend, EnterpriseAgentgatewayPolicy, EnterpriseAgentgatewayParameters (the family used in Eight enterprise controls for MCP traffic).

The ExtProc routing pattern is the same on both, but the apiVersion and kind differ, and field availability can vary by version — the processingOptions/allowModeOverride fields used here landed in the v1.3.0-alpha.1 line. If you're on enterprise, translate the kinds and confirm the fields against your installed CRDs before applying.
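The renaming half of that translation is mechanical and can be sketched as below, using only the groups and kinds listed above. It deliberately leaves the spec untouched, because field availability is exactly the part that needs checking by hand; the function name is hypothetical.

```python
import copy

UPSTREAM_GROUP = "agentgateway.dev"
ENTERPRISE_GROUP = "enterpriseagentgateway.solo.io"
KIND_MAP = {
    "AgentgatewayBackend": "EnterpriseAgentgatewayBackend",
    "AgentgatewayPolicy": "EnterpriseAgentgatewayPolicy",
    "AgentgatewayParameters": "EnterpriseAgentgatewayParameters",
}


def to_enterprise(manifest: dict) -> dict:
    """Return a copy with upstream apiVersion/kind renamed; the spec is untouched."""
    translated = copy.deepcopy(manifest)
    group, _, version = manifest.get("apiVersion", "").partition("/")
    if group == UPSTREAM_GROUP and manifest.get("kind") in KIND_MAP:
        translated["apiVersion"] = f"{ENTERPRISE_GROUP}/{version}"
        translated["kind"] = KIND_MAP[manifest["kind"]]
    return translated
```

This sketch only rewrites the top-level identity of a resource. References to it from other objects, such as the group and kind in an HTTPRoute backendRef, would need the same change and are not handled here.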

Deploy it, step by step

Seven steps: a cluster, agentgateway, a gateway proxy, a demo model, the router, the routing resources, and the ExtProc attachment. Commands and manifests below mirror the official docs — reach for those if a version has moved on.

Step 1: Create a local cluster (optional)

Skip if you already have a cluster.

kind create cluster --name semantic-router-agentgateway
kubectl wait --for=condition=Ready nodes --all --timeout=300s

Step 2: Install agentgateway

Install the Gateway API CRDs, then the agentgateway CRDs and controller. Experimental Gateway API features are enabled because ExtProc rides on them.

export AGENTGATEWAY_VERSION=v1.3.0-alpha.1

kubectl apply --server-side --force-conflicts \
  -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.0/standard-install.yaml

helm upgrade -i agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds \
  --create-namespace \
  --namespace agentgateway-system \
  --version "${AGENTGATEWAY_VERSION}" \
  --set controller.image.pullPolicy=Always

helm upgrade -i agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
  --namespace agentgateway-system \
  --version "${AGENTGATEWAY_VERSION}" \
  --set controller.image.pullPolicy=Always \
  --set controller.extraEnv.KGW_ENABLE_GATEWAY_API_EXPERIMENTAL_FEATURES=true \
  --wait

kubectl get pods -n agentgateway-system

Step 3: Create the gateway proxy

A standard Gateway API Gateway on the agentgateway class, listening on port 80.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: agentgateway-proxy
  namespace: agentgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
  - protocol: HTTP
    port: 80
    name: http
    allowedRoutes:
      namespaces:
        from: All

kubectl apply -f gateway.yaml
kubectl wait --for=condition=Available deployment/agentgateway-proxy \
  -n agentgateway-system --timeout=300s

Step 4: Deploy a demo vLLM-compatible backend

An OpenAI-compatible simulator serving a base model plus six LoRA adapters (math, science, social, humanities, law, general) — the specialists the router will route to. The full Deployment + Service is in the docs; the shape is:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llama3-8b-instruct }
  template:
    metadata:
      labels: { app: vllm-llama3-8b-instruct }
    spec:
      containers:
      - name: vllm-sim
        image: ghcr.io/llm-d/llm-d-inference-sim:v0.5.0
        args:
        - --model
        - base-model
        - --port
        - "8000"
        - --max-loras
        - "6"
        - --lora-modules
        - '{"name": "math-expert"}'
        - '{"name": "science-expert"}'
        - '{"name": "social-expert"}'
        - '{"name": "humanities-expert"}'
        - '{"name": "law-expert"}'
        - '{"name": "general-expert"}'
        ports:
        - { containerPort: 8000, name: http }
        readinessProbe:
          httpGet: { path: /health, port: http }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
spec:
  selector: { app: vllm-llama3-8b-instruct }
  ports:
  - { port: 8000, targetPort: 8000 }

kubectl wait --for=condition=Available deployment/vllm-llama3-8b-instruct \
  -n default --timeout=300s

Step 5: Deploy the Semantic Router

Install the router via Helm with the agentgateway-specific values file, which points it at the demo backend and configures the LoRA adapter selection.

helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
  --version v0.0.0-latest \
  --namespace agentgateway-system \
  -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/agentgateway/semantic-router-values/values.yaml

kubectl wait --for=condition=Available deployment/semantic-router \
  -n agentgateway-system --timeout=600s

Alternative to Step 5 — the Semantic Router Operator. Instead of the Helm chart, the project ships a Kubernetes operator that manages the router declaratively through a SemanticRouter custom resource (vllm.ai/v1alpha1). It reconciles the Deployment, the Service (gRPC 50051, HTTP 8080, metrics 9190), the ConfigMap, a PVC for model storage, an optional HorizontalPodAutoscaler, and RBAC — and it auto-detects OpenShift vs vanilla Kubernetes to set security contexts. Reach for it when you're on OpenShift or want CRD-driven lifecycle management rather than Helm. Install it and apply a router instance:

git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator
make install
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: semantic-router
  namespace: agentgateway-system
spec:
  replicas: 2
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
  vllmEndpoints:
    - name: llama3-8b-endpoint
      model: llama3-8b
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 1
  resources:
    requests: { memory: "3Gi", cpu: "1" }
    limits:   { memory: "7Gi", cpu: "2" }

This swaps out only the router deployment — Steps 1–4 and 6–7 are unchanged. Two things to reconcile: point vllmEndpoints at the backend you're actually serving (the CR's routing config differs from the agentgateway-specific Helm values used above), and make sure the ExtProc backendRef in Step 7 targets the Service name, namespace, and gRPC port (50051) that the operator creates from this CR.

Step 6: Create the routing resources

An AgentgatewayBackend pointing at the vLLM service, and an HTTPRoute binding it to the gateway. Note the backend omits openai.model on purpose — so the model name the router writes into the request body is the one that's used.

apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai: {}          # model intentionally omitted
      host: vllm-llama3-8b-instruct.default.svc.cluster.local
      port: 8000
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: semantic-router-vllm
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend

Step 7: Attach the Semantic Router as ExtProc

An AgentgatewayPolicy targeting the gateway, sending request and response bodies to the router (buffered) and allowing it to override the processing mode. The processingOptions and allowModeOverride fields are the reason this integration needs agentgateway v1.3.0-alpha.1 or newer. For large prompts, switch requestBodyMode to FullDuplexStreamed with the matching router config.

apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: semantic-router-extproc
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    extProc:
      backendRef:
        name: semantic-router
        namespace: agentgateway-system
        port: 50051
      processingOptions:
        requestHeaderMode: Send
        requestBodyMode: Buffered
        responseHeaderMode: Send
        responseBodyMode: Buffered
        allowModeOverride: true

Send a request through it

Port-forward the gateway and send an OpenAI-style request with "model": "auto". The router classifies the maths prompt, selects the maths route, and rewrites the request before agentgateway forwards it — the client never names a concrete model.

kubectl port-forward -n agentgateway-system svc/agentgateway-proxy 8080:80

curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
    ],
    "max_tokens": 64,
    "temperature": 0
  }'
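The same request from application code, as a minimal standard-library sketch. It assumes the port-forward above is running on localhost:8080; the helper names are hypothetical.

```python
import json
import urllib.request

# Matches the port-forward used in this guide.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt: str, url: str = GATEWAY_URL) -> urllib.request.Request:
    """Build the OpenAI-style POST; the client only ever names "auto"."""
    payload = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> dict:
    """Send the request and return the parsed JSON response (needs the port-forward)."""
    with urllib.request.urlopen(build_request(prompt), timeout=30) as response:
        return json.load(response)
```

Calling `ask("What is the derivative of f(x) = x^3?")` returns the parsed response. OpenAI-style responses carry a `model` field, and it is an assumption here that it reflects the model the router selected, so check it against the router logs rather than relying on it.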

Troubleshooting

If a request doesn't route as expected, walk the path from the gateway to the router to the backend.

# Gateway accepted and programmed?
kubectl get gateway agentgateway-proxy -n agentgateway-system
kubectl logs -n agentgateway-system deployment/agentgateway

# Routing resources accepted?
kubectl describe httproute semantic-router-vllm -n agentgateway-system
kubectl describe agentgatewaybackend semantic-router-vllm -n agentgateway-system

# ExtProc wired and the router healthy?
kubectl get svc semantic-router -n agentgateway-system
kubectl logs -n agentgateway-system deployment/semantic-router
kubectl describe agentgatewaypolicy semantic-router-extproc -n agentgateway-system

# Demo backend serving?
kubectl get pods -n default -l app=vllm-llama3-8b-instruct
kubectl logs -n default deployment/vllm-llama3-8b-instruct

Takeaways for production

The router is the brain, the gateway is the enforcement point. Keep routing logic, classifiers, and cost policy in the router; let agentgateway own auth, observability, and forwarding. Clean separation, one endpoint for clients.
Leave openai.model unset on the backend so the router's choice wins. Setting it pins every request to one model and defeats the point.
Mind the buffering. Buffered body mode is simplest, but for large prompts use FullDuplexStreamed so you don't hold whole requests in memory or add latency.
Treat PII and jailbreak detection as policy, not advice. The value is that these run at the gateway for every client uniformly — wire the actions (block, reroute to an in-cluster model) deliberately, and log them for audit.
Pin versions. ExtProc processingOptions/allowModeOverride need a recent agentgateway; confirm your distribution's CRDs before promoting beyond a test cluster.

Working with LoRA adapters

This guide leans on LoRA adapters throughout, so here is the practical detail in one place: what they are, where to get them, how to load them, and how the router consumes them. One framing first — adapters are a backend (vLLM) concern. agentgateway and the Semantic Router only ever pass a model-name string; the adapters themselves live entirely in the serving layer behind the AgentgatewayBackend.

The demo doesn't use real adapters. The backend in this guide (ghcr.io/llm-d/llm-d-inference-sim) is a simulator — the --lora-modules args just declare names it pretends to serve, with no weights and no training. For a real deployment you replace the simulator with actual vLLM serving real adapter files. Everything below assumes that.

What a LoRA adapter is

LoRA (Low-Rank Adaptation) is a lightweight fine-tuning technique. Rather than train a whole separate model per domain, you train a small set of extra weights — the adapter — that layers on top of one shared base model. The base model is loaded into GPU memory once; many adapters attach to it and can be swapped per request. That's how a single vLLM backend serves a base model plus several domain “experts” without the cost of several full models. Adapters are small — megabytes, not gigabytes.

Where to get the adapters

Train your own. LoRA fine-tune the base model on your domain data (maths, legal, support transcripts, …). The output is a small directory containing adapter_config.json and adapter_model.safetensors.
Download pre-trained ones. Hubs like Hugging Face host ready-made LoRA adapters. The hard requirement either way: an adapter must be trained against the same base model you're serving. A mismatched adapter will typically fail to load, and one that does load will not behave as intended.

How to load them in vLLM

Serve the base model with LoRA enabled, then register each adapter either statically at startup or dynamically at runtime.

Static — declared on the serve command. Each entry is name=path, where the path is a local directory or a model-hub repo:

vllm serve base-model \
  --enable-lora \
  --max-loras 6 --max-lora-rank 16 \
  --lora-modules \
    math-expert=/models/math-lora \
    law-expert=/models/law-lora \
    science-expert=/models/science-lora

Dynamic — load and unload without a restart. Set VLLM_ALLOW_RUNTIME_LORA_UPDATING=true on the server, then call the admin endpoints:

curl -X POST http://$BACKEND:8000/v1/load_lora_adapter \
  -H 'Content-Type: application/json' \
  -d '{"lora_name": "math-expert", "lora_path": "/models/math-lora"}'

curl -X POST http://$BACKEND:8000/v1/unload_lora_adapter \
  -H 'Content-Type: application/json' \
  -d '{"lora_name": "math-expert"}'

How to discover what's loaded

vLLM is OpenAI-compatible, so GET /v1/models lists the base model and every loaded adapter as its own entry (each adapter references the base model as its parent). This is the canonical answer to “what can I route to right now” — for an operator, a client, or for sanity-checking the router's config:

curl http://$BACKEND:8000/v1/models
# → base-model, math-expert, law-expert, science-expert, …
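For scripting, the listing can be reduced to a name-to-parent map. The sketch below parses a response in the OpenAI list shape; the sample payload is trimmed and made up for illustration, and the `parent` field is read as described above.

```python
import json
from typing import Dict, Optional


def loaded_models(models_response: str) -> Dict[str, Optional[str]]:
    """Map each model id to its parent (None for the base model)."""
    listing = json.loads(models_response)
    return {entry["id"]: entry.get("parent") for entry in listing.get("data", [])}


# Trimmed sample in the OpenAI list shape; the values are made up for illustration.
SAMPLE = json.dumps({
    "object": "list",
    "data": [
        {"id": "base-model", "object": "model", "parent": None},
        {"id": "math-expert", "object": "model", "parent": "base-model"},
    ],
})

models = loaded_models(SAMPLE)
adapters = sorted(name for name, parent in models.items() if parent)
```

Entries with a parent are adapters, and the entry without one is the base model they attach to.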

How they're consumed here

Once loaded, an adapter is addressed exactly like a model: put "model": "math-expert" in the request body and vLLM applies that adapter on top of the base model for the request. In this architecture the client never does that — it sends "model": "auto", and the Semantic Router writes the chosen adapter name into the body before agentgateway forwards it. So consuming the adapter is the router's job; your job is to load the adapters in vLLM and map each domain category to its adapter name in the router's configuration.

The name-matching contract — the thing that actually bites. The adapter name must be identical in three places:

1. how vLLM loaded it (--lora-modules or the load call),
2. what the Semantic Router config emits for that category,
3. what GET /v1/models reports (which is just #1 surfaced).

If the router emits math-expert but vLLM loaded it as math, the backend returns model not found. The base model name is part of this too — it's the fallback target for traffic the router doesn't classify into a specialist.
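The contract is easy to check mechanically before traffic finds the mismatch. This sketch compares the names a routing table would emit with the names the backend reports; both inputs are plain data you would gather yourself, and the function name is hypothetical.

```python
from typing import Dict, Iterable


def unroutable(router_targets: Dict[str, str], served: Iterable[str]) -> Dict[str, str]:
    """Return category -> model pairs whose model name the backend does not serve."""
    available = set(served)
    return {
        category: model
        for category, model in router_targets.items()
        if model not in available
    }


# Example: the adapter was loaded as "math" but the router emits "math-expert".
problems = unroutable(
    {"math": "math-expert", "law": "law-expert"},
    ["base-model", "math", "law-expert"],
)
```

Here `problems` contains only the maths entry, which is the mismatch described above. An empty result means every name the router can emit is one the backend serves.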

The full, always-current manifests live in the official guide: vllm-semantic-router.com/docs/installation/k8s/agentgateway. For where this sits alongside identity, authorization, and governance for agentic traffic, see Eight enterprise controls for MCP traffic in agentgateway policy and Securing MCP and agentic systems.