The story: a client sends a chat request to one endpoint and never names a concrete
model. It always sends "model": "auto". In front of the model sits agentgateway with the
vLLM Semantic Router attached as an external processor. The router classifies the prompt (math, law,
biology, business, and so on), checks it, picks the best model or LoRA adapter for that category, and
rewrites the request body. agentgateway then forwards the mutated request to the vLLM backend.
The backend here is a simulator that advertises a base model plus six mock LoRA adapters
(math-expert, science-expert, social-expert,
humanities-expert, law-expert, general-expert). The adapters
are names only: no weights, no training. That is all the routing path needs to prove out, and it keeps
the whole lab inside a laptop kind cluster.
This reproduces the masterthemesh article on the vLLM Semantic Router. It runs on the
OSS upstream agentgateway (cr.agentgateway.dev, the Linux Foundation
project). That distribution choice is not incidental: it is the one part of the stack that has to be
upstream, and the section below explains why.
What you'll build
Everything lands in one kind cluster across two namespaces. The vLLM simulator runs in
default; the gateway, the router, and the routing resources all live in
agentgateway-system.
- Client posts to /v1/chat/completions with "model": "auto".
- agentgateway buffers the request body and calls the router over gRPC ExtProc on :50051.
- Semantic Router classifies the prompt, picks a model or LoRA adapter, and rewrites the body.
- agentgateway forwards the mutated request to the vLLM backend on :8000.
- vLLM serves the completion with the adapter the router chose, and the response flows back.
The AgentgatewayBackend uses openai: {} with no model override. The model name is
taken from the request body, which is exactly the field the router rewrites. Pin a model on the backend
and it would override the router's decision, so the routing would silently do nothing.
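To make the rewrite concrete, here is a minimal sketch of what happens to the request body. The sed substitution only stands in for the router; it is an illustration of the before and after, not how the router is implemented, and the function name rewrite_model is made up for this example.

```shell
#!/usr/bin/env bash
# Illustration only: sed stands in for the router's rewrite step.
rewrite_model() {
  # $1 = adapter name to write into the body (letters, digits, hyphens only)
  sed -E "s/\"model\"[[:space:]]*:[[:space:]]*\"auto\"/\"model\":\"$1\"/"
}

request='{"model":"auto","messages":[{"role":"user","content":"What is the derivative of x^3?"}]}'
printf '%s\n' "$request" | rewrite_model math-expert
# prints: {"model":"math-expert","messages":[{"role":"user","content":"What is the derivative of x^3?"}]}
```

Only the model field changes. That is why the backend must not pin its own model: whatever arrives in that field is what vLLM serves.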
Why OSS upstream agentgateway, not Solo
The router works by having the gateway buffer the request body, hand it to the router over ExtProc, and
then forward the body the router rewrites. That requires ExtProc body-mode control on the policy:
processingOptions with requestBodyMode: Buffered and
allowModeOverride: true. Those fields exist on the upstream agentgateway CRD.
Solo's agentgateway CRDs do not expose them. Both the OSS-packaged agentgateway.dev set and
the Enterprise enterpriseagentgateway.solo.io set have extProc.backendRef only,
with no processingOptions. Tested on Solo Enterprise v2.3.3: the router classifies the prompt
correctly (the decision shows up in its logs), but the rewritten body is dropped and the backend returns
503 ... EOF while parsing on an empty body. So this lab installs the upstream agentgateway,
where the body actually gets forwarded.
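A quick way to tell which CRD set a cluster is running is to search the installed schema for the field. This is a minimal sketch; the CRD name agentgatewaypolicies.agentgateway.dev is an assumption inferred from the kind and API group used in this lab, so adjust it if your install names it differently.

```shell
#!/usr/bin/env bash
# Succeeds only if the CRD schema read from stdin mentions processingOptions.
crd_has_processing_options() {
  grep -q 'processingOptions'
}

# Usage against a live cluster (the CRD name is an assumption):
#   kubectl get crd agentgatewaypolicies.agentgateway.dev -o yaml \
#     | crd_has_processing_options && echo "ExtProc body-mode control available"
```

If the check fails, the policy in Step 5 may be rejected or have its processingOptions block ignored, and the rewritten body would not reach the backend.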
"model": "auto". The router classifies it and rewrites the body,
and the decision comes back in the x-vsr-selected-category and
x-vsr-selected-model response headers.
| Prompt | category | adapter |
|---|---|---|
| What is the derivative of x^3? | math | math-expert |
| What are the elements of a valid contract? | law | law-expert |
| How do mRNA vaccines trigger an immune response? | health | science-expert |
| How should a startup price a SaaS product? | business | social-expert |
| What caused the fall of the Roman Republic? | history | humanities-expert |
Step 1: cluster and Gateway API CRDs
A two-node kind cluster and the standard Gateway API CRDs. No LoadBalancer is needed: the lab reaches the
gateway with kubectl port-forward.
Bash: scripts/01-cluster.sh
kind create cluster --config kind/cluster.yaml
kubectl apply -f \
  https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.1/standard-install.yaml
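The contents of kind/cluster.yaml are not shown in this lab. As a rough sketch, a two-node kind config could look like the one written below; the node layout (one control-plane, one worker) and the helper name write_kind_config are assumptions, not the lab's actual file.

```shell
#!/usr/bin/env bash
# Writes a hypothetical two-node kind config to the path given as $1.
write_kind_config() {
  mkdir -p "$(dirname "$1")"
  cat > "$1" <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
EOF
}

# Usage: write_kind_config kind/cluster.yaml
```

The helper takes the path as an argument so it never overwrites an existing config by accident.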
Step 2: install agentgateway
The OSS charts pull anonymously from cr.agentgateway.dev, so there is no license and no
registry login. The experimental Gateway API features flag is set because ExtProc depends on those features.
Bash: scripts/02-agentgateway.sh
export AGENTGATEWAY_VERSION=v1.3.0-alpha.1
helm upgrade -i agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds \
  --namespace agentgateway-system --create-namespace \
  --version "$AGENTGATEWAY_VERSION" \
  --set controller.image.pullPolicy=Always --wait
helm upgrade -i agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
  --namespace agentgateway-system \
  --version "$AGENTGATEWAY_VERSION" \
  --set controller.image.pullPolicy=Always \
  --set controller.extraEnv.KGW_ENABLE_GATEWAY_API_EXPERIMENTAL_FEATURES=true \
  --wait
kubectl get gatewayclass agentgateway
Step 3: deploy the vLLM backend
The llm-d inference simulator speaks the OpenAI wire format and advertises a base model plus
six mock LoRA adapters. The adapters are declared as names only, with --max-loras 6.
YAML: yaml/vllm/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
  labels:
    app: vllm-llama3-8b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
        - name: vllm-sim
          image: ghcr.io/llm-d/llm-d-inference-sim:v0.5.0
          args:
            - --model
            - base-model
            - --port
            - "8000"
            - --max-loras
            - "6"
            - --lora-modules
            - '{"name": "math-expert"}'
            - '{"name": "science-expert"}'
            - '{"name": "social-expert"}'
            - '{"name": "humanities-expert"}'
            - '{"name": "law-expert"}'
            - '{"name": "general-expert"}'
          ports:
            - containerPort: 8000
              name: http
Step 4: install the Semantic Router
The router is gateway-agnostic. It is just a gRPC ExtProc server. The lab installs the upstream Helm chart with the vendored agentgateway preset values, which point the router at the in-cluster vLLM Service and map each prompt category to a LoRA adapter. First start downloads classification models, so the wait is generous.
Bash: scripts/04-semantic-router.sh
helm upgrade --install semantic-router \
  oci://ghcr.io/vllm-project/charts/semantic-router \
  --version v0.0.0-latest \
  --namespace agentgateway-system \
  -f yaml/semantic-router/values.yaml \
  --wait
The values file defines a base-model whose backend_refs endpoint is
vllm-llama3-8b-instruct.default.svc.cluster.local:8000, plus a routing
table that maps domains to adapters: math to math-expert, law to law-expert,
biology and chemistry to science-expert, business and economics to
social-expert, history and psychology to humanities-expert.
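The routing table is easiest to read as a plain lookup. The function below is a sketch of that mapping in shell, not the router's config format. The health entry comes from the results table earlier in this lab, and the fallback to general-expert for any other category is an assumption.

```shell
#!/usr/bin/env bash
# Maps a classified category to the adapter name the router would emit.
adapter_for_category() {
  case "$1" in
    math)                      echo math-expert ;;
    law)                       echo law-expert ;;
    biology|chemistry|health)  echo science-expert ;;
    business|economics)        echo social-expert ;;
    history|psychology)        echo humanities-expert ;;
    *)                         echo general-expert ;;  # assumption: catch-all
  esac
}

adapter_for_category economics
# prints: social-expert
```

Whatever name the lookup returns must match an adapter the backend actually advertises, otherwise the rewritten request asks for a model that does not exist.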
Step 5: wire the routing resources
Four resources complete the data path: the Gateway on the agentgateway class, an AgentgatewayBackend for the vLLM endpoint, an HTTPRoute pointing at that backend, and the policy that attaches the router as ExtProc.
Gateway and AI backend
YAML: yaml/agentgateway/gateway.yaml + backend.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: vllm-gateway
  namespace: agentgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai: {}
      host: vllm-llama3-8b-instruct.default.svc.cluster.local
      port: 8000
HTTPRoute and the ExtProc policy
The route's backendRef points at the AgentgatewayBackend, not a Service. The policy attaches
the router to the Gateway. The processingOptions block is what makes body rewriting work:
requestBodyMode: Buffered sends the full body to the router, and
allowModeOverride: true lets the router's response update how the body is handled so the
rewrite is forwarded upstream.
YAML: yaml/agentgateway/httproute.yaml + extproc-policy.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  parentRefs:
    - name: vllm-gateway
      namespace: agentgateway-system
  rules:
    - backendRefs:
        - name: semantic-router-vllm
          namespace: agentgateway-system
          group: agentgateway.dev
          kind: AgentgatewayBackend
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: semantic-router-extproc
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: vllm-gateway
  traffic:
    extProc:
      backendRef:
        name: semantic-router
        namespace: agentgateway-system
        port: 50051
      processingOptions:
        requestHeaderMode: Send
        requestBodyMode: Buffered
        responseHeaderMode: Send
        responseBodyMode: Buffered
        allowModeOverride: true
Send a request
Port-forward the gateway, then send any of the prompts below. Every request carries the same
"model": "auto"; the router's pick comes back in the
x-vsr-selected-category and x-vsr-selected-model response headers, which
curl -i prints. The forward uses local port
18080 to avoid clashing with anything already on 8080 (OrbStack and many dev tools sit there); any
free port works.
Bash: port-forward the gateway first
kubectl -n agentgateway-system port-forward svc/vllm-gateway 18080:80
Classified math, routed to math-expert.
curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"What is the derivative of x^3?"}]}'
# x-vsr-selected-category: math
# x-vsr-selected-model: math-expert

Classified law, routed to law-expert.
curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"What are the elements of a valid contract?"}]}'
# x-vsr-selected-category: law
# x-vsr-selected-model: law-expert

Classified health, routed to science-expert.
curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"How do mRNA vaccines trigger an immune response?"}]}'
# x-vsr-selected-category: health
# x-vsr-selected-model: science-expert

Classified business, routed to social-expert.
curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"How should a startup price a SaaS product?"}]}'
# x-vsr-selected-category: business
# x-vsr-selected-model: social-expert

Classified history, routed to humanities-expert.
curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"What caused the fall of the Roman Republic?"}]}'
# x-vsr-selected-category: history
# x-vsr-selected-model: humanities-expert
Different categories returning different adapters is the proof: the router rewrote the body and the
gateway forwarded the mutation. ./scripts/test.sh runs all five and prints the routed
adapter for each.
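For a quick look at just the routing decision, the two headers can be pulled out of the curl -i output. The helper below is a sketch and is not the contents of scripts/test.sh; the name parse_vsr_headers and the "category -> model" output format are made up for this example.

```shell
#!/usr/bin/env bash
# Reads an HTTP response (as printed by curl -i) on stdin and prints
# "<category> -> <model>" taken from the x-vsr-selected-* headers.
parse_vsr_headers() {
  tr -d '\r' | awk -F': ' '
    tolower($1) == "x-vsr-selected-category" { category = $2 }
    tolower($1) == "x-vsr-selected-model"    { model = $2 }
    /^$/ { exit }                       # blank line: headers are over
    END  { print category " -> " model }
  '
}

# Usage (with the port-forward from above running):
#   curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model":"auto","messages":[{"role":"user","content":"What is the derivative of x^3?"}]}' \
#     | parse_vsr_headers
```

Header names are compared case-insensitively because HTTP does not guarantee their casing on the wire.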
Troubleshooting
If every prompt routes to the same adapter, the body rewrite did not take effect. Start with the router's own decision logs, then confirm the policy attached.
Bash: where the routing decision lives
# Router decision logs (classification + chosen adapter)
kubectl -n agentgateway-system logs deploy/semantic-router
# Policy and route accepted?
kubectl -n agentgateway-system describe agentgatewaypolicy semantic-router-extproc
kubectl -n agentgateway-system describe httproute semantic-router-vllm
# Backend serving + adapters loaded?
kubectl -n default exec deploy/vllm-llama3-8b-instruct -- wget -qO- localhost:8000/v1/models
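The "every prompt routes to the same adapter" symptom can be checked mechanically once you have collected the adapter chosen for each prompt. A small sketch, with a made-up helper name, that takes one adapter name per line:

```shell
#!/usr/bin/env bash
# Reads adapter names on stdin, one per line.
# Succeeds (exit 0) when every line is the same adapter, which is the symptom
# of a rewrite that did not take effect; fails when the names differ or
# the input is empty.
all_same_adapter() {
  awk '!seen[$0]++ { n++ } END { exit !(n == 1) }'
}

if printf '%s\n' general-expert general-expert general-expert | all_same_adapter; then
  echo "suspicious: every prompt got the same adapter"
fi
# prints: suspicious: every prompt got the same adapter
```

A single identical answer across math, law, and history prompts points at the gateway dropping the rewrite rather than at the classifier.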
Working with real LoRA adapters
The simulator declares adapter names with no weights. For production, swap it for real vLLM serving real
adapter files. The contract that makes routing work is name matching: the adapter name must be identical
across how vLLM loaded it, what the router config emits per category, and what GET /v1/models
reports. A mismatch returns "model not found". An adapter must be trained against the same base model
being served, or vLLM refuses to load it.
Bash: static load at startup, then discover
vllm serve base-model \
  --enable-lora --max-loras 6 --max-lora-rank 16 \
  --lora-modules math-expert=/models/math-lora law-expert=/models/law-lora
curl http://$BACKEND:8000/v1/models # base model + each adapter
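Because routing depends on exact name matching, it helps to compare the names the router emits against what /v1/models reports. The helper below is a sketch under the assumption that the response follows the OpenAI list shape, with each model carrying an "id" field; the name missing_adapters is hypothetical.

```shell
#!/usr/bin/env bash
# Reads a /v1/models JSON response on stdin and prints every expected name
# (given as arguments) that does not appear as an "id" in the response.
missing_adapters() {
  local ids name
  ids=$(grep -oE '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
        | sed -E 's/.*:[[:space:]]*"([^"]*)"$/\1/')
  for name in "$@"; do
    printf '%s\n' "$ids" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Usage:
#   curl -sS http://$BACKEND:8000/v1/models | missing_adapters math-expert law-expert
```

Empty output means every expected adapter is being served; any name it prints is one the router could select but the backend would answer with "model not found".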
See also
- vLLM Semantic Router — agentgateway install docs (canonical, OSS)
- OSS agentgateway project
- Sibling lab — AI Data Loss Prevention with promptGuard on kind
Versions
Built and verified on:
v1.3.0-alpha.1, v1.5.0