MastertheMesh
Solo · agentgateway · ExtProc · vLLM · Semantic Router · LoRA · kind

vLLM Semantic Router on agentgateway — model-aware routing as ExtProc

Tom O'Rourke
EMEA Field CTO · Solo.io

One endpoint, one model name: "model": "auto". The vLLM Semantic Router runs inline as a gRPC ExtProc on the gateway, classifies each prompt, and rewrites the request body to pick a LoRA adapter on a vLLM backend. The router decides, the gateway enforces. It runs on the OSS upstream agentgateway, in a single kind cluster.

The story: a client sends a chat request to one endpoint and never names a concrete model. It always sends "model": "auto". In front of the model sits agentgateway with the vLLM Semantic Router attached as an external processor. The router classifies the prompt (math, law, biology, business, and so on), checks it, picks the best model or LoRA adapter for that category, and rewrites the request body. agentgateway then forwards the mutated request to the vLLM backend.

The backend here is a simulator that advertises a base model plus six mock LoRA adapters (math-expert, science-expert, social-expert, humanities-expert, law-expert, general-expert). The adapters are names only: no weights, no training. Names are all the routing path needs to prove itself, which keeps the whole lab inside a laptop kind cluster.

This reproduces the masterthemesh article on the vLLM Semantic Router. It runs on the OSS upstream agentgateway (cr.agentgateway.dev, the Linux Foundation project). That distribution choice is not incidental: it is the one part of the stack that has to be upstream, and the section below explains why.

What you'll build

Everything lands in one kind cluster across two namespaces. The vLLM simulator runs in default; the gateway, the router, and the routing resources all live in agentgateway-system.

  1. Client posts to /v1/chat/completions with "model": "auto".
  2. agentgateway buffers the request body and calls the router over gRPC ExtProc on :50051.
  3. Semantic Router classifies the prompt, picks a model or LoRA adapter, and rewrites the body.
  4. agentgateway forwards the mutated request to the vLLM backend on :8000.
  5. vLLM serves the completion with the adapter the router chose, and the response flows back.
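
The effect of step 3 on the request body can be pictured with a throwaway shell function. This is only an illustration of what changes in the JSON payload, not how the Semantic Router implements the rewrite: rewrite_model is a hypothetical helper, and its sed pattern naively replaces the first "model" field on each line instead of parsing JSON.

```shell
# Illustration only: mimic the effect of the router's rewrite on a request body.
# rewrite_model is a hypothetical helper, not part of the router or the gateway.
rewrite_model() {
  # $1: the adapter name that should replace the current "model" value.
  # Replaces the first "model": "<anything>" pair found on each input line.
  sed -E 's/"model"[[:space:]]*:[[:space:]]*"[^"]*"/"model":"'"$1"'"/'
}

body='{"model":"auto","messages":[{"role":"user","content":"What is the derivative of x^3?"}]}'
printf '%s\n' "$body" | rewrite_model math-expert
```

The client's "auto" placeholder is the only field that changes; the messages array passes through untouched, which is why the backend must read the model name from the body.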

Why the backend pins no model

The AI backend sets openai: {} with no model override. The model name is taken from the request body, which is exactly the field the router rewrites. Pin a model on the backend and it would override the router's decision, so the routing would silently do nothing.

Why OSS upstream agentgateway, not Solo

The router works by having the gateway buffer the request body, hand it to the router over ExtProc, and then forward the body the router rewrites. That requires ExtProc body-mode control on the policy: processingOptions with requestBodyMode: Buffered and allowModeOverride: true. Those fields exist on the upstream agentgateway CRD.

Solo's agentgateway CRDs do not expose them. Both the OSS-packaged agentgateway.dev set and the Enterprise enterpriseagentgateway.solo.io set have extProc.backendRef only, with no processingOptions. Tested on Solo Enterprise v2.3.3: the router classifies the prompt correctly (the decision shows up in its logs), but the rewritten body is dropped and the backend returns 503 ... EOF while parsing on an empty body. So this lab installs the upstream agentgateway, where the body actually gets forwarded.

The same prompts, the same endpoint, five different adapters

Each request below carries "model": "auto". The router classifies it and rewrites the body, and the decision comes back in the x-vsr-selected-category and x-vsr-selected-model response headers.
Prompt                                              Category   Adapter
What is the derivative of x^3?                      math       math-expert
What are the elements of a valid contract?          law        law-expert
How do mRNA vaccines trigger an immune response?    health     science-expert
How should a startup price a SaaS product?          business   social-expert
What caused the fall of the Roman Republic?         history    humanities-expert

Step 1: cluster and Gateway API CRDs

A two-node kind cluster and the standard Gateway API CRDs. No LoadBalancer is needed: the lab reaches the gateway with kubectl port-forward.

Bash · scripts/01-cluster.sh
kind create cluster --config kind/cluster.yaml

kubectl apply -f \
  https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.1/standard-install.yaml

Step 2: install agentgateway

The OSS charts pull anonymously from cr.agentgateway.dev, so there is no license and no registry login. The experimental Gateway API features flag is set because ExtProc rides on them.

Bash · scripts/02-agentgateway.sh
export AGENTGATEWAY_VERSION=v1.3.0-alpha.1

helm upgrade -i agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds \
  --namespace agentgateway-system --create-namespace \
  --version "$AGENTGATEWAY_VERSION" \
  --set controller.image.pullPolicy=Always --wait

helm upgrade -i agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
  --namespace agentgateway-system \
  --version "$AGENTGATEWAY_VERSION" \
  --set controller.image.pullPolicy=Always \
  --set controller.extraEnv.KGW_ENABLE_GATEWAY_API_EXPERIMENTAL_FEATURES=true \
  --wait

kubectl get gatewayclass agentgateway

Step 3: deploy the vLLM backend

The llm-d inference simulator speaks the OpenAI wire format and advertises a base model plus six mock LoRA adapters. The adapters are declared as names only, with --max-loras 6.

YAML · yaml/vllm/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
  labels:
    app: vllm-llama3-8b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
        - name: vllm-sim
          image: ghcr.io/llm-d/llm-d-inference-sim:v0.5.0
          args:
            - --model
            - base-model
            - --port
            - "8000"
            - --max-loras
            - "6"
            - --lora-modules
            - '{"name": "math-expert"}'
            - '{"name": "science-expert"}'
            - '{"name": "social-expert"}'
            - '{"name": "humanities-expert"}'
            - '{"name": "law-expert"}'
            - '{"name": "general-expert"}'
          ports:
            - containerPort: 8000
              name: http

Step 4: install the Semantic Router

The router is gateway-agnostic: it is just a gRPC ExtProc server. The lab installs the upstream Helm chart with the vendored agentgateway preset values, which point the router at the in-cluster vLLM Service and map each prompt category to a LoRA adapter. The first start downloads the classification models, so the wait is generous.

Bash · scripts/04-semantic-router.sh
helm upgrade --install semantic-router \
  oci://ghcr.io/vllm-project/charts/semantic-router \
  --version v0.0.0-latest \
  --namespace agentgateway-system \
  -f yaml/semantic-router/values.yaml \
  --wait

What the preset values encode

A model named base-model whose backend_refs endpoint is vllm-llama3-8b-instruct.default.svc.cluster.local:8000, plus a routing table that maps domains to adapters: math to math-expert, law to law-expert, biology and chemistry to science-expert, business and economics to social-expert, history and psychology to humanities-expert.
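
As a rough mental model, the routing table behaves like a lookup from category to adapter name. The sketch below restates the mapping described above as a shell case statement. The helper name adapter_for is hypothetical; health is included only because the results table earlier shows it landing on science-expert; and the fallback to general-expert is an assumption, since the source does not say what unmatched categories receive.

```shell
# Hypothetical helper mirroring the domain-to-adapter table described above.
# It is not the router's config format, only a restatement of the mapping.
adapter_for() {
  case "$1" in
    math)                      echo math-expert ;;
    law)                       echo law-expert ;;
    biology|chemistry|health)  echo science-expert ;;
    business|economics)        echo social-expert ;;
    history|psychology)        echo humanities-expert ;;
    *)                         echo general-expert ;;  # assumed fallback, not confirmed by the source
  esac
}

adapter_for chemistry
adapter_for economics
```

Whatever name the lookup produces is what the router writes into the body's "model" field, so every value on the right-hand side has to exist on the vLLM backend.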

Step 5: wire the routing resources

Four resources complete the data path: the Gateway on the agentgateway class, an AgentgatewayBackend for the vLLM endpoint, an HTTPRoute pointing at that backend, and the policy that attaches the router as ExtProc.

Gateway and AI backend

YAML · yaml/agentgateway/gateway.yaml + backend.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: vllm-gateway
  namespace: agentgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai: {}
      host: vllm-llama3-8b-instruct.default.svc.cluster.local
      port: 8000

HTTPRoute and the ExtProc policy

The route's backendRef points at the AgentgatewayBackend, not a Service. The policy attaches the router to the Gateway. The processingOptions block is what makes body rewriting work: requestBodyMode: Buffered sends the full body to the router, and allowModeOverride: true lets the router's response update how the body is handled so the rewrite is forwarded upstream.

YAML · yaml/agentgateway/httproute.yaml + extproc-policy.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  parentRefs:
    - name: vllm-gateway
      namespace: agentgateway-system
  rules:
    - backendRefs:
        - name: semantic-router-vllm
          namespace: agentgateway-system
          group: agentgateway.dev
          kind: AgentgatewayBackend
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: semantic-router-extproc
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: vllm-gateway
  traffic:
    extProc:
      backendRef:
        name: semantic-router
        namespace: agentgateway-system
        port: 50051
      processingOptions:
        requestHeaderMode: Send
        requestBodyMode: Buffered
        responseHeaderMode: Send
        responseBodyMode: Buffered
        allowModeOverride: true

Send a request

Port-forward the gateway, then send any of the prompts below. Every request carries the same "model": "auto"; the router's pick comes back in the x-vsr-selected-category and x-vsr-selected-model response headers, which curl's -i flag prints. The forward uses local port 18080 to avoid clashing with anything already on 8080 (OrbStack and many dev tools sit there); any free port works.

Bash · port-forward the gateway first
kubectl -n agentgateway-system port-forward svc/vllm-gateway 18080:80

Classified math, routed to math-expert.

curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"What is the derivative of x^3?"}]}'
# x-vsr-selected-category: math
# x-vsr-selected-model:    math-expert

Classified law, routed to law-expert.

curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"What are the elements of a valid contract?"}]}'
# x-vsr-selected-category: law
# x-vsr-selected-model:    law-expert

Classified health, routed to science-expert.

curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"How do mRNA vaccines trigger an immune response?"}]}'
# x-vsr-selected-category: health
# x-vsr-selected-model:    science-expert

Classified business, routed to social-expert.

curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"How should a startup price a SaaS product?"}]}'
# x-vsr-selected-category: business
# x-vsr-selected-model:    social-expert

Classified history, routed to humanities-expert.

curl -sS -i -X POST http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"What caused the fall of the Roman Republic?"}]}'
# x-vsr-selected-category: history
# x-vsr-selected-model:    humanities-expert

Different categories returning different adapters is the proof: the router rewrote the body and the gateway forwarded the mutation. ./scripts/test.sh runs all five and prints the routed adapter for each.
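
The contents of scripts/test.sh are not reproduced here. As a stand-in, the sketch below shows one way to pull the two decision headers out of a raw curl -i response. The function name vsr_decision is hypothetical; it assumes the response headers arrive one per line, as curl -i prints them, and it stops reading at the blank line that ends the header block.

```shell
# Hypothetical helper: summarise the router's decision from a raw HTTP response.
# Reads the output of `curl -i` on stdin and prints "<category> -> <model>".
vsr_decision() {
  awk -F': *' '
    { sub(/\r$/, "") }                                    # strip the CR that HTTP puts before each newline
    tolower($1) == "x-vsr-selected-category" { cat = $2 }
    tolower($1) == "x-vsr-selected-model"    { model = $2 }
    /^$/ { exit }                                         # blank line marks the end of the headers
    END  { print cat " -> " model }
  '
}

# Offline demonstration with a canned response instead of a live gateway.
printf 'HTTP/1.1 200 OK\r\nx-vsr-selected-category: math\r\nx-vsr-selected-model: math-expert\r\n\r\n{}' \
  | vsr_decision
```

Against the running lab, the same function can be fed from any of the curl commands above by appending `| vsr_decision`. Five prompts that print five different right-hand sides confirm that the rewrite reached the backend.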

Troubleshooting

If every prompt routes to the same adapter, the body rewrite did not take effect. Start with the router's own decision logs, then confirm the policy attached.

Bash · where the routing decision lives
# Router decision logs (classification + chosen adapter)
kubectl -n agentgateway-system logs deploy/semantic-router

# Policy and route accepted?
kubectl -n agentgateway-system describe agentgatewaypolicy semantic-router-extproc
kubectl -n agentgateway-system describe httproute semantic-router-vllm

# Backend serving + adapters loaded?
kubectl -n default exec deploy/vllm-llama3-8b-instruct -- wget -qO- localhost:8000/v1/models

Working with real LoRA adapters

The simulator declares adapter names with no weights. For production, swap it for real vLLM serving real adapter files. The contract that makes routing work is name matching: the adapter name must be identical across how vLLM loaded it, what the router config emits per category, and what GET /v1/models reports. A mismatch returns "model not found". An adapter must be trained against the same base model being served, or vLLM refuses to load it.

Bash · static load at startup, then discover
vllm serve base-model \
  --enable-lora --max-loras 6 --max-lora-rank 16 \
  --lora-modules math-expert=/models/math-lora law-expert=/models/law-lora

curl http://$BACKEND:8000/v1/models   # base model + each adapter
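
The name-matching contract can be checked mechanically. The sketch below compares the adapter names the router is expected to emit against the ids in a /v1/models response. The function name missing_adapters is hypothetical; it assumes an OpenAI-style list response in which every model appears as an "id" field, and it relies on grep -o, which GNU and BSD grep both provide. It pattern-matches text rather than parsing JSON, so treat it as a quick check only.

```shell
# Hypothetical helper: print every expected adapter name that the backend does not list.
# stdin: JSON body of GET /v1/models; args: adapter names the router may emit.
missing_adapters() (
  # Collect every "id" value from the response, one per line.
  ids="$(grep -oE '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed -E 's/.*"([^"]*)"$/\1/')"
  for name in "$@"; do
    # Exact, whole-line match so math-expert does not match math-expert-v2.
    printf '%s\n' "$ids" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
)

# Offline demonstration with a canned response.
sample='{"object":"list","data":[{"id":"base-model"},{"id":"math-expert"}]}'
printf '%s' "$sample" | missing_adapters math-expert law-expert
```

Against a live backend, the input would come from `curl -s http://$BACKEND:8000/v1/models`. Any name the function prints is one the router could select but the backend would reject.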

Versions

Built and verified on:

Component            Version
agentgateway (OSS)   v1.3.0-alpha.1
Gateway API          v1.5.0