MastertheMesh
kgateway · agentgateway · Reference

External Processing (ExtProc) — streaming request/response mutation

Tom O'Rourke
EMEA Field CTO · Solo.io

ExtProc is the Envoy filter that hands HTTP requests and responses to your own gRPC service over a bidirectional stream. You can read and mutate any phase (headers, body bytes, trailers) in both directions. That's the lever for prompt guards, PII redaction, semantic caching, response signing, body-shape transformations, and dynamic routing decisions that need more than a header lookup. Companion to the Solo ext-auth-service field guide.

ExtProc · ext_proc.v3 · processingMode · GatewayExtension · EnterpriseKgatewayTrafficPolicy · EnterpriseAgentgatewayPolicy · vs ExtAuth

ExtAuth answers one question: should this request proceed? ExtProc answers a different one: what should this request or response actually look like? Same external-service pattern, very different surface area. This page walks the six processing phases, the GatewayExtension + traffic-policy wiring on both kgateway and agentgateway, and a small Python (and Go) server you can deploy in a kind cluster in under a minute.

Companion reference: Solo external auth service, the bouncer at the door. ExtProc is what runs after the bouncer waves the request through.

1 · What ExtProc can do

ExtProc is defined by envoy.service.ext_proc.v3.ExternalProcessor. Envoy holds a single bidirectional gRPC stream open per HTTP request and sends a ProcessingRequest message at each enabled phase. Your server replies with a ProcessingResponse that can carry a header mutation, a body mutation, an immediate response (terminate the request with a status code), or a dynamic-metadata write. The stream stays open until the request ends.

| Capability | Direction | Notes |
| --- | --- | --- |
| Add, set, remove, append headers | req & resp | Same shape as ExtAuth's header mutation, but on every phase. |
| Replace the body | req & resp | Buffered or streamed. The buffer cap is configurable on the gateway. |
| Inspect trailers | req & resp | Useful for HTTP/2 gRPC where status arrives in trailers. |
| Immediate response | any phase | Terminate the request from the ExtProc server (status, headers, body). The way to short-circuit on a guardrail hit. |
| Dynamic metadata | any phase | Write structured metadata that downstream filters (auth, rate-limit, telemetry) can read. |
| Dynamic routing decisions | request headers / body | Change the route cluster or override weights before Envoy commits. |
| Async observability | any phase | Just acknowledge the message, do the work in the background. The stream stays open and Envoy keeps going. |

vs ExtAuth. ExtAuth runs before routing, gets request attributes, returns OK or Denied with optional header changes, then it's done. ExtProc runs through the lifecycle, gets bidirectional streaming, sees the body, and can mutate the response. If you need to touch the body, you need ExtProc. If you need to allow or deny, ExtAuth is cheaper.

2 · The processing flow

Envoy can send your server up to six phases per request. You opt into each phase via processingMode on the GatewayExtension. Skip the phases you don't need: every enabled phase is a gRPC round-trip.

REQ Request headers requestHeaderMode

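Fires after Envoy parses the request line and headers, before the request is forwarded upstream. Your reply can mutate headers, influence routing (in Envoy's API, mutated headers affect route selection when the reply also sets clear_route_cache), or short-circuit with an immediate response.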

REQ Request body requestBodyMode

BUFFERED sends the whole body in one message, STREAMED sends it chunk by chunk, BUFFERED_PARTIAL sends up to the configured cap. NONE skips the phase.

REQ Request trailers requestTrailerMode

Only fires when the request has trailers (HTTP/2 gRPC, chunked uploads with trailers).

RESP Response headers responseHeaderMode

Fires after the upstream responds, before bytes are sent downstream. Same mutation surface as the request side.

RESP Response body responseBodyMode

This is the LLM-streaming phase. STREAMED lets you act on SSE chunks as they arrive. BUFFERED defeats streaming: the client sees nothing until the whole response is collected.

RESP Response trailers responseTrailerMode

gRPC status lives here. Inspect to record success/failure metrics.
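As a rough illustration of the trailers phase, the sketch below classifies a response from its grpc-status trailer. It works on a plain dict rather than the real ext_proc message types, and the function name grpc_outcome is hypothetical; the only protocol fact it relies on is that a grpc-status of 0 means OK.

```python
def grpc_outcome(trailers: dict) -> str:
    """Classify a response by its grpc-status trailer.

    Returns "ok", "error", or "unknown" when no grpc-status is present.
    """
    # Trailer names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in trailers.items()}
    status = lowered.get("grpc-status")
    if status is None:
        return "unknown"
    # gRPC status code 0 is OK; every other code is a failure of some kind.
    return "ok" if status.strip() == "0" else "error"


print(grpc_outcome({"grpc-status": "0"}))
print(grpc_outcome({"Grpc-Status": "14", "grpc-message": "unavailable"}))
print(grpc_outcome({}))
```

In a real server the same check would run on the trailers carried by the response-trailers message, and the result would feed a metrics counter rather than print.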

Each enabled phase is a separate ProcessingRequest / ProcessingResponse exchange on the same stream. The server can also tell Envoy "skip the rest" by setting mode_override on a response, which is useful when the request-headers phase already told you nothing else needs inspection. Envoy honours the override only when the filter is configured to allow mode overrides.

Default-deny on phases. Start with everything set to SKIP / NONE and turn on only what you need. A response-body filter that defaults to BUFFERED on a streaming LLM endpoint will silently break SSE and add seconds of latency before anyone notices. Be explicit.
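To make the cost of each setting concrete, here is a minimal sketch that lists which phases a processingMode-style dict would enable. It assumes Envoy's documented defaults (headers are sent, bodies are NONE, trailers are skipped) and that the gateway passes them through unchanged; enabled_phases is a hypothetical helper, not part of either product.

```python
# Envoy's documented defaults when a field is omitted or set to DEFAULT.
HEADER_DEFAULT = "SEND"
BODY_DEFAULT = "NONE"
TRAILER_DEFAULT = "SKIP"


def enabled_phases(mode: dict) -> list:
    """Return the phases Envoy would send for a processingMode-style dict."""
    phases = []
    for direction in ("request", "response"):
        header = mode.get(direction + "HeaderMode", "DEFAULT")
        if header == "DEFAULT":
            header = HEADER_DEFAULT
        if header == "SEND":
            phases.append(direction + " headers")

        body = mode.get(direction + "BodyMode", BODY_DEFAULT)
        if body != "NONE":
            phases.append(direction + " body (" + body + ")")

        trailer = mode.get(direction + "TrailerMode", "DEFAULT")
        if trailer == "DEFAULT":
            trailer = TRAILER_DEFAULT
        if trailer == "SEND":
            phases.append(direction + " trailers")
    return phases


# An empty mode still costs two exchanges per request.
print(enabled_phases({}))
# A response-header-only filter costs one.
print(enabled_phases({"requestHeaderMode": "SKIP", "responseHeaderMode": "SEND"}))
```

The point of the exercise: leaving processingMode empty is not the same as disabling everything, so a header-only response filter should set requestHeaderMode to SKIP explicitly.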

3 · How it's wired in each product

Both products expose the ExtProc gRPC backend via a traffic policy that attaches to an HTTPRoute. agentgateway folds everything (backend + scope) into one AgentgatewayPolicy and ships with sensible streaming defaults; kgateway splits the wiring across a GatewayExtension + a TrafficPolicy and gives you full per-phase control via Envoy-native processingMode. The two products are covered in turn below, agentgateway first and then kgateway; each has its own knobs (or absence of knobs) to be aware of.

agentgateway

One CRD does the whole wiring. AgentgatewayPolicy (group agentgateway.dev/v1alpha1, the OSS form) attaches to an HTTPRoute via targetRefs and points at your ExtProc Service with a plain backendRef. As of v2026.5.1 the extProc block accepts backendRef and an optional conditional[] for CEL-based backend switching. There is no per-phase opt-in/skip field in this release: the binary streams every phase by default. Setup is minimal, and response-body buffering on streaming endpoints cannot occur as a misconfiguration. The trade-off is that phases you don't need cannot yet be skipped on the policy; see the "On the roadmap" callout below.

apiVersion: v1
kind: Service
metadata:
  name: redact-extproc
  namespace: agentgateway-system
spec:
  selector: { app: redact-extproc }
  ports:
  - { port: 4444, targetPort: 18080, appProtocol: kubernetes.io/h2c }
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: redact
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind:  HTTPRoute
    name:  llm-openai
  traffic:
    extProc:
      backendRef:
        name: redact-extproc
        port: 4444
      # No processingMode equivalent ships today; see the roadmap note below.
On the roadmap. A processingOptions block under extProc is planned for a future release, providing the agentgateway analogue of kgateway's processingMode with PascalCase enum values:
traffic:
  extProc:
    backendRef: { name: redact-extproc, port: 4444 }
    processingOptions:
      requestHeaderMode:   Skip            # Send (default) | Skip
      responseHeaderMode:  Send
      requestBodyMode:     None            # FullDuplexStreamed (default) | Buffered | BufferedPartial | None
      responseBodyMode:    None            # Buffered modes cap at 8 KB
      requestTrailerMode:  Skip
      responseTrailerMode: Skip
      allowModeOverride:   false           # honour mode_override responses from the server
Body modes will default to FullDuplexStreamed, so SSE-streaming LLM responses keep working without explicit configuration. Check the agentgateway release notes for availability in your target version before relying on these fields.

Enterprise variant. EnterpriseAgentgatewayPolicy (group enterpriseagentgateway.solo.io/v1alpha1) wraps the same extProc shape and adds the conditional[] list at the policy level for CEL-gated backend switching. Field shape is identical to the OSS form on the same release.

kgateway

Three resources, in this order: a Service for the ExtProc backend, a GatewayExtension that describes the upstream + processingMode + failure mode, and a TrafficPolicy that wraps the extension and gets attached to specific routes via an ExtensionRef filter. processingMode here is Envoy-native and uses UPPERCASE enum values (NONE / STREAMED / BUFFERED / BUFFERED_PARTIAL). The full surface is configurable, including the response-body buffering behaviour that needs explicit attention on streaming endpoints; see §6 for the recommended defaults.

1 / 3 Deploy the ExtProc gRPC service

A plain Deployment plus Service (only the Service is shown here). appProtocol: kubernetes.io/h2c tells kgateway to speak cleartext HTTP/2 to the backend, which gRPC requires.

apiVersion: v1
kind: Service
metadata:
  name: redact-extproc
  namespace: kgateway-system
spec:
  selector: { app: redact-extproc }
  ports:
  - port: 4444
    targetPort: 18080
    protocol: TCP
    appProtocol: kubernetes.io/h2c

2 / 3 GatewayExtension

One resource describes the upstream, the processingMode, and the failure mode.

apiVersion: gateway.kgateway.dev/v1alpha1
kind: GatewayExtension
metadata:
  name: redact
  namespace: kgateway-system
spec:
  type: ExtProc
  extProc:
    grpcService:
      backendRef:
        name: redact-extproc
        port: 4444
    # Scope the stream. Default-deny: turn on only what the server needs.
    processingMode:
      requestHeaderMode:  SKIP            # UPPERCASE enums (Envoy native)
      responseHeaderMode: SEND
      responseBodyMode:   NONE            # NONE | STREAMED | BUFFERED | BUFFERED_PARTIAL
    failOpen: true                        # if ExtProc is down, forward the request unmodified
    messageTimeout: 200ms                 # per-message deadline

3 / 3 EnterpriseKgatewayTrafficPolicy & attach

Wrap the extension in a policy, then reference the policy from an HTTPRoute filter so it scopes to specific routes (not the whole Gateway).

apiVersion: enterprisekgateway.solo.io/v1alpha1
kind: EnterpriseKgatewayTrafficPolicy
metadata:
  name: redact
  namespace: kgateway-system
spec:
  extProc:
    extensionRef:
      name: redact
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api
  namespace: kgateway-system
spec:
  parentRefs:
  - name: http
  rules:
  - matches:
    - path: { type: PathPrefix, value: /v1 }
    filters:
    - type: ExtensionRef
      extensionRef:
        group: enterprisekgateway.solo.io
        kind:  EnterpriseKgatewayTrafficPolicy
        name:  redact
    backendRefs:
    - name: my-app
      port: 80

4 · Sample, a runnable server

Smallest server that does something useful: strip any response header whose name looks like a credential, and surface the redaction as a counter header. Same logic, shown in Python and Go — pick whichever language fits your stack.

Demo What the filter does — before & after

Same upstream, same path. The only difference is whether the ExtProc filter is attached to the route.

# BEFORE — route /v1 with no ExtProc filter attached
$ curl -si http://gateway.local/v1/echo -H 'Host: api.local'
HTTP/1.1 200 OK
content-type: application/json
x-api-key: sk-live-7f9c2a1e4d8b
authorization: Bearer eyJhbGciOiJIUzI1NiIsInR...
x-internal-secret: rotate-me-2026
server: upstream/1.0

{"msg":"hi"}

# AFTER — same route with the redact ExtensionRef filter attached
$ curl -si http://gateway.local/v1/echo -H 'Host: api.local'
HTTP/1.1 200 OK
content-type: application/json
x-redacted-count: 3
server: upstream/1.0

{"msg":"hi"}

The three credential-shaped headers (x-api-key, authorization, x-internal-secret) are stripped before bytes leave the gateway, replaced with a single x-redacted-count: 3 so monitoring can alert when an upstream starts leaking secrets. The body is untouched (responseBodyMode: NONE in the GatewayExtension), so streaming responses pass through unchanged and no buffering latency is added.
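Before wiring up gRPC, the header logic can be checked on its own. The sketch below models it on a plain dict, with no Envoy types involved. One assumption is worth flagging: the pattern here includes authorization explicitly, because the match runs against the header name and the alternatives api-key, token, bearer, and secret do not appear in the name "authorization". The helper name redact_headers is hypothetical.

```python
import re

# Assumed pattern: credential-shaped header names, plus "authorization" spelled out.
KEY_RE = re.compile(r"(api[-_]?key|token|bearer|secret|authorization)", re.I)


def redact_headers(headers: dict) -> dict:
    """Drop credential-shaped headers and report how many were removed."""
    kept = {name: value for name, value in headers.items()
            if not KEY_RE.search(name)}
    removed = len(headers) - len(kept)
    if removed:
        # Surface the redaction so monitoring can alert on leaking upstreams.
        kept["x-redacted-count"] = str(removed)
    return kept


upstream = {
    "content-type": "application/json",
    "x-api-key": "sk-live-7f9c2a1e4d8b",
    "authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR...",
    "x-internal-secret": "rotate-me-2026",
    "server": "upstream/1.0",
}
print(redact_headers(upstream))
# {'content-type': 'application/json', 'server': 'upstream/1.0', 'x-redacted-count': '3'}
```

Matching on names rather than values keeps the filter cheap, but it also means a credential in an innocently named header passes through untouched.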

Server extproc.py

import re
from concurrent import futures

import grpc
from envoy.service.ext_proc.v3 import (
    external_processor_pb2 as pb,
    external_processor_pb2_grpc as svc,
)
from envoy.config.core.v3.base_pb2 import HeaderValue

# Header names treated as credentials. "authorization" is listed explicitly:
# the regex runs on the header name, and no other alternative matches it.
KEY_RE = re.compile(r"(api[-_]?key|token|bearer|secret|authorization)", re.I)

class Proc(svc.ExternalProcessorServicer):
    def Process(self, request_iterator, ctx):
        # One stream per HTTP request; one reply per message Envoy sends.
        for req in request_iterator:
            resp = pb.ProcessingResponse()
            if req.HasField("response_headers"):
                # Mark the reply as a response_headers reply even when nothing
                # is removed; otherwise the ProcessingResponse goes out empty.
                resp.response_headers.SetInParent()
                rm = resp.response_headers.response.header_mutation
                removed = 0
                for h in req.response_headers.headers.headers:
                    if KEY_RE.search(h.key):
                        rm.remove_headers.append(h.key)
                        removed += 1
                if removed:
                    rm.set_headers.add(
                        header=HeaderValue(key="x-redacted-count",
                                           raw_value=str(removed).encode())
                    )
            # Only response_headers is handled; keep the other phases disabled.
            yield resp

if __name__ == "__main__":
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    svc.add_ExternalProcessorServicer_to_server(Proc(), server)
    server.add_insecure_port("0.0.0.0:18080")
    server.start()
    server.wait_for_termination()

Image Dockerfile

Single-stage Alpine build, ~80 MB final.

FROM python:3.12-alpine
RUN pip install --no-cache-dir grpcio envoy-extproc-sdk
COPY extproc.py /app/extproc.py
EXPOSE 18080
CMD ["python", "/app/extproc.py"]

Server extproc.go

Same behaviour, Go flavour. Single static binary, faster cold start than the Python build.

package main

import (
	"io"
	"log"
	"net"
	"regexp"
	"strconv"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extproc "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
)

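// "authorization" is listed explicitly: the match runs on the header name,
// and none of the other alternatives appear in it.
var keyRE = regexp.MustCompile(`(?i)(api[-_]?key|token|bearer|secret|authorization)`)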

type server struct {
	extproc.UnimplementedExternalProcessorServer
}

func (server) Process(stream extproc.ExternalProcessor_ProcessServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}

		resp := &extproc.ProcessingResponse{}
		if rh := req.GetResponseHeaders(); rh != nil {
			mut := &corev3.HeaderMutation{}
			removed := 0
			for _, h := range rh.GetHeaders().GetHeaders() {
				if keyRE.MatchString(h.GetKey()) {
					mut.RemoveHeaders = append(mut.RemoveHeaders, h.GetKey())
					removed++
				}
			}
			if removed > 0 {
				mut.SetHeaders = append(mut.SetHeaders, &corev3.HeaderValueOption{
					Header: &corev3.HeaderValue{
						Key:      "x-redacted-count",
						RawValue: []byte(strconv.Itoa(removed)),
					},
				})
			}
			resp.Response = &extproc.ProcessingResponse_ResponseHeaders{
				ResponseHeaders: &extproc.HeadersResponse{
					Response: &extproc.CommonResponse{HeaderMutation: mut},
				},
			}
		}
		if err := stream.Send(resp); err != nil {
			return err
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", "0.0.0.0:18080")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	extproc.RegisterExternalProcessorServer(s, server{})
	log.Println("extproc listening on :18080")
	log.Fatal(s.Serve(lis))
}

Image Dockerfile

Multi-stage build on distroless, ~12 MB final.

FROM golang:1.23-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/extproc .

FROM gcr.io/distroless/static-debian12
COPY --from=build /out/extproc /extproc
EXPOSE 18080
ENTRYPOINT ["/extproc"]

Build, push to any registry your cluster can reach, then point the Deployment behind the redact-extproc Service from §3 at the new image. The Service stays the same.

Why envoy-extproc-sdk (Python) and go-control-plane (Go)? They're the maintained protobuf bindings for envoy.service.ext_proc.v3, so you avoid wiring up the full Envoy protoc chain yourself. Plain grpcio / google.golang.org/grpc serves the stream; the bindings give you the message types.

5 · Verifying

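Apply the manifests, then curl through the gateway. With the redact filter attached to /v1, any response header whose name matches the credential pattern should disappear and an x-redacted-count header should appear instead. The count depends on what the upstream returns (3 for the §4 demo upstream), and the log lines below are illustrative: the sample servers in §4 do not log per-stream events unless you add that yourself.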

$ curl -si http://gateway.local/v1/echo \
    -H 'Host: api.local'
HTTP/1.1 200 OK
content-type: application/json
x-redacted-count: 2
...

$ kubectl -n kgateway-system logs deploy/redact-extproc
INFO: Process stream opened
INFO: response_headers: removed api-key, bearer-token (2)
INFO: Process stream closed

Compare against the same call on a route without the ExtensionRef filter: the credential headers come back unchanged. That's the cleanest before/after demo: same upstream, same path prefix, only the filter chain differs.

6 · The processingMode settings

This is where ExtProc deployments live or die. Wrong defaults on processingMode waste latency, break streaming, or starve the server. Read this once before shipping anything.

| Setting | Values | When to change the default |
| --- | --- | --- |
| requestHeaderMode | DEFAULT, SEND, SKIP | SKIP when the server only inspects responses. Saves one round-trip per request. |
| requestBodyMode | NONE, STREAMED, BUFFERED, BUFFERED_PARTIAL | STREAMED for large uploads. BUFFERED only if the server needs the whole body before deciding, and the body is small. BUFFERED_PARTIAL when you want to inspect the first N bytes (typical for content-type sniffing). |
| responseHeaderMode | DEFAULT, SEND, SKIP | SEND if you mutate response headers, SKIP otherwise. |
| responseBodyMode | NONE, STREAMED, BUFFERED, BUFFERED_PARTIAL | LLM endpoints: STREAMED only. BUFFERED defeats SSE and the client waits for the whole response, which is the most common ExtProc misconfiguration on streaming endpoints. |
| failOpen | true, false | true for observability filters (don't block traffic on the side-car being down), false for security filters (PII redaction, prompt-guard, signing). |
| messageTimeout | duration | Per-message deadline. Tight (50ms-200ms) for header-only servers, looser (1s+) for body-buffering servers that call out to slow upstreams. |
| maxMessageTimeout | duration, default 0s (off) | Upper bound on per-message timeout overrides the server can set via override_message_timeout. Enable if you trust the server to extend its own deadline. |
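The guidance in the table can be turned into a pre-deploy check. What follows is only a sketch: lint_ext_proc is a hypothetical helper that reads a dict shaped like the extProc block of a GatewayExtension, and the two flags describing the route and the filter's purpose are inputs you would supply yourself, not fields in any CRD.

```python
def lint_ext_proc(spec: dict, *, streaming_route: bool, security_filter: bool) -> list:
    """Return human-readable warnings for an extProc-style spec dict."""
    warnings = []
    mode = spec.get("processingMode", {})

    # Anything other than NONE or STREAMED holds response bytes back from the client.
    body_mode = mode.get("responseBodyMode", "NONE")
    if streaming_route and body_mode not in ("NONE", "STREAMED"):
        warnings.append("responseBodyMode " + body_mode + " buffers a streaming response")

    fail_open = spec.get("failOpen", False)
    if security_filter and fail_open:
        warnings.append("failOpen: true lets traffic bypass a security filter")
    if not security_filter and not fail_open:
        warnings.append("failOpen: false makes an observability filter a hard dependency")
    return warnings


risky = {"processingMode": {"responseBodyMode": "BUFFERED"}, "failOpen": True}
for warning in lint_ext_proc(risky, streaming_route=True, security_filter=True):
    print(warning)
```

A check like this fits naturally in CI next to the manifests, so a buffered response body on an SSE route is caught before it reaches a cluster.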

Watch out Response-body buffering on streaming LLMs

OpenAI-shape /v1/chat/completions with "stream": true returns SSE chunks. If responseBodyMode: BUFFERED is set, Envoy collects every chunk before sending anything to the client. The user sees a long pause, then the entire response at once. Cursor, Continue, and most LLM clients will time out.

Fix: set responseBodyMode: STREAMED and write the server to process each chunk independently. If you genuinely need the whole response before deciding (e.g. a moderation pass that requires final output), apply the policy only to the non-streaming route or non-streaming model variant.
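Processing chunks independently has one wrinkle: chunk boundaries do not respect line boundaries, so a pattern can arrive split across two chunks. A minimal sketch of one way to handle that is below. It holds back the incomplete last line and only rewrites complete lines, which suits SSE because events are newline-delimited. The class name LineRedactor and the secret shape in SECRET_RE are assumptions for illustration, and it works on str where a real server would see bytes.

```python
import re

# Hypothetical secret shape, used only for this illustration.
SECRET_RE = re.compile(r"sk-live-[0-9a-f]+")


class LineRedactor:
    """Redacts secrets from a text stream, one complete line at a time."""

    def __init__(self) -> None:
        self._tail = ""  # incomplete last line, carried over to the next chunk

    def feed(self, chunk: str) -> str:
        data = self._tail + chunk
        # Everything up to and including the last newline is safe to emit now.
        cut = data.rfind("\n") + 1
        ready, self._tail = data[:cut], data[cut:]
        return SECRET_RE.sub("[redacted]", ready)

    def flush(self) -> str:
        # Call once at end of stream to emit whatever is still held back.
        out, self._tail = SECRET_RE.sub("[redacted]", self._tail), ""
        return out


redactor = LineRedactor()
chunks = ['data: {"delta":"key is sk-li', 've-7f9c"}\n\n', "data: [DONE]\n\n"]
out = "".join(redactor.feed(c) for c in chunks) + redactor.flush()
print(out)
```

The client still receives output as soon as each line completes, so streaming is preserved. A secret that the model emits across several separate SSE events would not be caught by this approach; that case needs a wider window or a buffered, non-streaming route.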

7 · When to reach for ExtProc vs ExtAuth

The two filters look superficially similar (gRPC sidecar called by Envoy on every request), but they do different jobs at different points in the filter chain. Pick wrong and you either pay for capability you don't need, or you write code to do something a stock config already handles.

The 30-second heuristic. Does the decision depend on the request body, or do you need to touch the response? → ExtProc.
Is it allow/deny (with optional header injection) based on request attributes only? → ExtAuth, and you can almost certainly do it with stock AuthConfig plugins, no code.
Both apply? → run them both. ExtAuth first, ExtProc second.
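The heuristic is small enough to write down as a function. This is just a restatement of the three rules above in code form; pick_filters and its parameter names are hypothetical.

```python
def pick_filters(*, needs_request_body: bool, touches_response: bool,
                 attribute_allow_deny: bool) -> list:
    """Return the filters to attach, in the order the chain runs them."""
    filters = []
    # Allow/deny on request attributes alone is ExtAuth's job.
    if attribute_allow_deny:
        filters.append("ExtAuth")
    # Anything that reads the body or touches the response needs ExtProc.
    if needs_request_body or touches_response:
        filters.append("ExtProc")
    return filters


# JWT validation only.
print(pick_filters(needs_request_body=False, touches_response=False,
                   attribute_allow_deny=True))
# JWT validation plus response redaction.
print(pick_filters(needs_request_body=False, touches_response=True,
                   attribute_allow_deny=True))
```

An empty result means neither filter is needed and plain route configuration is probably enough.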
[Figure: one request through the gateway data plane (agentgateway / kgateway). The client sends POST /v1/chat with a Bearer token; the filter chain runs ExtAuth, then ExtProc, then the upstream (MCP, LLM, or API). ExtAuth: one Check() RPC that decides allow (forward and inject headers) or deny (short-circuit with 401/403, upstream never called); it sees request attributes (method, path, headers, peer) and never the body or the response; ships with OIDC, OAuth2, OPA, JWT, API-key, and LDAP as config, no code. ExtProc: a bidirectional stream with up to 6 phases; it can mutate request and response headers, body, and trailers on every enabled phase, sees the full request body (streamed or buffered) and the full response (SSE-aware), can short-circuit at any phase and rewrite route or weights; ships with nothing, so you bring your own gRPC server (Python / Go / Rust). The response passes back through ExtProc and bypasses ExtAuth. Legend: cyan = ExtAuth path, purple = ExtProc path, amber = denied / not supported, dashed = response or short-circuit.]

How to read it. One request, two filters. ExtAuth runs first — a single Check() RPC over request attributes. On deny, the gateway short-circuits and the upstream never sees the call (amber dashed). On allow, ExtAuth optionally injects headers and hands off to ExtProc, which opens a bidirectional gRPC stream and can read/mutate any enabled phase on the way out and on the way back. The response only ever goes through ExtProc, never ExtAuth — that's why response-body redaction and LLM-streaming filters have to be ExtProc.

The capability matrix

| Dimension | ExtAuth | ExtProc |
| --- | --- | --- |
| Position in chain | Before route selection, before the body is read | Through the request/response lifecycle, after route selection |
| Round-trips per request | Exactly 1 (single Check RPC) | Up to 6 (one per enabled processingMode phase) |
| Sees the request body | No (attributes only: method, path, headers, peer) | Yes, buffered or streamed |
| Sees the response | No | Yes: headers, body, trailers, streamed if you want |
| Can mutate headers | Yes, on the request only (via OkResponse) | Yes, on every enabled phase, both directions |
| Can mutate the body | No | Yes |
| Can short-circuit the request | Yes (DeniedResponse; that's its whole job) | Yes (ImmediateResponse at any phase) |
| Ships with out-of-the-box logic | Yes. Solo's ext-auth-service covers OIDC, OAuth2, OPA, API key, basic auth, LDAP, JWT, passthrough, all driven by AuthConfig YAML, no code | No. Always a BYO gRPC server (your Python / Go / Rust / whatever). |
| Typical added latency | 1 sidecar hop, usually <5 ms | 1 hop per enabled phase. Cheap if you only enable response-headers, expensive if you stream the body |
| Solo CRD wrapper | AuthConfig + RouteOption / EnterpriseKgatewayTrafficPolicy.spec.extAuth | GatewayExtension (kgateway) or direct backendRef (agentgateway) + the same traffic-policy CRDs |

Use-case picker

| Need | Use | Why |
| --- | --- | --- |
| Validate a JWT, check an OAuth scope, decide allow/deny | ExtAuth | One call, no body access needed. The AuthConfig CRD already covers this without writing code. |
| Inject verified-claim headers (e.g. x-tenant-id) | ExtAuth | Header injection is the response shape ExtAuth's OkResponse is designed for. See JWT claims to headers. |
| Redact secrets from a response body | ExtProc | Needs body access. ExtAuth can't see the body, only request attributes. |
| Inspect LLM output for PII or prompt-injection markers | ExtProc | Streamed response body. The Solo agentgateway prompt-guard filter is internally an ExtProc-shaped service. |
| Semantic cache lookup keyed off the request body | ExtProc | Need to read the request body, possibly short-circuit with an immediate response from cache. |
| Sign a response, or strip a trailer based on body content | ExtProc | Trailer mutation and body inspection both need the streaming protocol. |
| Route to cluster A vs B based on a request-body field | ExtProc | Routing decisions on request-body content aren't possible in plain HTTPRoute matchers. |
| Aggregate two upstream responses into one | ExtProc | The Solo KB API Aggregation use case. |
| OPA-policy decision over the full request context | ExtAuth+OPA plugin | Solo's ext-auth-service ships an OPA plugin out of the box, no ExtProc needed unless you also want body mutation. See Solo ext-auth-service. |

Both at once? Yes, the filter chain runs ExtAuth first, then ExtProc. ExtAuth gates whether the request continues, ExtProc mutates what's left. Common pattern for AI gateways: ExtAuth validates the JWT and lifts x-user-id into a header, ExtProc reads that header in its request-body phase and decides whether to allow the prompt through, redact it, or reject with an immediate response.

Where to go next