External Processing (ExtProc): streaming request/response mutation on kgateway and agentgateway

ExtAuth answers one question: should this request proceed? ExtProc answers a different one: what should this request or response actually look like? Same external-service pattern, very different surface area. This page walks through the six processing phases, the GatewayExtension + traffic-policy wiring on kgateway and the single-policy wiring on agentgateway, and a small Python (and Go) server you can deploy in a kind cluster in under a minute.

Companion reference: Solo external auth service, the bouncer at the door. ExtProc is what runs after the bouncer waves the request through.

1 · What ExtProc can do

ExtProc is defined by envoy.service.ext_proc.v3.ExternalProcessor. Envoy holds a single bidirectional gRPC stream open per HTTP request and sends a ProcessingRequest message at each enabled phase. Your server replies with a ProcessingResponse that can carry a header mutation, a body mutation, an immediate response (terminate the request with a status code), or a dynamic-metadata write. The stream stays open until the request ends.
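In practice the pairing rule is what trips people up: each ProcessingRequest carries exactly one phase, and the reply has to answer that same phase (or replace it with an immediate response). Below is a minimal sketch of that dispatch. It assumes plain dicts keyed by the proto field names rather than the generated protobuf classes, and `handle` and its `blocked` flag are hypothetical names for illustration.

```python
# Dicts stand in for envoy.service.ext_proc.v3 messages. The keys mirror the
# proto field names, but this is NOT the generated protobuf API.
PHASES = (
    "request_headers", "request_body", "request_trailers",
    "response_headers", "response_body", "response_trailers",
)


def handle(req: dict, blocked: bool = False) -> dict:
    """Build the reply for one ProcessingRequest on the stream."""
    # Exactly one phase field is set per message (it is a proto oneof).
    phase = next(p for p in PHASES if p in req)
    if blocked:
        # Short-circuit: Envoy answers the client directly with this status.
        return {"immediate_response": {"status": {"code": 403},
                                       "body": b"blocked by policy"}}
    # Pass-through: answer the same phase with an empty (no-mutation) reply.
    return {phase: {}}
```

A real server runs this once per message inside its Process loop. Here `blocked` stands in for whatever guardrail decision the server makes.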

Capability	Direction	Notes
Add, set, remove, append headers	req & resp	Same shape as ExtAuth's header mutation, but on every phase.
Replace the body	req & resp	Buffered or streamed. The buffer cap is configurable on the gateway.
Inspect trailers	req & resp	Useful for HTTP/2 gRPC where status arrives in trailers.
Immediate response	any phase	Terminate the request from the ExtProc server (status, headers, body). The way to short-circuit on a guardrail hit.
Dynamic metadata	any phase	Write structured metadata that downstream filters (auth, rate-limit, telemetry) can read.
Dynamic routing decisions	request headers / body	Mutate headers and set clear_route_cache so Envoy re-selects the route against the new headers before forwarding.
Async observability	any phase	Just acknowledge the message, do the work in the background. The stream stays open and Envoy keeps going.

vs ExtAuth. ExtAuth runs before routing, gets request attributes, returns OK or Denied with optional header changes, then it's done. ExtProc runs through the lifecycle, gets bidirectional streaming, sees the body, and can mutate the response. If you need to touch the body, you need ExtProc. If you need to allow or deny, ExtAuth is cheaper.

2 · The processing flow

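Envoy can send your server up to six phases per request. On kgateway you opt into each phase via processingMode on the GatewayExtension (agentgateway currently sends every phase; see §3). Skip the phases you don't need: every enabled phase costs at least one gRPC round-trip, and a STREAMED body costs one per chunk.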
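To see what each setting costs, here is a rough model of exchanges per request. It assumes Envoy's snake_case ProcessingMode keys (kgateway's YAML uses camelCase) and falls back to Envoy's own defaults for missing keys. `exchanges` is a hypothetical helper, and the model ignores mode overrides.

```python
def exchanges(mode: dict, req_chunks: int = 1, resp_chunks: int = 1,
              req_trailers: bool = False, resp_trailers: bool = False) -> int:
    """Count ProcessingRequest/ProcessingResponse exchanges for one request.

    `mode` uses Envoy's snake_case ProcessingMode keys; missing keys fall back
    to Envoy's defaults (headers SEND, bodies NONE, trailers SKIP).
    Simplified: STREAMED is one message per body chunk, BUFFERED and
    BUFFERED_PARTIAL are one message, and trailers count only if present.
    """
    def headers(key: str) -> int:
        # DEFAULT resolves to SEND for both header phases.
        return int(mode.get(key, "SEND") in ("SEND", "DEFAULT"))

    def body(key: str, chunks: int) -> int:
        setting = mode.get(key, "NONE")
        if setting == "STREAMED":
            return chunks
        return int(setting in ("BUFFERED", "BUFFERED_PARTIAL"))

    def trailers(key: str, present: bool) -> int:
        # DEFAULT resolves to SKIP for trailers; only SEND counts.
        return int(present and mode.get(key, "SKIP") == "SEND")

    return (headers("request_header_mode")
            + body("request_body_mode", req_chunks)
            + trailers("request_trailer_mode", req_trailers)
            + headers("response_header_mode")
            + body("response_body_mode", resp_chunks)
            + trailers("response_trailer_mode", resp_trailers))
```

For the §3 kgateway example (request headers SKIP, response headers SEND, response body NONE), this returns 1. A STREAMED response body on a 40-chunk completion with default header modes returns 42.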

REQ Request headers `requestHeaderMode`

Fires after Envoy parses the request line and headers, before anything is forwarded upstream. Your reply can mutate headers (optionally clearing the route cache so Envoy re-selects the route against the new headers) or short-circuit with an immediate response.

REQ Request body `requestBodyMode`

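BUFFERED collects the whole body and sends it in one message; if the body exceeds the configured buffer limit, the downstream client receives an error instead. STREAMED sends the body chunk by chunk as it arrives. BUFFERED_PARTIAL buffers up to the configured limit and sends that much, and the remainder is not sent to the server. NONE skips the phase.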
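Here is a simplified model of what the server receives under each body mode. It assumes `cap` stands in for the gateway's configured buffer limit, and it ignores flow control and the exact error Envoy returns on overflow. `body_messages` is a hypothetical helper, not part of any SDK.

```python
def body_messages(chunks: list, mode: str, cap: int) -> list:
    """Simplified model of how a body reaches the ExtProc server.

    Returns (body, end_of_stream) pairs, one per message sent to the server.
    """
    if mode == "NONE":
        return []
    if mode == "STREAMED":
        # One message per chunk, as the chunks arrive.
        return [(c, i == len(chunks) - 1) for i, c in enumerate(chunks)]
    buf = b""
    for c in chunks:
        buf += c
        if len(buf) > cap:
            if mode == "BUFFERED":
                # Envoy fails the request rather than send a truncated body.
                raise ValueError("body exceeds buffer limit")
            # BUFFERED_PARTIAL: send what was buffered; the rest bypasses the server.
            return [(buf, False)]
    # Whole body fit under the limit: a single message with end_of_stream set.
    return [(buf, True)]
```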

REQ Request trailers `requestTrailerMode`

Only fires when the request has trailers (HTTP/2 gRPC, chunked uploads with trailers).

RESP Response headers `responseHeaderMode`

Fires after the upstream responds, before bytes are sent downstream. Same mutation surface as the request side.

RESP Response body `responseBodyMode`

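This is the LLM-streaming phase. STREAMED lets you act on SSE chunks as they arrive. Note that body chunks follow network boundaries, not SSE event boundaries, so one event can span two chunks. BUFFERED defeats streaming: the client sees nothing until the whole response is collected.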
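Because chunk boundaries and event boundaries differ, a STREAMED response-body handler usually keeps a small per-stream buffer and acts only on complete events. Below is a minimal sketch. It assumes OpenAI-style SSE with `\n\n` event terminators (the SSE format also allows CRLF, which this ignores), and `SSEReassembler` is a hypothetical name. One instance would live for the lifetime of a single ExtProc stream.

```python
from typing import Optional


class SSEReassembler:
    """Rebuild whole SSE events from STREAMED response-body chunks.

    Body chunks follow network/flow-control boundaries, not event
    boundaries, so an event (terminated by a blank line) can be split
    across chunks.
    """

    def __init__(self) -> None:
        self._pending = b""

    def feed(self, chunk: bytes) -> list:
        """Add one chunk; return the events it completed, in order."""
        self._pending += chunk
        # Everything before the last terminator is complete; keep the tail.
        *done, self._pending = self._pending.split(b"\n\n")
        return [e.decode("utf-8") for e in done if e]

    def flush(self) -> Optional[str]:
        """Call on end_of_stream: return any trailing unterminated event."""
        rest, self._pending = self._pending, b""
        return rest.decode("utf-8") or None
```

Decoding only complete events also avoids splitting a multi-byte UTF-8 character that straddles two chunks, since the `\n\n` terminator can never fall inside one.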

RESP Response trailers `responseTrailerMode`

gRPC status lives here. Inspect to record success/failure metrics.

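Each enabled phase is a separate ProcessingRequest / ProcessingResponse exchange on the same stream. The server can also tell Envoy to skip the remaining phases for this request by setting mode_override in its reply. This is useful when the request-headers phase has already shown that nothing else needs inspection. Envoy honours mode_override only in a reply to a headers phase, and only when the filter allows it (allow_mode_override).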
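Here is a sketch of that early exit. It assumes plain dicts that mirror the proto field names, and a hypothetical policy that only inspects JSON bodies. `reply_to_request_headers` is an illustrative name, and headers are simplified to a lowercase-keyed dict.

```python
def reply_to_request_headers(headers: dict) -> dict:
    """Answer the request-headers phase, and skip both body phases when the
    request can't contain anything this (hypothetical) guardrail inspects."""
    reply = {"request_headers": {}}  # pass the headers through unmodified
    if not headers.get("content-type", "").startswith("application/json"):
        # Honoured only on a headers-phase reply, and only when the filter
        # config sets allow_mode_override; otherwise Envoy ignores it.
        reply["mode_override"] = {
            "request_body_mode": "NONE",
            "response_body_mode": "NONE",
        }
    return reply
```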

Default-deny on phases. Start with everything set to SKIP / NONE and turn on only what you need. A response-body filter set to BUFFERED on a streaming LLM endpoint will silently break SSE and add seconds of latency before anyone notices. Be explicit.
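Worth knowing before you rely on defaults: Envoy's own ProcessingMode defaults are not default-deny. If nothing else overrides them, both header phases are still sent when processingMode is left unset. The sketch below resolves a partial mode against either Envoy's defaults or an all-off baseline. It assumes snake_case Envoy field names, and `effective_mode` and `enabled_phases` are hypothetical helpers.

```python
# Envoy's ProcessingMode defaults (what DEFAULT resolves to for each field).
ENVOY_DEFAULTS = {
    "request_header_mode": "SEND",
    "response_header_mode": "SEND",
    "request_body_mode": "NONE",
    "response_body_mode": "NONE",
    "request_trailer_mode": "SKIP",
    "response_trailer_mode": "SKIP",
}

# The default-deny starting point recommended above: nothing is sent.
DENY_ALL = {k: ("NONE" if "body" in k else "SKIP") for k in ENVOY_DEFAULTS}


def effective_mode(configured: dict, base: dict = ENVOY_DEFAULTS) -> dict:
    """Resolve a partial processingMode against a base, mapping DEFAULT."""
    mode = dict(base)  # copy so the base dict is never mutated
    for key, value in configured.items():
        mode[key] = ENVOY_DEFAULTS[key] if value == "DEFAULT" else value
    return mode


def enabled_phases(mode: dict) -> list:
    """Phases that will produce at least one message to the server."""
    return [k.removesuffix("_mode") for k, v in mode.items()
            if v not in ("SKIP", "NONE")]
```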

3 · How it's wired in each product

Both products expose the ExtProc gRPC backend through a traffic policy scoped to an HTTPRoute. agentgateway folds everything (backend + scope) into one AgentgatewayPolicy and ships with sensible streaming defaults. kgateway splits the wiring across a GatewayExtension and a traffic policy, and gives you full per-phase control via Envoy-native processingMode. Each product has its own knobs (or absence of knobs) to be aware of, so read the part for the product you're on.

agentgateway

One CRD does the whole wiring. AgentgatewayPolicy (group agentgateway.dev/v1alpha1, the OSS form) attaches to an HTTPRoute via targetRefs and points at your ExtProc Service with a plain backendRef. As of v2026.5.1, the extProc block accepts backendRef and an optional conditional[] for CEL-based backend switching. There is no per-phase opt-in/skip field in this release: the gateway streams every phase to the ExtProc server by default, so your server must answer every phase, even the ones it ignores. Setup is minimal, and response-body buffering on streaming endpoints cannot occur as a misconfiguration. The trade-off is that phases you don't need cannot yet be skipped on the policy; see the "On the roadmap" callout below.

apiVersion: v1
kind: Service
metadata:
  name: redact-extproc
  namespace: agentgateway-system
spec:
  selector: { app: redact-extproc }
  ports:
  - { port: 4444, targetPort: 18080, appProtocol: kubernetes.io/h2c }
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: redact
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind:  HTTPRoute
    name:  llm-openai
  traffic:
    extProc:
      backendRef:
        name: redact-extproc
        port: 4444
      # No processingMode equivalent ships today; see the roadmap preview below.

On the roadmap. A processingOptions block under extProc is planned for a future release, providing the agentgateway analogue of kgateway's processingMode with PascalCase enum values:

traffic:
  extProc:
    backendRef: { name: redact-extproc, port: 4444 }
    processingOptions:
      requestHeaderMode:   Skip            # Send (default) | Skip
      responseHeaderMode:  Send
      requestBodyMode:     None            # FullDuplexStreamed (default) | Buffered | BufferedPartial | None
      responseBodyMode:    None            # Buffered modes cap at 8 KB
      requestTrailerMode:  Skip
      responseTrailerMode: Skip
      allowModeOverride:   false           # honour mode_override responses from the server

Body modes will default to FullDuplexStreamed, so SSE-streaming LLM responses keep working without explicit configuration. Check the agentgateway release notes for availability in your target version before relying on these fields.

Enterprise variant. EnterpriseAgentgatewayPolicy (group enterpriseagentgateway.solo.io/v1alpha1) wraps the same extProc shape and adds the conditional[] list at the policy level for CEL-gated backend switching. Field shape is identical to the OSS form on the same release.

kgateway

Three resources, in this order: a Service for the ExtProc backend, a GatewayExtension that describes the upstream, processingMode and failure mode, and a traffic policy (EnterpriseKgatewayTrafficPolicy in the examples below) that wraps the extension and is attached to specific routes via an ExtensionRef filter. processingMode here is Envoy-native and uses UPPERCASE enum values: SEND / SKIP for the header modes, NONE / STREAMED / BUFFERED / BUFFERED_PARTIAL for the body modes. The full surface is configurable, including the response-body buffering behaviour that needs explicit attention on streaming endpoints; see §6 for the recommended defaults.

1 / 3 Deploy the ExtProc gRPC service

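A plain Deployment (not shown: any pods labelled app: redact-extproc and listening on 18080) plus the Service below, with appProtocol: kubernetes.io/h2c so kgateway speaks cleartext HTTP/2 to it.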

apiVersion: v1
kind: Service
metadata:
  name: redact-extproc
  namespace: kgateway-system
spec:
  selector: { app: redact-extproc }
  ports:
  - port: 4444
    targetPort: 18080
    protocol: TCP
    appProtocol: kubernetes.io/h2c

2 / 3 GatewayExtension

One resource describes the upstream, the processingMode, and the failure mode.

apiVersion: gateway.kgateway.dev/v1alpha1
kind: GatewayExtension
metadata:
  name: redact
  namespace: kgateway-system
spec:
  type: ExtProc
  extProc:
    grpcService:
      backendRef:
        name: redact-extproc
        port: 4444
    # Scope the stream. Default-deny: turn on only what the server needs.
    processingMode:
      requestHeaderMode:  SKIP            # UPPERCASE enums (Envoy native)
      responseHeaderMode: SEND
      responseBodyMode:   NONE            # NONE | STREAMED | BUFFERED | BUFFERED_PARTIAL
    failOpen: false                       # security filter (see §6): if ExtProc is down, fail instead of leaking headers
    messageTimeout: 200ms                 # per-message deadline

3 / 3 EnterpriseKgatewayTrafficPolicy & attach

Wrap the extension in a policy, then reference the policy from an HTTPRoute filter so it scopes to specific routes (not the whole Gateway).

apiVersion: enterprisekgateway.solo.io/v1alpha1
kind: EnterpriseKgatewayTrafficPolicy
metadata:
  name: redact
  namespace: kgateway-system
spec:
  extProc:
    extensionRef:
      name: redact
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api
  namespace: kgateway-system
spec:
  parentRefs:
  - name: http
  rules:
  - matches:
    - path: { type: PathPrefix, value: /v1 }
    filters:
    - type: ExtensionRef
      extensionRef:
        group: enterprisekgateway.solo.io
        kind:  EnterpriseKgatewayTrafficPolicy
        name:  redact
    backendRefs:
    - name: my-app
      port: 80

4 · Sample: a runnable server

The smallest server that does something useful: it strips any response header whose name looks like a credential and reports the redaction in a counter header. The same logic is shown in Python and Go; pick whichever language fits your stack.

Demo What the filter does — before & after

Same upstream, same path. The only difference is whether the ExtProc filter is attached to the route.

# BEFORE — route /v1 with no ExtProc filter attached
$ curl -si http://gateway.local/v1/echo -H 'Host: api.local'
HTTP/1.1 200 OK
content-type: application/json
x-api-key: sk-live-7f9c2a1e4d8b
authorization: Bearer eyJhbGciOiJIUzI1NiIsInR...
x-internal-secret: rotate-me-2026
server: upstream/1.0

{"msg":"hi"}

# AFTER — same route with the redact ExtensionRef filter attached
$ curl -si http://gateway.local/v1/echo -H 'Host: api.local'
HTTP/1.1 200 OK
content-type: application/json
x-redacted-count: 3
server: upstream/1.0

{"msg":"hi"}

The three credential-shaped headers (x-api-key, authorization, x-internal-secret) are stripped before bytes leave the gateway, replaced with a single x-redacted-count: 3 so monitoring can alert when an upstream starts leaking secrets. The body is untouched (responseBodyMode: NONE in the GatewayExtension), so streaming responses pass through unchanged and no buffering latency is added.
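The matching rule is easy to test without Envoy in the loop. Below is a stdlib-only sketch of the same decision. It assumes headers arrive as (name, value) pairs rather than protobuf HeaderValue messages, and `redact` is a hypothetical helper. Note that the pattern includes `authorization`: that name contains none of the other substrings, so a pattern without it would let the demo's Authorization header through.

```python
import re

# Credential-shaped header names, matched case-insensitively anywhere in the name.
KEY_RE = re.compile(r"(api[-_]?key|authorization|token|bearer|secret)", re.I)


def redact(headers: list) -> tuple:
    """Return (header names to remove, headers to set) for one response."""
    remove = [name for name, _ in headers if KEY_RE.search(name)]
    # Only advertise a count when something was actually removed.
    set_headers = {"x-redacted-count": str(len(remove))} if remove else {}
    return remove, set_headers
```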

Server `extproc.py`

import grpc, re
from concurrent import futures
from envoy.service.ext_proc.v3 import (
    external_processor_pb2 as pb,
    external_processor_pb2_grpc as svc,
)
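from envoy.config.core.v3.base_pb2 import HeaderValue

# Credential-shaped header names. "authorization" is listed explicitly because
# it contains none of the other substrings.
KEY_RE = re.compile(r"(api[-_]?key|authorization|token|bearer|secret)", re.I)

# Phases this server never mutates; each still gets a same-phase reply.
PASSTHROUGH = ("request_headers", "request_body", "request_trailers",
               "response_body", "response_trailers")

class Proc(svc.ExternalProcessorServicer):
    def Process(self, request_iterator, ctx):
        # One stream per HTTP request; exactly one reply per ProcessingRequest.
        for req in request_iterator:
            resp = pb.ProcessingResponse()
            phase = req.WhichOneof("request")
            if phase == "response_headers":
                # Mark the phase as answered even if nothing gets removed.
                resp.response_headers.SetInParent()
                rm = resp.response_headers.response.header_mutation
                removed = 0
                for h in req.response_headers.headers.headers:
                    if KEY_RE.search(h.key):
                        rm.remove_headers.append(h.key)
                        removed += 1
                if removed:
                    rm.set_headers.add(
                        header=HeaderValue(key="x-redacted-count",
                                           raw_value=str(removed).encode())
                    )
            elif phase in PASSTHROUGH:
                # An empty reply for the same phase means "continue unmodified".
                getattr(resp, phase).SetInParent()
            yield resp

if __name__ == "__main__":
    # Each open stream holds a worker thread, so this caps concurrent requests at 8.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    svc.add_ExternalProcessorServicer_to_server(Proc(), server)
    server.add_insecure_port("0.0.0.0:18080")
    server.start()
    server.wait_for_termination()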

Image `Dockerfile`

Single-stage Alpine build, ~80 MB final.

FROM python:3.12-alpine
RUN pip install --no-cache-dir grpcio envoy-extproc-sdk
COPY extproc.py /app/extproc.py
EXPOSE 18080
CMD ["python", "/app/extproc.py"]

Server `extproc.go`

Same behaviour, Go flavour. Single static binary, faster cold start than the Python build.

package main

import (
	"io"
	"log"
	"net"
	"regexp"
	"strconv"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extproc "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
)

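// Credential-shaped header names. "authorization" is listed explicitly
// because it contains none of the other substrings.
var keyRE = regexp.MustCompile(`(?i)(api[-_]?key|authorization|token|bearer|secret)`)

type server struct {
	extproc.UnimplementedExternalProcessorServer
}

// redact builds a mutation that removes credential-shaped headers and
// reports how many were removed in x-redacted-count.
// Note: CommonResponse takes the ext_proc HeaderMutation, not corev3's.
func redact(hdrs *extproc.HttpHeaders) *extproc.HeaderMutation {
	mut := &extproc.HeaderMutation{}
	for _, h := range hdrs.GetHeaders().GetHeaders() {
		if keyRE.MatchString(h.GetKey()) {
			mut.RemoveHeaders = append(mut.RemoveHeaders, h.GetKey())
		}
	}
	if n := len(mut.RemoveHeaders); n > 0 {
		mut.SetHeaders = append(mut.SetHeaders, &corev3.HeaderValueOption{
			Header: &corev3.HeaderValue{
				Key:      "x-redacted-count",
				RawValue: []byte(strconv.Itoa(n)),
			},
		})
	}
	return mut
}

func (server) Process(stream extproc.ExternalProcessor_ProcessServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}

		// Answer every message with the response type for its phase;
		// phases we don't inspect get an empty (pass-through) reply.
		resp := &extproc.ProcessingResponse{}
		switch v := req.GetRequest().(type) {
		case *extproc.ProcessingRequest_RequestHeaders:
			resp.Response = &extproc.ProcessingResponse_RequestHeaders{RequestHeaders: &extproc.HeadersResponse{}}
		case *extproc.ProcessingRequest_RequestBody:
			resp.Response = &extproc.ProcessingResponse_RequestBody{RequestBody: &extproc.BodyResponse{}}
		case *extproc.ProcessingRequest_RequestTrailers:
			resp.Response = &extproc.ProcessingResponse_RequestTrailers{RequestTrailers: &extproc.TrailersResponse{}}
		case *extproc.ProcessingRequest_ResponseHeaders:
			resp.Response = &extproc.ProcessingResponse_ResponseHeaders{
				ResponseHeaders: &extproc.HeadersResponse{
					Response: &extproc.CommonResponse{HeaderMutation: redact(v.ResponseHeaders)},
				},
			}
		case *extproc.ProcessingRequest_ResponseBody:
			resp.Response = &extproc.ProcessingResponse_ResponseBody{ResponseBody: &extproc.BodyResponse{}}
		case *extproc.ProcessingRequest_ResponseTrailers:
			resp.Response = &extproc.ProcessingResponse_ResponseTrailers{ResponseTrailers: &extproc.TrailersResponse{}}
		}
		if err := stream.Send(resp); err != nil {
			return err
		}
	}
}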

func main() {
	lis, err := net.Listen("tcp", "0.0.0.0:18080")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	extproc.RegisterExternalProcessorServer(s, server{})
	log.Println("extproc listening on :18080")
	log.Fatal(s.Serve(lis))
}

Image `Dockerfile`

Multi-stage build on distroless, ~12 MB final.

FROM golang:1.23-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/extproc .

FROM gcr.io/distroless/static-debian12
COPY --from=build /out/extproc /extproc
EXPOSE 18080
ENTRYPOINT ["/extproc"]

Build the image, push it to any registry your cluster can reach, then point the Deployment behind the §3 Service (the pods labelled app: redact-extproc) at the new image. The Service stays the same.

Why envoy-extproc-sdk (Python) and go-control-plane (Go)? They're the maintained protobuf bindings for envoy.service.ext_proc.v3, so you avoid wiring up the full Envoy protoc chain yourself. Plain grpcio / google.golang.org/grpc serves the stream; the SDK gives you the message types.

5 · Verifying

Apply the resources from §3, then curl through the gateway. With the redact filter attached to /v1, any response header whose name matches the credential pattern should disappear, and an x-redacted-count header should appear instead.

$ curl -si http://gateway.local/v1/echo \
    -H 'Host: api.local'
HTTP/1.1 200 OK
content-type: application/json
x-redacted-count: 2
...

$ kubectl -n kgateway-system logs deploy/redact-extproc
INFO: Process stream opened
INFO: response_headers: removed api-key, bearer-token (2)
INFO: Process stream closed

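Compare against the same call on a route without the ExtensionRef filter: the credential headers come back unchanged. That is the cleanest before/after demo, since the upstream and path prefix are the same and only the filter chain differs. The log lines above are illustrative. The §4 servers don't log per stream as written, so add logging inside Process if you want that output.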
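To script that comparison instead of eyeballing curl output, a small stdlib-only check works. It assumes the same gateway URL and Host header as the curl examples, plus a pattern kept in sync with your server's. `leaked` and `fetch_headers` are hypothetical helpers.

```python
import re
import urllib.request

# Credential-shaped names (assumption: keep this in sync with the pattern
# your ExtProc server actually redacts).
LEAKY = re.compile(r"(api[-_]?key|authorization|token|bearer|secret)", re.I)


def leaked(headers: dict) -> list:
    """Names of credential-shaped headers that survived the filter."""
    return [name for name in headers if LEAKY.search(name)]


def fetch_headers(url: str, host: str) -> dict:
    """GET url with a Host override and return lowercase response headers."""
    req = urllib.request.Request(url, headers={"Host": host})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return {k.lower(): v for k, v in resp.getheaders()}
```

Against the filtered route, `leaked(fetch_headers("http://gateway.local/v1/echo", "api.local"))` should return an empty list. Against the unfiltered route, it should name the credential headers.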

6 · The processingMode settings

This is where ExtProc deployments live or die. Wrong defaults on processingMode waste latency, break streaming, or starve the server. Read this once before shipping anything.

Setting	Values	When to change the default
`requestHeaderMode`	`DEFAULT` `SEND` `SKIP`	`SKIP` when the server only inspects responses. Saves one round-trip per request.
`requestBodyMode`	`NONE` `STREAMED` `BUFFERED` `BUFFERED_PARTIAL`	`STREAMED` for large uploads. `BUFFERED` only if the server needs the whole body before deciding, and the body is small. `BUFFERED_PARTIAL` when you want to inspect the first N bytes (typical for content-type sniffing).
`responseHeaderMode`	`DEFAULT` `SEND` `SKIP`	`SEND` if you mutate response headers, `SKIP` otherwise.
`responseBodyMode`	`NONE` `STREAMED` `BUFFERED` `BUFFERED_PARTIAL`	LLM endpoints: `STREAMED` only. `BUFFERED` defeats SSE and the client waits for the whole response — the most common ExtProc misconfiguration on streaming endpoints.
`failOpen`	`true` `false`	`true` for observability filters (don't block traffic when the ExtProc service is down), `false` for security filters (PII redaction, prompt-guard, signing).
`messageTimeout`	duration	Per-message deadline. Tight (50ms-200ms) for header-only servers, looser (1s+) for body-buffering servers that call out to slow upstreams.
`maxMessageTimeout`	duration, default `0s` (off)	Upper bound on per-message timeout overrides the server can set via `override_message_timeout`. Enable if you trust the server to extend its own deadline.
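The failOpen trade-off is easier to reason about with the failure path spelled out. Below is a simplified model. It assumes kgateway's failOpen maps onto Envoy's failure_mode_allow: with that setting off, Envoy returns a 500 if response headers have not yet gone downstream, and resets the stream if they have. `on_extproc_error` is a hypothetical helper, and message timeouts are left out.

```python
def on_extproc_error(fail_open: bool, response_headers_sent: bool) -> str:
    """What the gateway does when the ExtProc stream can't be opened or
    closes with an error (simplified model of failure_mode_allow)."""
    if fail_open:
        # Processing continues without ExtProc: nothing gets mutated.
        return "continue without ExtProc"
    if not response_headers_sent:
        return "send 500 downstream"
    # Too late to change the status code, so the downstream stream is reset.
    return "reset stream"
```

For the redaction filter in §3, the fail-open branch is exactly the leak the filter exists to prevent, which is why security filters want `false`.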

Watch out Response-body buffering on streaming LLMs

OpenAI-shape /v1/chat/completions with "stream": true returns SSE chunks. If responseBodyMode: BUFFERED is set, Envoy collects every chunk before sending anything to the client. The user sees a long pause, then the entire response at once, and Cursor, Continue, and most other LLM clients can time out before the first byte arrives.

Fix: set responseBodyMode: STREAMED and write the server to process each chunk independently. If you genuinely need the whole response before deciding (e.g. a moderation pass that requires the final output), apply the policy only to the non-streaming route or non-streaming model variant.
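That rule is simple enough to check in CI. The sketch below assumes you can list each route with whether it streams and its kgateway processingMode; the inventory format and the `lint` name are hypothetical. BUFFERED_PARTIAL is flagged too, because it also holds back the start of the stream until the buffer fills or the response ends.

```python
def lint(routes: dict) -> list:
    """Flag response-body modes that buffer SSE on streaming routes.

    `routes` maps a route name to {"streams": bool, "mode": {...}}, with
    "mode" in kgateway's camelCase processingMode keys. This inventory
    format is hypothetical; feed it from wherever your route config lives.
    """
    problems = []
    for name, route in routes.items():
        body = route["mode"].get("responseBodyMode", "NONE")
        # Buffering is fine on non-streaming routes; only streaming ones stall.
        if route["streams"] and body in ("BUFFERED", "BUFFERED_PARTIAL"):
            problems.append(f"{name}: responseBodyMode {body} holds back SSE; use STREAMED")
    return problems
```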

7 · When to reach for ExtProc vs ExtAuth

The two filters look superficially similar (gRPC sidecar called by Envoy on every request), but they occupy different jobs and different points in the filter chain. Pick wrong and you either pay for capability you don't need, or you build code to do something a stock config already handles.

The 30-second heuristic. Does the decision depend on the request body, or do you need to touch the response? → ExtProc.
Is it allow/deny (with optional header injection) based on request attributes only? → ExtAuth, and you can almost certainly do it with stock AuthConfig plugins, no code.
Both apply? → run them both. ExtAuth first, ExtProc second.

In sequence: one request, two filters. ExtAuth runs first, as a single Check() RPC over request attributes. On deny, the gateway short-circuits and the upstream never sees the call. On allow, ExtAuth optionally injects headers and hands off to ExtProc, which opens a bidirectional gRPC stream and can read or mutate any enabled phase on the way to the upstream and on the way back. The response only ever goes through ExtProc, never ExtAuth; that's why response-body redaction and LLM-streaming filters have to be ExtProc.

The capability matrix

Dimension	ExtAuth	ExtProc
Position in chain	Before route selection, before the body is read	Through the request/response lifecycle, after route selection
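Round-trips per request	Exactly 1 (single `Check` RPC)	One per enabled header/trailer phase, plus one per body message (one total when buffered, one per chunk when streamed)
Sees the request body	Not by default (attributes only: method, path, headers, peer; Envoy's ext_authz can optionally buffer the body into the `Check` call)	Yes, buffered or streamed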
Sees the response	No	Yes — headers, body, trailers, streamed if you want
Can mutate headers	Yes, on the request only (via `OkResponse`)	Yes, on every enabled phase, both directions
Can mutate the body	No	Yes
Can short-circuit the request	Yes (`DeniedResponse` — that's its whole job)	Yes (`ImmediateResponse` at any phase)
Ships with out-of-the-box logic	Yes. Solo's `ext-auth-service` covers OIDC, OAuth2, OPA, API key, basic auth, LDAP, JWT, passthrough — all driven by `AuthConfig` YAML, no code	No. Always a BYO gRPC server (your Python / Go / Rust / whatever).
Typical added latency	1 sidecar hop, usually <5 ms	1 hop per enabled phase. Cheap if you only enable response-headers, expensive if you stream the body
Solo CRD wrapper	`AuthConfig` + `RouteOption` / `EnterpriseKgatewayTrafficPolicy.spec.extAuth`	`GatewayExtension` (kgateway) or direct `backendRef` (agentgateway) + the same traffic-policy CRDs

Use-case picker

Need	Use	Why
Validate a JWT, check an OAuth scope, decide allow/deny	ExtAuth	One call, no body access needed. The `AuthConfig` CRD already covers this without writing code.
Inject verified-claim headers (e.g. `x-tenant-id`)	ExtAuth	Header injection is the response shape ExtAuth's `OkResponse` is designed for. See JWT claims to headers.
Redact secrets from a response body	ExtProc	Needs response-body access. ExtAuth never sees the response, only request attributes.
Inspect LLM output for PII or prompt-injection markers	ExtProc	Streamed response body. The Solo agentgateway prompt-guard filter is internally an ExtProc-shaped service.
Semantic cache lookup keyed off the request body	ExtProc	Need to read the request body, possibly short-circuit with an immediate response from cache.
Sign a response, or strip a trailer based on body content	ExtProc	Trailer mutation and body inspection both need the streaming protocol.
Route to cluster A vs B based on a request-body field	ExtProc	Routing decisions on request-body content aren't possible in plain HTTPRoute matchers.
Aggregate two upstream responses into one	ExtProc	The Solo KB API Aggregation use case.
OPA-policy decision over the full request context	ExtAuth+OPA plugin	Solo's `ext-auth-service` ships an OPA plugin out of the box, no ExtProc needed unless you also want body mutation. See Solo ext-auth-service.

Both at once? Yes: the filter chain runs ExtAuth first, then ExtProc. ExtAuth gates whether the request continues; ExtProc mutates what's left. A common pattern for AI gateways: ExtAuth validates the JWT and lifts x-user-id into a header. ExtProc reads that header in its request-headers phase and remembers it for the rest of the stream, because body messages carry no headers. In the request-body phase, it then decides whether to allow the prompt through, redact it, or reject it with an immediate response.
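The key detail is that state has to live on the stream. Below is a sketch of that flow. It assumes a BUFFERED request body (one body message), plain dicts mirroring the proto field names, headers simplified to a dict, and hypothetical policy inputs (`deny_users`, a literal `SECRET` marker). `PromptGate` is an illustrative name, and one instance corresponds to one ExtProc stream, i.e. one HTTP request.

```python
from typing import Optional


class PromptGate:
    """Per-stream state for the ExtAuth-then-ExtProc pattern (sketch)."""

    def __init__(self, deny_users: set) -> None:
        self.deny_users = deny_users
        self.user: Optional[str] = None  # captured in the headers phase

    def on_message(self, req: dict) -> dict:
        if "request_headers" in req:
            # Body messages carry no headers, so remember the user id now.
            self.user = req["request_headers"]["headers"].get("x-user-id")
            return {"request_headers": {}}
        if "request_body" in req:
            # Default-deny: no verified user, or a denied one, is rejected.
            if self.user is None or self.user in self.deny_users:
                return {"immediate_response": {"status": {"code": 403}}}
            body = req["request_body"]["body"]
            if b"SECRET" in body:
                # Replace the prompt body before it reaches the upstream.
                redacted = body.replace(b"SECRET", b"[redacted]")
                return {"request_body": {"response": {"body_mutation": {"body": redacted}}}}
            return {"request_body": {}}
        # Any other phase: pass through unchanged.
        phase = next(iter(req))
        return {phase: {}}
```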

Where to go next

Solo external auth service, the ExtAuth side of the same story.
JWT claims to HTTP headers, why ExtAuth fronting ExtProc is the pattern, and how the verified-claim handoff works.
kgateway ExtProc docs, official reference including the aggregation and dynamic-routing use cases.
Envoy ExtProc proto, the message-by-message spec.
