Code mode in agentgateway: one run_code tool instead of a wall of MCP tools, by Tom O'Rourke

Point an MCP client at a big API and it gets a big list of tools, one per operation. The whole catalogue is pasted into the model's context every turn, and a job that touches five operations becomes five separate tool calls with the model copying data between them by hand. Code mode turns that around. The same OpenAPI backend is exposed as a single run_code tool whose description is a generated TypeScript API. The client (the model or agent calling the gateway, not the end user) writes one small JavaScript program against that API, the gateway runs it in a sandbox, makes the upstream REST calls, and returns only what the program decided to return. The end user just asks a question in natural language; the model writes the code. This lab puts a petstore behind agentgateway in toolMode: Code on kind and shows the whole thing live: the single tool and its generated API, a raw run_code call you drive by hand, and Claude reading the API and writing the JavaScript itself.

The problem code mode solves

An MCP server that fronts a real API exposes a tool per operation. That is fine for three tools and painful for thirty: every tool schema is in the model's context on every turn, and a task that lists, filters, looks up detail and aggregates becomes a back-and-forth of one tool call per step, with every intermediate result making the full round trip into the model and back out again. The model ends up being the glue code, paying tokens to shuttle JSON it never needed to see.

agentgateway exposes an MCP backend in one of three toolModes. The same petstore OpenAPI looks completely different to the client depending on which one you pick:

toolMode: Standard (default)

Every operation is its own tool: addPet, findPetsByStatus, getPetById, deletePet. Simple, but the whole catalogue sits in context and each step is a round trip.

toolMode: Search (progressive disclosure)

The client gets get_tool and invoke_tool instead of the full list, and discovers operations on demand. Keeps context small when the tool count is large.

toolMode: Code (run_code)

One tool, run_code. Its description is a generated TypeScript API, one async function per operation. The client writes JavaScript; the gateway runs it and makes the calls. This lab.

The flow

Standard mode would put all nineteen of the petstore's operations in the client's context as separate tools and make the client orchestrate a round trip per step. Code mode sends one program, fans the REST calls out inside the gateway, and returns only the answer.

The setup

Four objects on a kind cluster running Solo Enterprise for agentgateway. The OpenAPI document goes in a ConfigMap; the backend turns it into MCP tools and collapses them with toolMode: Code; a Gateway and an HTTPRoute expose the MCP endpoint at /mcp.

yaml/backend.yaml: the code-mode backend

apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayBackend
metadata:
  name: petstore-codemode
  namespace: agentgateway-system
spec:
  entMcp:
    toolMode: Code            # one run_code tool instead of one tool per operation
    codeMode:
      timeout: 10s            # how long a single run_code program may run
    sessionRouting: Stateless
    failureMode: FailClosed
    targets:
      - name: petstore
        static:
          host: petstore3.swagger.io
          port: 443
          protocol: OpenAPI
          policies:
            tls: {}           # the petstore is HTTPS; without this every call 400s
          openAPI:
            schemaRef:
              name: petstore-openapi   # ConfigMap built from the published spec

The backend's schemaRef points at a ConfigMap whose data.schema key holds the API's OpenAPI 3.0 document. You do not write that document. The API publishes its own, and you load the published spec into the ConfigMap as-is. The petstore serves its own at /api/v3/openapi.json, so the whole step is one command:

build the ConfigMap from the published spec

kubectl create configmap petstore-openapi -n agentgateway-system \
  --from-file=schema=<(curl -s https://petstore3.swagger.io/api/v3/openapi.json) \
  --dry-run=client -o yaml | kubectl apply -f -

This is config-time setup, owned by whoever owns the gateway config (a platform team, or the API's owner), on the same lifecycle as the Backend and the Route, and it belongs in git / GitOps. The MCP client never sees it. For an internal API the spec usually comes straight from the framework that serves it (a /openapi.json on a FastAPI or Spring service, say), and a pipeline loads each published version into the ConfigMap; for a third-party API you take the vendor's published spec. Nobody hand-edits the JSON. 03-backend-route.sh runs exactly the command above, falling back to a pinned yaml/petstore-openapi.json when the URL is unreachable (for example, in an air-gapped environment).

Every operation in the spec becomes one function in the generated API. The petstore's published spec has nineteen, so a standard-mode client would see nineteen separate tools; code mode turns all of them into the single run_code tool. Here is a trimmed look at the document that lands in data.schema:

excerpt of the published petstore openapi.json

{
  "openapi": "3.0.4",
  "info": { "title": "Swagger Petstore - OpenAPI 3.0", "version": "1.0.27" },
  "servers": [ { "url": "/api/v3" } ],
  "paths": {
    "/pet/findByStatus": {
      "get": {
        "operationId": "findPetsByStatus",
        "parameters": [
          { "name": "status", "in": "query",
            "schema": { "type": "string", "default": "available",
                        "enum": ["available", "pending", "sold"] } }
        ]
      }
    },
    "/pet/{petId}": { "get": { "operationId": "getPetById" } }
    // ... 17 more operations: pets, store orders, users ...
  },
  "components": { "schemas": { "Pet": {}, "Order": {}, "User": {} } }
}
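To check how many functions a given spec would turn into, the operations can be counted straight from the document. The following is a minimal sketch, assuming each operation is identified by its operationId as in the excerpt above; listOperations is a hypothetical helper for illustration, not part of agentgateway.

```javascript
// HTTP methods that OpenAPI 3.0 allows as operation keys on a path item.
const HTTP_METHODS = ["get", "put", "post", "delete", "options", "head", "patch", "trace"];

// Return one name per operation in an OpenAPI 3.0 document.
// Operations without an operationId are reported as "METHOD path".
function listOperations(spec) {
  const ops = [];
  for (const [path, item] of Object.entries(spec.paths ?? {})) {
    for (const method of HTTP_METHODS) {
      const op = item?.[method];
      if (op) ops.push(op.operationId ?? `${method.toUpperCase()} ${path}`);
    }
  }
  return ops;
}

// A two-operation excerpt shaped like the petstore document above.
const excerpt = {
  openapi: "3.0.4",
  paths: {
    "/pet/findByStatus": { get: { operationId: "findPetsByStatus" } },
    "/pet/{petId}": { get: { operationId: "getPetById" } },
  },
};

console.log(listOperations(excerpt).join(", ")); // findPetsByStatus, getPetById
```

Run against the full published document instead of the excerpt, the length of the returned array should match the number of operations the spec declares.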

yaml/gateway.yaml + yaml/httproute.yaml

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: code-mode-gateway
  namespace: agentgateway-system
spec:
  gatewayClassName: enterprise-agentgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes: { namespaces: { from: All } }
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: petstore-mcp
  namespace: agentgateway-system
spec:
  parentRefs:
    - name: code-mode-gateway
  rules:
    - matches:
        - path: { type: PathPrefix, value: /mcp }
      backendRefs:
        - group: enterpriseagentgateway.solo.io
          kind: EnterpriseAgentgatewayBackend
          name: petstore-codemode

What the client actually sees

List the tools on the MCP endpoint and there is exactly one, even though the spec has nineteen operations. The tool's description is the contract: the rules for writing the JavaScript, a couple of worked examples, and then the Available API, which is the whole petstore turned into typed async functions. This is what a model reads before it writes anything.

./scripts/show-tools.sh (captured live)

The gateway exposes 1 MCP tool(s):

  • run_code
      input: { code }

run_code description — the generated TypeScript API the client writes against:

Execute code to achieve a goal.

Write JavaScript. The code runs as a top-level script, not inside a function.
Top-level await is available. The final expression becomes the result.
Do not use `return` at top level. ...

Available API:
```js
// Add a new pet to the store.
// type Input = { body: { id?: number, name: string, category?: { id?: number, name?: string }, photoUrls: Array<string>, tags?: Array<({ id?: number, name?: string })>, status?: "available" | "pending" | "sold" } }
async function addPet(input: Input): Promise<unknown>;

// Multiple status values can be provided with comma separated strings.
// type Input = { query: { status: "available" | "pending" | "sold" } }
async function findPetsByStatus(input: Input): Promise<unknown>;

// Returns a single pet.
// type Input = { path: { petId: number } }
async function getPetById(input: Input): Promise<unknown>;

// ... 16 more: findPetsByTags, updatePet, deletePet, getInventory,
//     placeOrder, getOrderById, createUser, loginUser, getUserByName, ...
```

Two things to notice. The parameters are grouped by where they live in the HTTP request, so a path parameter is getPetById({ path: { petId } }) and a body is addPet({ body: { … } }). And the enums survive the round trip from OpenAPI into TypeScript, so the model knows status is one of three strings without guessing. All nineteen functions are in this one tool's description; none of them is a separate tool in the client's context.
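The grouping by request location can be pictured as a direct mapping from the input object onto an HTTP request. The sketch below is only an illustration of that mapping, not the gateway's implementation; buildRequest and its return shape are hypothetical.

```javascript
// Hypothetical helper: turn a grouped input ({ path, query, body }) into
// the method, URL and serialized body of an HTTP request.
function buildRequest(method, pathTemplate, input = {}) {
  // Substitute {name} placeholders in the template from input.path.
  const path = pathTemplate.replace(/\{(\w+)\}/g, (_, name) => {
    const value = input.path?.[name];
    if (value === undefined) throw new Error(`missing path parameter: ${name}`);
    return encodeURIComponent(String(value));
  });
  // Encode input.query as a query string.
  const query = new URLSearchParams(input.query ?? {}).toString();
  return {
    method,
    url: query ? `${path}?${query}` : path,
    body: input.body === undefined ? undefined : JSON.stringify(input.body),
  };
}

console.log(buildRequest("GET", "/api/v3/pet/{petId}", { path: { petId: 4334 } }).url);
// /api/v3/pet/4334
console.log(buildRequest("GET", "/api/v3/pet/findByStatus", { query: { status: "available" } }).url);
// /api/v3/pet/findByStatus?status=available
```

Keeping path, query and body in separate keys means a parameter name that appears in more than one location cannot collide.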

Calling run_code directly (no model)

Who writes the JavaScript? In normal use it's the model, not a person. The end user asks a question in natural language and the model reads the generated API and writes the program (that's the next section). This section skips the model on purpose: we hand-write one program ourselves and send it, so you can see exactly what the run_code tool receives and returns with nothing else in the way. It's a plumbing check, not the customer experience.

run-code.sh sends a JavaScript program as the tool's code argument. The program below lists the available pets, groups them by category, fetches detail for the first few in parallel, and returns a small summary. Every await is a REST call the gateway makes to the petstore; the counting and shaping happen inside the sandbox, so only the summary comes back.

the program sent to run_code

// (OpenAPI list responses come back wrapped as { data: [...] } - unwrap it.)
const res = await findPetsByStatus({ query: { status: "available" } });
const pets = res.data ?? res;

// Fetch full detail for the first few, in parallel, in the same call.
const sample = pets.filter((p) => Number.isSafeInteger(p.id)).slice(0, 3);
const detailed = await Promise.all(sample.map((p) => getPetById({ path: { petId: p.id } })));

({
  availableCount: pets.length,
  byCategory: pets.reduce((acc, p) => {
    const c = (p.category && p.category.name) || "uncategorised";
    acc[c] = (acc[c] || 0) + 1;
    return acc;
  }, {}),
  sampleDetail: detailed.map((d) => {
    const p = d.data ?? d;
    return { id: p.id, name: p.name, category: (p.category && p.category.name) || null };
  }),
})

run_code returned (captured live)

{
  "success": {
    "availableCount": 134,
    "byCategory": {
      "Dogs": 31,
      "uncategorised": 91,
      "Cats": 1,
      "gen": 1
    },
    "sampleDetail": [
      { "id": 4334, "name": "Biscuit", "category": "Dogs" },
      { "id": 295,  "name": "dens",    "category": null },
      { "id": 233,  "name": "dog",     "category": null }
    ]
  }
}

One list call that returned 134 pets and three detail lookups all ran inside the gateway; one short object came back. A standard-mode client would have pulled the whole 134-pet array into the model's context just to count it. run_code always answers with { "success": … } or { "error": { "message": … } }, so a caller can branch on which key is present.
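Because the result carries either a success key or an error key, a caller can reduce it to a value or an exception in one place. A minimal sketch, assuming the envelope is exactly as described above; unwrapRunCode is a hypothetical client-side helper.

```javascript
// Turn a run_code result envelope into a plain value, or throw on failure.
function unwrapRunCode(result) {
  if (result && typeof result === "object" && "error" in result) {
    throw new Error(result.error?.message ?? "run_code failed");
  }
  if (result && typeof result === "object" && "success" in result) {
    return result.success;
  }
  // Neither key present: treat as a protocol problem rather than guessing.
  throw new Error("unexpected run_code response shape");
}

console.log(unwrapRunCode({ success: { availableCount: 134 } }).availableCount); // 134

try {
  unwrapRunCode({ error: { message: "Error: value is not iterable" } });
} catch (err) {
  console.log(err.message); // Error: value is not iterable
}
```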

Letting Claude drive it

Now hand the whole thing to a model. You run one command with a question in natural language:

what you type

./scripts/ask-llm.sh "How many pets are available, broken down by category?
                      Show me three example available pets with their category."

and you get one answer back:

what you get back (captured live, claude-sonnet-4-6)

Available pets by category:
  Uncategorized   91
  Dogs            31
  狗 (Dogs, zh)    8
  Cats             1
  …
  Total          134

Three examples: Biscuit (Dogs), dens (Uncategorized), zcqAtJBiMX (tcwLeEooaR).

Why the odd rows? The petstore is a shared public sandbox, so its live data is full of other people's test entries. 狗 and 犬类 are just "dog" and "canines" written in Chinese by some other tester, and a pet named zcqAtJBiMX in a category called tcwLeEooaR is random junk someone left behind. The three examples line reads as name (category): Biscuit is in the Dogs category, dens has no category, and zcqAtJBiMX is the junk one. That mess is the point: the model filtered and grouped it inside the gateway and handed back a clean summary, instead of dumping 134 raw records for you to sort out.

That is the whole experience for whoever asks: one question, one answer. They never see run_code, the JavaScript, or the petstore. Everything below is what happened in between, which the script also prints so you can watch it.

What happens in between

ask-llm.sh gives Claude exactly one tool, run_code, with the generated API as its description, and lets it work in a loop. Each step is the same exchange: Claude writes a small JavaScript program and calls run_code with it, the gateway runs the program and returns the result, and Claude reads that result and decides what to do next. It repeats until it can answer, then writes the natural-language reply above. The generated API tells it to inspect an unfamiliar response before trusting it, so here the first few steps are Claude probing the shape, and the last is the real program.

Step 1 · Claude → run_code: assumes a plain array

const pets = await findPetsByStatus({ query: { status: "available" } });
pets.slice(0, 3).map(p => ({ id: p.id, name: p.name, category: p.category }));

Step 1 · run_code → Claude: the result is not an array

{"error":{"message":"Error: not a function\n    at <eval> (eval_script:3:6)"}}

Step 2 · Claude → run_code: tries the full program anyway

const pets = await findPetsByStatus({ query: { status: "available" } });
const categoryCount = {};
for (const pet of pets) { /* ...group by category... */ }
({ total: pets.length, categoryCount });

Step 2 · run_code → Claude: still wrong, pets is not iterable

{"error":{"message":"Error: value is not iterable\n    at <eval> (eval_script:2:20)"}}

Step 3 · Claude → run_code: stops guessing and inspects the shape

const response = await findPetsByStatus({ query: { status: "available" } });
JSON.stringify(response).slice(0, 500);

Step 3 · run_code → Claude: the list is wrapped in { data: [...] }

{"success":"{\"data\":[{\"id\":4334,\"category\":{\"name\":\"Dogs\"},\"name\":\"Biscuit\", ..."}

Step 4 · Claude → run_code: now the real program (unwrap, group, sample)

const response = await findPetsByStatus({ query: { status: "available" } });
const pets = response.data;                         // <- the fix it just learned

const categoryCount = {};
for (const pet of pets) {
  const c = pet.category?.name || "Uncategorized";
  categoryCount[c] = (categoryCount[c] || 0) + 1;
}
const sorted = Object.entries(categoryCount)
  .sort((a, b) => b[1] - a[1])
  .map(([category, count]) => ({ category, count }));
const examples = pets.slice(0, 3).map(p => ({
  name: p.name, category: p.category?.name || "Uncategorized",
}));

({ total: pets.length, sorted, examples });

Step 4 · run_code → Claude: one small summary (Claude turns this into the answer above)

{"success":{"total":134,
  "sorted":[{"category":"Uncategorized","count":91},{"category":"Dogs","count":31},
            {"category":"狗","count":8},{"category":"Cats","count":1}, ...],
  "examples":[{"name":"Biscuit","category":"Dogs"},
              {"name":"dens","category":"Uncategorized"}, ...]}}

Four steps, each one run_code call, and the heavy data never left the gateway: the 134-pet list was counted and grouped inside the sandbox, and what crossed into the model was a 500-character sample to learn the shape and then the small summary. In standard mode the same task is a round trip per list-then-detail step with the full array landing in the model's context each time. The wrong guesses in steps 1 and 2 are the honest part: the model recovers in the same loop, because each result comes straight back to it.
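The loop itself is small. Below is a sketch of its shape with both ends faked so it runs on its own: askModel stands in for the call to Claude and runCode for the MCP call to the gateway. Both are hypothetical stand-ins, as is the transcript format, and ask-llm.sh may be structured differently.

```javascript
// Drive a model against a single run_code tool until it produces an answer.
// askModel(transcript) resolves to { code } or { answer }; runCode(code)
// resolves to the { success } / { error } envelope.
async function agentLoop(question, askModel, runCode, maxSteps = 8) {
  const transcript = [{ role: "user", content: question }];
  for (let step = 1; step <= maxSteps; step++) {
    const turn = await askModel(transcript);
    if (turn.answer !== undefined) return { answer: turn.answer, steps: step - 1 };
    const result = await runCode(turn.code);
    // Errors go back to the model like any other result, so it can recover.
    transcript.push({ role: "assistant", code: turn.code });
    transcript.push({ role: "tool", result });
  }
  throw new Error(`no answer after ${maxSteps} steps`);
}

// Fake tool: only programs that unwrap .data succeed, as in the walkthrough.
const fakeRunCode = async (code) =>
  code.includes(".data")
    ? { success: { total: 2 } }
    : { error: { message: "Error: value is not iterable" } };

// Fake model: guesses wrong, fixes the program after an error, then answers.
const fakeModel = async (transcript) => {
  const last = transcript[transcript.length - 1];
  if (last.role === "user") return { code: "for (const p of pets) {}" };
  if (last.result.error) return { code: "const pets = response.data;" };
  return { answer: `total: ${last.result.success.total}` };
};

agentLoop("How many pets?", fakeModel, fakeRunCode).then((r) => console.log(r.answer, r.steps));
// total: 2 2
```

The step cap is a design choice of this sketch: without it, a model that never converges would keep calling the tool.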

How it runs the code

The JavaScript runs in a sandbox inside the gateway, not in the client and not in the petstore. A program is a top-level script: top-level await is available, the final expression becomes the result, and there is no top-level return. The functions in the generated API are the only way out to the network; a program cannot reach anything the backend did not expose. Each run is bounded by codeMode.timeout from the backend spec, with a memory ceiling and a cap on how many upstream calls one program may make, so a runaway or abusive program fails closed instead of hammering the upstream.
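To make the two simpler limits concrete, here is a sketch of a call budget and a wall-clock timeout in plain JavaScript. It illustrates the idea only and says nothing about how the gateway's sandbox enforces its limits; capCalls and withTimeout are hypothetical names, and the memory ceiling is not modelled.

```javascript
// Wrap every function in `api` so that at most `maxCalls` calls go through.
function capCalls(api, maxCalls) {
  let used = 0;
  const capped = {};
  for (const [name, fn] of Object.entries(api)) {
    capped[name] = async (...args) => {
      if (++used > maxCalls) throw new Error(`call budget of ${maxCalls} exceeded`);
      return fn(...args);
    };
  }
  return capped;
}

// Reject if `promise` has not settled within `ms` milliseconds.
// Note: this stops waiting; it does not cancel the underlying work.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Three lookups against a budget of two: the third is refused.
const api = capCalls({ getPetById: async ({ path }) => ({ id: path.petId }) }, 2);
withTimeout(
  Promise.all([1, 2, 3].map((petId) => api.getPetById({ path: { petId } }))),
  1000
).catch((err) => console.log(err.message)); // call budget of 2 exceeded
```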

Where the program actually executes: only in the gateway's sandbox. The client writes it and the petstore serves the REST calls, but neither runs the code.

Each call is compiled and run in a fresh sandbox: the gateway keeps no cache of programs and no memory of the last one, so every call carries its own freshly written program (any reuse would be the client's own doing). The gateway does not log the program itself, but the generated code is visible at the client where it is written; for example, ask-llm.sh prints every program the model sends.

On the public petstore. petstore3.swagger.io is a shared demo and its write path is flaky (addPet was returning 500 while this was captured), so the lab leans on the read and aggregate operations, which is where code mode earns its keep anyway. Two real details show through and are worth keeping: the upstream is HTTPS, so the target needs policies.tls or every call returns 400; and the OpenAPI list response comes back wrapped as { data: [...] }, which is exactly the kind of shape the model is told to inspect before trusting.
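A program that has to cope with both a bare array and the wrapped { data: [...] } form can normalise the response once, up front. This is a minimal sketch; unwrapList is a hypothetical helper and is stricter than the `res.data ?? res` shorthand used earlier, because it fails loudly on any other shape.

```javascript
// Accept either a bare array or { data: [...] } and always return the array.
function unwrapList(response) {
  if (Array.isArray(response)) return response;
  if (response && Array.isArray(response.data)) return response.data;
  throw new TypeError("expected an array or { data: [...] }");
}

console.log(unwrapList({ data: [{ id: 1 }] }).length); // 1
console.log(unwrapList([{ id: 1 }, { id: 2 }]).length); // 2
```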

Run it yourself

You need Docker, kind, kubectl, helm and uv (for the Python MCP client), a Solo Enterprise for agentgateway license, and an ANTHROPIC_API_KEY for the Claude step. There are two ways to drive the tool, and only the second is what a real user does: run-code.sh lets you hand a JavaScript program to the tool to see the plumbing, and ask-llm.sh is the real flow where you ask in natural language and the model writes and runs the JavaScript for you.

quickstart

export AGENTGATEWAY_LICENSE_KEY=...     # Solo Enterprise for agentgateway
export ANTHROPIC_API_KEY=...           # for ask-llm.sh

# bring up kind + agentgateway + the code-mode backend
./scripts/quick.sh up

# what an MCP client sees: one run_code tool + its generated TypeScript API
./scripts/show-tools.sh

# plumbing check (no model): YOU hand a JS program to the tool, get a summary back
./scripts/run-code.sh
./scripts/run-code.sh 'const r = await findPetsByStatus({ query: { status: "sold" } }); (r.data ?? r).length'

# the real flow: you ask in natural language, the MODEL writes + runs the JavaScript
./scripts/ask-llm.sh "which categories have the most available pets?"

./scripts/quick.sh teardown

Observing it

From the operator's side the gateway's logs show the call coming in and the REST calls going out, with more detail as you turn the level up. At the default info level the access log already records every inbound run_code call:

access log (info): the inbound run_code call

request route=agentgateway-system/petstore-mcp http.path=/mcp http.status=200
  protocol=mcp mcp.method.name=tools/call mcp.target=code_mode
  gen_ai.tool.name=run_code mcp.session.id=… duration=956ms

That tells you run_code ran and how long it took, but not the calls it fanned out to the petstore. Turn the data plane up to debug at runtime through its admin endpoint (no restart) and each upstream REST call the sandbox makes shows up as its own line. observe.sh does the port-forward, sets the level, tails the logs, and resets to info when you stop it:

turn the level up and watch

# one terminal: raise the level and tail (Ctrl-C resets it to info)
./scripts/observe.sh debug

# another terminal: make a call
./scripts/run-code.sh

# or by hand against the admin endpoint:
kubectl -n agentgateway-system port-forward <gateway-pod> 15900:15000 &
curl -X POST "http://127.0.0.1:15900/logging?level=debug"    # reset with level=info

debug: the call from the gateway to the petstore (one per await)

upstream request target=petstore3.swagger.io:443 endpoint=32.196.215.190:443
  transport=tls http.method=GET http.host=petstore3.swagger.io
  http.path=/api/v3/pet/findByStatus http.version=HTTP/2.0 http.status=200 duration=785ms

At trace you get the full outbound request, query string and headers included (uri: …/api/v3/pet/findByStatus?status=pending). The access log also carries trace.id / span.id, so with OpenTelemetry enabled (the gateway reads the standard OTEL_* env vars) one run_code call becomes a single trace with its petstore calls as child spans, and Prometheus metrics are exposed on the pod's metrics port.

One thing you can't see here. The JavaScript the client sent is not in the gateway logs at any level — request bodies are not logged, and the access log deliberately omits the tool-call arguments. The program body is the client's to capture, which is exactly what run-code.sh and ask-llm.sh already print. So the operator sees that run_code ran and every REST call it caused, but not the code itself.
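The log lines shown in this section are runs of key=value fields, which makes them easy to filter in a script. A rough sketch follows, assuming values contain no spaces (true of the samples here, not guaranteed in general); parseLogFields is a hypothetical helper and the sample line is a subset of the fields captured above.

```javascript
// Split a log line into its key=value fields; bare words are ignored.
function parseLogFields(line) {
  const fields = {};
  for (const token of line.trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return fields;
}

const line =
  "request route=agentgateway-system/petstore-mcp http.path=/mcp http.status=200 " +
  "protocol=mcp mcp.method.name=tools/call mcp.target=code_mode " +
  "gen_ai.tool.name=run_code duration=956ms";

const fields = parseLogFields(line);
console.log(fields["gen_ai.tool.name"], fields.duration); // run_code 956ms
```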

Extending it

See the contrast. Apply yaml/backend-standard.yaml (the same petstore with no toolMode) and re-run show-tools.sh against it: nineteen separate tools, one per operation, instead of one run_code.
Swap the upstream. The backend is just an OpenAPI target. Point schemaRef at a different document and the generated API changes to match, no client changes needed.
Add a real MCP server. A target can speak StreamableHTTP instead of OpenAPI; code mode then wraps that server's tools as the generated functions.
Put policy in front of it. The MCP endpoint is an ordinary HTTPRoute, so JWT auth, rate limits and the rest of the agentgateway policy surface apply to run_code the same as any route.

Versions

Built and verified on:

Enterprise: validated 2026-06-18

Solo Enterprise for agentgateway: v2026.5.2

Gateway API: v1.4.0