The short version. On EKS, set
K8S_TOKEN_REVIEW: "true" in the
kagent-enterprise-config ConfigMap and restart the
controller. Everything below is why that setting exists and how to
confirm it took.
The symptom
You install Solo Enterprise for kagent on EKS with OIDC and
on-behalf-of delegation turned on. The front of the flow looks
healthy. A user dispatches a task, the controller validates the
user's OIDC token, maps the caller to a role, and mints an
on-behalf-of token for the agent. The agent starts running. Then
the moment the agent calls back to the controller (to read or update
its own task state, for example) the request returns
401 Unauthorized, and there is nothing useful in the
logs explaining why.
The agent's callback carries its Kubernetes service-account token, not the OIDC token. So the failure is narrow: the controller accepts OIDC tokens but rejects the agent's service-account token. The token is structurally valid. Its issuer, signing key, and audience are all correct. It still gets rejected. That combination is the EKS signature of this problem.
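To confirm that the rejected token really does carry the expected issuer and audience, you can decode its payload by hand. The helper below is a minimal sketch, not part of kagent: `jwt_segment` is a hypothetical name, and it assumes a `base64` binary that accepts `-d`. It only decodes the token and does not verify the signature.

```shell
# Decode one base64url segment of a JWT to JSON (1 = header, 2 = payload).
# Decoding only: this does NOT check the signature.
jwt_segment() {
  local token="$1" index="$2" seg
  # base64url uses '-' and '_' where standard base64 uses '+' and '/'.
  seg="$(printf '%s' "$token" | cut -d. -f"$index" | tr '_-' '/+')"
  # base64url drops '=' padding; restore it so base64 -d accepts the input.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
}

# Usage, from inside the agent pod (assumes the default token mount path;
# the agent's token may be projected somewhere else):
#   jwt_segment "$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" 2
```

In the decoded payload, `iss` is the issuer and `aud` is the audience list. On EKS you would expect `iss` to point at the AWS-hosted OIDC provider.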
How the controller validates a service-account token
The kagent controller can authenticate a service-account token one of two ways, and the right choice depends on the cluster.
Verify it locally (JWKS)
The controller fetches the cluster's public signing keys, then checks the token's signature and issuer itself. This is the default, and it is the right mode on a cluster whose API server both signs the projected tokens and serves the matching public keys, which covers most self-managed and local clusters.
Wrong fit on EKS
Ask the API server (TokenReview)
Instead of fetching keys, the controller hands the token to the
Kubernetes API server through the TokenReview API and asks
whether it is valid for the kagent audience. The
API server is the authority that issued the token, so it
validates natively, no key fetch involved.
Right fit on EKS
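To see what a TokenReview exchange looks like, you can submit one yourself. This is a sketch of the Kubernetes API call, not the controller's actual code: `token_review_json` is a hypothetical helper, and the `kagent` default audience is taken from the description above.

```shell
# Build a TokenReview request that asks the API server whether a token
# is valid for a given audience (defaults to "kagent").
token_review_json() {
  local token="$1" audience="${2:-kagent}"
  printf '{"apiVersion":"authentication.k8s.io/v1","kind":"TokenReview","spec":{"token":"%s","audiences":["%s"]}}\n' \
    "$token" "$audience"
}

# Usage (TOKEN holds the service-account token under test):
#   token_review_json "$TOKEN" | kubectl create -f - -o yaml
```

In the returned object, `status.authenticated` reports the API server's verdict, `status.user` identifies the service account, and `status.error` carries the reason when validation fails.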
Why local verification is the wrong fit on EKS
EKS does service-account token signing differently from a typical cluster. Each EKS cluster has its own OIDC provider, hosted by AWS at a public URL, and the projected service-account tokens your pods receive are signed by a key published at that provider's JWKS endpoint. The token's issuer points at that AWS-hosted provider, and the public key that verifies it lives there too, on the public internet behind an AWS TLS certificate.
That setup leaves the local-verification path with no good source of keys, for two separate reasons:
- The in-cluster keys can be the wrong keys. The API server's own local key endpoint is reachable from inside the cluster, but on EKS it can advertise a different key than the one the external OIDC provider used to sign the projected token. The controller fetches a key set that does not contain the signing key, so the signature check fails.
- The external keys can be unreachable. Pointing the controller at the AWS-hosted JWKS URL only helps if the controller can actually reach it and trust its certificate. In a locked-down or ambient-mesh network, that outbound path and its public TLS chain are exactly the kind of thing that gets restricted.
So local verification is left choosing between a key set that is unreachable and one that has the wrong key. Neither validates the token, which is why it fails even though the token itself is perfectly good. No JWKS URL, issuer string, or audience override changes that, because the problem is the validation mode, not its parameters.
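One way to check the first case on your own cluster is to compare the key ID in the token's header with the key IDs the API server advertises in-cluster. The sketch below uses hypothetical helper names (`jwt_kid`, `kid_in_cluster_jwks`) and assumes the API server serves its key set at the standard `/openid/v1/jwks` path and that your kubectl user is allowed to read it.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Extract the "kid" (key ID) from a JWT header.
jwt_kid() {
  local seg
  seg="$(printf '%s' "$1" | cut -d. -f1 | tr '_-' '/+')"
  # Restore the '=' padding that base64url omits.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d |
    sed -n 's/.*"kid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeeds only if the in-cluster JWKS contains the token's key ID.
kid_in_cluster_jwks() {
  local kid
  kid="$(jwt_kid "$1")"
  [ -n "$kid" ] && $KUBECTL get --raw /openid/v1/jwks | grep -qF "\"$kid\""
}

# Usage:
#   kid_in_cluster_jwks "$TOKEN" && echo "key found" || echo "key missing"
```

A "key missing" result would be consistent with the wrong-keys case described above, although it does not prove that this is what the controller sees.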
The fix: TokenReview validation
Switch the controller to TokenReview validation. It then hands each service-account token to the EKS API server and trusts the API server's answer. The API server issued the token, so it validates it natively and sidesteps the whole external-key question. The setting is a single key on the controller's config:
# kagent-enterprise-config ConfigMap, in the controller's namespace
data:
  K8S_TOKEN_REVIEW: "true"
Apply it to the kagent-enterprise-config ConfigMap that
the controller reads its validation settings from, then restart the
controller so it picks up the change:
kubectl -n kagent patch configmap kagent-enterprise-config \
  --type merge -p '{"data":{"K8S_TOKEN_REVIEW":"true"}}'
kubectl -n kagent rollout restart deploy/kagent-controller
Keep this one in your upgrade runbook: a helm upgrade of
the controller re-renders its config, so re-apply the patch and
restart after an upgrade to keep TokenReview validation in place.
Two things make this work out of the box, with nothing else to wire up:
- The audience already lines up. TokenReview asks the API server to validate the token for the kagent audience, and the projected token the agent presents already carries that audience. No audience to configure.
- The controller already has permission. The controller's ClusterRole ships with create on tokenreviews.authentication.k8s.io, so it can call the TokenReview API the moment you turn the setting on.
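If you want to verify the permission rather than take it on trust, `kubectl auth can-i` can ask the API server directly. This is a sketch: `can_create_tokenreviews` is a hypothetical helper, the controller's service-account name is not given here so it stays a placeholder, and your own kubectl user needs impersonation rights for `--as` to work.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Ask the API server whether a service account may create TokenReviews.
# Prints "yes" or "no"; the exit status follows the answer.
can_create_tokenreviews() {
  local namespace="$1" service_account="$2"
  $KUBECTL auth can-i create tokenreviews.authentication.k8s.io \
    --as="system:serviceaccount:${namespace}:${service_account}"
}

# Usage (replace the placeholder with the controller's service account):
#   can_create_tokenreviews kagent <controller-service-account>
```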
Confirming the callback path is healthy
After the controller restarts, run the same flow that failed. Dispatch
a task, let the agent start, and watch the agent's callback to the
controller. The request that was returning 401 should now
return 200, and the task should progress to completion
instead of stalling at the first callback.
If it still returns 401 or 403 after the
switch, check that the controller pod actually restarted and re-read the
ConfigMap, and that the ConfigMap value is the quoted string
"true" and not an unquoted boolean. Those are the two usual reasons
the new mode fails to take effect.
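Both checks can be scripted. The sketch below assumes the `kagent` namespace and the `kagent-controller` deployment name used in the commands earlier; `token_review_enabled` is a hypothetical helper. It confirms what the ConfigMap holds, not what the running controller has loaded, which is why the usage pairs it with a rollout check.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Succeeds only if the ConfigMap holds the exact string "true".
token_review_enabled() {
  local namespace="${1:-kagent}" value
  value="$($KUBECTL -n "$namespace" get configmap kagent-enterprise-config \
    -o jsonpath='{.data.K8S_TOKEN_REVIEW}')"
  [ "$value" = "true" ]
}

# Usage: check the value, then wait for the restarted controller to be ready.
#   token_review_enabled kagent &&
#     kubectl -n kagent rollout status deploy/kagent-controller
```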
Checklist
kagent on EKS, the short version
- Install Solo Enterprise for kagent the usual way. The only EKS-specific change is the validation mode.
- Set K8S_TOKEN_REVIEW: "true" in the kagent-enterprise-config ConfigMap.
- Restart the controller so it reloads its validation settings.
- Re-apply the patch and restart after a helm upgrade, which re-renders the controller config.
- No audience or RBAC changes needed: the agent token already uses the kagent audience and the controller already has create on tokenreviews.
- Confirm by re-running the flow: the agent's callback returns 200 instead of a silent 401.
- Leave the setting off on kind, k3s, and typical self-managed clusters, where local verification works.