kgateway, agentgateway, Solo
Enterprise for Istio and kagent, because transitive
images (Redis, ext-auth, rate-limiter, model runners) change
between releases. Treat the mechanics below as solid; treat the
image list as version-specific.
The vendor airgap docs typically tell you to override every
image's registry/repository/tag
field via per-product CRs or Helm values. That works for a
single product on a small cluster. As soon as you're installing
the full Solo stack — gateway + mesh + agentgateway + kagent,
often with Gloo Operator managing Istio Ambient and east-west
waypoints — the override surface multiplies, and any miss is an
ImagePullBackOff. The runtime-layer redirect is the
fewer-footguns alternative; the rest of this page is the
mechanics for that.
Why not just override the Helm values?
Both approaches end up pulling from registry.internal.
The difference is how many surfaces you have to touch
and how many places a miss can hide.
Mechanics either way: skopeo copy --all
docker://upstream/path:tag docker://registry.internal/path:tag
preserves digest, multi-arch manifest list and tag — only the
hostname changes. From there:
Helm-values / per-CR overrides — surfaces you have to touch
- kgateway Helm values — controller + data-plane images
- agentgateway Helm values and EnterpriseAgentgatewayParameters CR — the controller injects the data-plane image from a separate field
- Solo Enterprise for Istio: istiod chart + istiod's injection templates (sidecar, ztunnel, waypoint)
- Gloo Operator — the operator that automates Istio Ambient, including east-west waypoint provisioning, has its own image fields for ztunnel + waypoint, separate from istiod's
- kagent Helm values + CR fields for model-runner sidecars
- Transitive deps (Redis, ext-auth, rate-limiter) — sometimes parameterised, sometimes not
- Anything a controller generates at runtime that isn't surfaced as a parameter
Runtime-layer redirect — surfaces you have to touch
- One per-node config (containerd hosts.toml, or CAPI files: block, or OpenShift IDMS object — pick one in the next section)
Every pull, from every chart, from every controller, lands on
the mirror automatically. Charts and CRs install with stock
values. Image references in the deployed YAML stay canonical
(us-docker.pkg.dev/...) — SBOMs, Cosign signatures
and audit logs all line up with what Solo published.
What this pattern actually does
The cluster is already in a tightly controlled air-gapped environment — no public-internet path, vetted ingress, audited change control. The mirror pattern doesn't replace any of that; it's a routing control that makes the rest of the install behave correctly inside that environment. Be explicit about what the mirror gives you, and which of the deeper sections cover everything else.
- What the mirror does: redirects every image pull from us-docker.pkg.dev / gcr.io / docker.io to registry.internal, transparently. Charts and CRs install unmodified; references stay canonical so SBOMs, Cosign signatures and audit logs all line up with what Solo published.
- What the rest of the article adds on top:
- Provenance — Cosign verify on the pull path (Step 2), platform-signed transfer manifest (Step 2), admission-time verify (Step 6).
- Air-gap enforcement — egress firewall + bypass probe so public registries are demonstrably unreachable, not just assumed unreachable (Step 7).
- Credential handling — TLS / mTLS / private CA, file permissions, secret-injection patterns; the goal is no long-lived plaintext token on any node (Step 4).
- Bootstrap hygiene — Job-per-node with TTL instead of a perpetual privileged DaemonSet, PSA + NetworkPolicy on the bootstrap namespace (Step 5).
- Availability — mirror HA, node-local pre-pull so a brief mirror outage during a rollout doesn't break running pods (HA & DR).
Without signature checks, anyone with push rights to
registry.internal can swap an image and
the cluster will run it. Pull-time Cosign (Step 2)
closes the public → mirror gap; admission-time Cosign (Step 6)
closes the mirror → workload gap. Both, not one.
Where the registry lives — and the catch-22
The mirror registry must be reachable before the cluster needs to pull its first image. Two patterns — pick the first one unless an explicit constraint forces otherwise.
Pattern A — external to the cluster
Dedicated Harbor / Quay / Zot / Artifactory host on the
air-gap network. No catch-22: the registry is up before any
cluster node boots, so the kubelet's first pull
(pause, CNI, kube-proxy, the rest) goes straight
through the mirror. This is how every production air-gap
install I've seen is built.
Pattern B — pre-loaded into the node template
Push every image into containerd's content store
at node-image build time — no in-cluster registry, no external
registry on the pull path at runtime.
Pro: faster standup, doesn't rely on an
external registry being reachable when the cluster boots.
Con: disk space — every node carries the full
image set, and the node image grows with every Solo release.
What lives where
Two clean tiers: a private registry outside the cluster, and a
small set of files on every cluster node that point
containerd at it. Nothing else changes.
| Component | Location | Notes |
|---|---|---|
| Private registry (Harbor / Quay / Zot / Artifactory) | External to the cluster, on the air-gap network | One per air-gap network is enough; HA optional |
| All mirrored container images | External registry | Solo images + every transitive dep (Redis, ext-auth, rate-limiter, future additions) |
| Mirrored Helm OCI charts | External registry | helm pull then helm push oci://... |
| Signing / SBOM artifacts | External registry | Cosign sigs, attestations — kept alongside images |
| containerd config.toml change | On every node | One line: config_path = "/etc/containerd/certs.d" |
| /etc/containerd/certs.d/<host>/hosts.toml | On every node | One file per upstream registry being mirrored |
| Mirror TLS CA cert | On every node | Only if the registry uses a private CA |
| Pre-loaded bootstrap images (pause, kubelet sidecars) | On every node | Only if nodes can't reach the registry until kubelet is up |
Pick your delivery path
Three mechanisms put the same mirror configuration onto every node. Option 1 is the underlying containerd config the whole article documents; Options 2 and 3 are convenience layers that generate it automatically for CAPI and OpenShift clusters respectively. The image-set and registry mechanics in Steps 1 and 2 below are identical in every path — only the delivery layer changes.
| Option | Use | API | What ends up on the node |
|---|---|---|---|
| 1 — Containerd configuration (foundation) | Hand-written config.toml + hosts.toml files | n/a — Packer / Ignition / cloud-init / MachineConfig | The actual /etc/containerd/certs.d/ layout this article documents in Steps 1–7 |
| 2 — Cluster API (vanilla kubeadm under CAPI) | KubeadmConfigTemplate with files: blocks | Cluster API | Bootstrap provider writes hosts.toml at node init |
| 3 — OpenShift only | ImageDigestMirrorSet + ImageTagMirrorSet | config.openshift.io/v1 — not upstream Kubernetes | Machine Config Operator rolls nodes, writes runtime config. No upstream equivalent on vanilla kubeadm. |
All three options ultimately produce the same on-node containerd config. Option 1 is what this article focuses on; Options 2 and 3 are convenience layers that generate the same config automatically for CAPI and OpenShift clusters.
What to do — Containerd configuration
- Enumerate the image set, mirror it with Cosign verification (Steps 1 and 2 below).
- Enable the config_path hosts.d layout in /etc/containerd/config.toml (Step 3) — restart containerd once; everything after this is hot-reloaded.
- Drop one hosts.toml per upstream registry under /etc/containerd/certs.d/<host>/ (Step 4), with TLS / mTLS posture and auth handled per Step 4.
- Pick how the files reach every node (Step 5) — Packer / Ignition / cloud-init at node-image build (preferred), MachineConfig / CAPI, or a Job-per-node for Day-0 only.
- Layer admission verification + egress lockdown on top (Steps 6 and 7).
This is the centre of gravity of the article — Steps 1–7 below are this option. Options 2 and 3 are convenience layers that generate the same on-node config automatically; pick those if you happen to be on a distro that provides them.
What to do — Cluster API
- Add a KubeadmConfigTemplate with one files: entry per /etc/containerd/certs.d/<host>/hosts.toml.
- Reference the CA bundle (and any auth token) via contentFrom.secret — don't inline credentials in the template body.
- Add preKubeadmCommands: ["systemctl restart containerd"] so the runtime picks up config_path on first boot.
- Bind the template to the worker MachinePool. Every new machine gets the same Option-1 files at bootstrap; node replacement is self-healing.
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata: { name: solo-airgap-workers, namespace: default }
spec:
  template:
    spec:
      files:
        - path: /etc/containerd/certs.d/us-docker.pkg.dev/hosts.toml
          owner: root:root
          permissions: "0644"
          content: |
            server = "https://us-docker.pkg.dev"
            [host."https://registry.internal"]
              capabilities = ["pull", "resolve"]
              ca = "/etc/containerd/certs.d/mirror-ca.crt"
        - path: /etc/containerd/certs.d/mirror-ca.crt
          owner: root:root
          permissions: "0644"
          contentFrom:
            secret: { name: mirror-ca-bundle, key: ca.crt }
      preKubeadmCommands:
        - "systemctl restart containerd"
Convenience layer over Option 1 for CAPI-managed kubeadm clusters. contentFrom.secret keeps secrets out of the template body.
What to do — OpenShift IDMS + ITMS
- kubectl apply an ImageDigestMirrorSet and an ImageTagMirrorSet — most Solo Helm charts reference images by tag, so you need both.
- Set mirrorSourcePolicy: NeverContactSource on every entry. This is the single most important air-gap flag — fails closed if the mirror is unreachable, no 30 s upstream timeout.
- The Machine Config Operator rolls each node automatically, writing the same containerd / CRI-O config Option 1 describes. No file edits.
- Validate: crictl pull on a representative node after the MCO finishes the roll.
# Digest pulls — safe by default, immutable identity
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata: { name: solo-airgap-digest }
spec:
  imageDigestMirrors:
    - source: us-docker.pkg.dev/solo-public/enterprise-agentgateway
      mirrors: [registry.internal/solo-public/enterprise-agentgateway]
      mirrorSourcePolicy: NeverContactSource
    - source: us-docker.pkg.dev/solo-public/istio
      mirrors: [registry.internal/solo-public/istio]
      mirrorSourcePolicy: NeverContactSource
    - source: docker.io/library
      mirrors: [registry.internal/dockerhub-library]
      mirrorSourcePolicy: NeverContactSource
---
# Tag pulls — needed for Solo Helm charts
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata: { name: solo-airgap-tag }
spec:
  imageTagMirrors:
    - source: us-docker.pkg.dev/solo-public/enterprise-agentgateway
      mirrors: [registry.internal/solo-public/enterprise-agentgateway]
      mirrorSourcePolicy: NeverContactSource
OpenShift only. IDMS / ITMS live in config.openshift.io/v1 and are applied by the Machine Config Operator. There is no upstream Kubernetes equivalent CRD or controller — on vanilla kubeadm the kinds are not registered, so kubectl apply fails with a "no matches for kind" error and no node is reconfigured. If you're not on OpenShift, use Option 1 or 2.
Step 1 Enumerate every image (list of images provided by Solo)
● External · connected admin host
Start from the image list Solo publishes for your version, then
extract the real list from the charts you actually intend to
install — that catches transitive images and image references
the controller generates at runtime.
Nothing touches the cluster nodes in this step —
output is a plain text file (images.txt) on the
admin host.
export VER=2026.5.0
# Pull every chart you intend to install
helm pull oci://us-docker.pkg.dev/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway \
  --version "$VER" --untar
helm pull oci://us-docker.pkg.dev/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway-crds \
  --version "$VER" --untar
# Repeat for kgateway, istio, kagent, gateway-api CRDs, etc.
# Render with realistic values and extract every image reference
helm template enterprise-agentgateway ./enterprise-agentgateway \
  -f values.yaml \
  | yq -r '.. | .image? // empty' \
  | grep -v '^$' | sort -u > images.txt
Some images only appear in the controller's emitted proxy
template (data-plane proxies, shared extensions) — render those
by reading the controller's defaults ConfigMap from
a connected dev cluster:
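kubectl get configmap -n agentgateway-system -o yaml \
  | grep -oE '(image|repository):[[:space:]]*[^[:space:]]+' \
  | sed -E 's/^[a-z]+:[[:space:]]*//' | tr -d "\"'" | sort -u >> images.txt
# Repository-only lines carry no tag; add one by hand before Step 2.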
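Rendered charts often mix fully qualified references with Docker Hub short names, while the mirror loop in Step 2 and the per-host hosts.toml layout in Step 4 both assume every line starts with a registry hostname. The following is a minimal sketch of a normalisation pass, assuming the usual Docker short-name rules; `qualify_ref` and `qualify_list` are hypothetical helper names, not part of any Solo tooling.

```shell
#!/bin/sh
# qualify_ref REF: expand a Docker-style short name to a fully qualified reference.
# Hypothetical helper; assumes Docker's short-name conventions.
qualify_ref() {
  ref=$1
  first=${ref%%/*}
  case "$ref" in
    */*)
      # The first path component is a registry host if it contains a dot,
      # a port separator, or is exactly "localhost".
      case "$first" in
        *.*|*:*|localhost) printf '%s\n' "$ref" ;;
        *) printf 'docker.io/%s\n' "$ref" ;;
      esac
      ;;
    *) printf 'docker.io/library/%s\n' "$ref" ;;
  esac
}

# qualify_list: read references on stdin, strip quotes and blank lines,
# qualify each one, and de-duplicate the result.
qualify_list() {
  tr -d "\"'" | sed '/^[[:space:]]*$/d' | while read -r ref; do
    qualify_ref "$ref"
  done | sort -u
}
```

A typical use would be `qualify_list < images.txt > images.qualified.txt`, followed by a manual review of the result before anything is copied.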
Step 2 Mirror images to the private registry
● External · connected / transfer host → external registry
Use skopeo from a connected host (or a transfer host
with one-way connectivity to both sides).
skopeo copy --all preserves the manifest list so
amd64 and arm64 both work, and copies by digest so you can later
pin by @sha256:.
Still no node-side change — the result lives in
the external registry (registry.internal).
# Direct copy if the connected host can reach both upstream and the air-gap registry
while read -r img; do
  src="docker://$img"
  dst="docker://registry.internal/${img#*/}"   # swap the hostname, keep the upstream path
  skopeo copy --all "$src" "$dst"
done < images.txt
# Two-step copy if the connected host has no path to the air-gap network
# 1. On the connected side: copy to a directory
skopeo copy --all docker://us-docker.pkg.dev/solo-public/.../agentgateway-enterprise:$VER \
dir:/transfer/agentgateway-enterprise-$VER
# 2. Sneakernet the directory across the gap
# 3. On the air-gap side: push from the directory to the registry
skopeo copy --all dir:/transfer/agentgateway-enterprise-$VER \
docker://registry.internal/solo-public/.../agentgateway-enterprise:$VER
Mirror the Helm OCI charts the same way (oras copy).
Verify Cosign signatures before the copy
The mirror is your boundary of trust — verify here, then again at
admission (Step 6). Don't let unsigned bits land in
registry.internal.
# Verify, then copy. Fail closed.
while read -r img; do
  cosign verify --key https://solo.io/cosign.pub "$img" > /dev/null \
    || { echo "REJECTED: $img"; exit 1; }
  skopeo copy --all "docker://$img" "docker://registry.internal/${img#*/}"
  cosign copy "$img" "registry.internal/${img#*/}"   # carry the signature
done < images.txt
Sign the transfer manifest
Capture every digest at mirror time and sign the manifest with the platform-team key. Closes the "did I copy what I think I copied" gap.
# Build the digest inventory, sign it
while read -r img; do
  # Same host-stripped path the copy loop pushed to
  digest=$(skopeo inspect --no-tags --format '{{.Digest}}' "docker://registry.internal/${img#*/}")
  printf '%s@%s\n' "${img%:*}" "$digest"
done < images.txt | sort -u > transfer-manifest-$VER.txt
cosign sign-blob --key platform-team.key transfer-manifest-$VER.txt > transfer-manifest-$VER.sig
# On the air-gap side, before any helm install:
cosign verify-blob --key platform-team.pub \
--signature transfer-manifest-$VER.sig transfer-manifest-$VER.txt
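Verifying the signature shows the manifest is authentic, but not that the mirror actually holds every digest it lists. One way to close that loop is to rebuild the inventory on the air-gap side and compare the two files. This is a sketch only: `missing_from_mirror` is a hypothetical helper, and it assumes the second file was produced by running the same `skopeo inspect` loop against registry.internal.

```shell
#!/bin/sh
# missing_from_mirror SIGNED_MANIFEST MIRROR_INVENTORY
# Prints every repo@digest line present in the signed manifest but absent from
# the inventory taken on the air-gap side. Returns 1 if anything is missing.
missing_from_mirror() {
  signed=$1
  inventory=$2
  tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
  # comm needs both inputs sorted the same way, so sort them here.
  sort -u "$signed" > "$tmp_a"
  sort -u "$inventory" > "$tmp_b"
  missing=$(comm -23 "$tmp_a" "$tmp_b")
  rm -f "$tmp_a" "$tmp_b"
  if [ -n "$missing" ]; then
    printf '%s\n' "$missing"
    return 1
  fi
  return 0
}
```

Running it as a gate before `helm install` keeps the failure in the pipeline rather than surfacing later as an ImagePullBackOff.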
Auditable CI/CD pipeline
Laptop skopeo doesn't scale and leaves no audit
trail. Minimum pipeline:
pull → cosign verify → scan (Trivy/Grype) → human
approval gate → push + cosign copy + sign manifest →
log who/when/digest. The approval is the audit
artefact, not the copy.
Step 3 Enable the containerd hosts directory
● On every node — /etc/containerd/config.toml + containerd restart
First node-side change. Edit config.toml on every
node to enable the certs.d directory, then restart
containerd once. After this one-time restart, all
subsequent mirror changes are hot-reloaded per pull.
config_path and the
hosts.d layout went GA in containerd
1.5. On anything older you're on the deprecated
mirrors block inside config.toml — that
works, but requires a daemon restart per change, doesn't support
per-host capabilities, and you should plan an
upgrade rather than build on it.
version = 2
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
Restart containerd once after this change:
sudo systemctl restart containerd
hosts.toml edits do not require a containerd
restart — containerd re-reads them per pull. That is the
main operational payoff of this layout.
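Because everything after this step depends on that one line, a cheap check in the node-image build can catch a template that lost it. The sketch below is a plain text match, not a TOML parser, so it does not confirm which plugin section the line sits in; `has_config_path` is a hypothetical helper name.

```shell
#!/bin/sh
# has_config_path FILE [DIR]
# Succeeds if FILE contains an uncommented config_path line pointing at DIR
# (default: /etc/containerd/certs.d). Text match only; the section is not checked.
has_config_path() {
  file=$1
  dir=${2:-/etc/containerd/certs.d}
  grep -Eq "^[[:space:]]*config_path[[:space:]]*=[[:space:]]*[\"']${dir}[\"'][[:space:]]*(#.*)?\$" "$file"
}
```

For example, `has_config_path /etc/containerd/config.toml || echo "config_path missing"`. On a live node, `containerd config dump | grep config_path` shows the value containerd actually loaded.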
Step 4 Drop a hosts.toml for every upstream registry
● On every node — files under /etc/containerd/certs.d/
One directory per source hostname (including non-default ports). The agentgateway image set typically needs three: GAR, GCR, and Docker Hub. No containerd restart needed — these files are picked up on the next pull because Step 3 enabled the hosts.d directory.
/etc/containerd/certs.d/
├── us-docker.pkg.dev/hosts.toml
├── gcr.io/hosts.toml
└── docker.io/hosts.toml
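The set of directories can be derived from the image list rather than remembered. As a rough sketch, assuming images.txt holds fully qualified references (one per line), the helpers below print the upstream hosts and flag any that have no hosts.toml yet; both function names are hypothetical.

```shell
#!/bin/sh
# upstream_hosts: read fully qualified image references on stdin and print the
# unique registry hosts, i.e. the directory names needed under certs.d.
upstream_hosts() {
  sed '/^[[:space:]]*$/d' | cut -d/ -f1 | sort -u
}

# missing_hosts_dirs CERTS_DIR: read hosts on stdin and print each one that has
# no CERTS_DIR/<host>/hosts.toml file.
missing_hosts_dirs() {
  certs_dir=$1
  while read -r host; do
    [ -f "$certs_dir/$host/hosts.toml" ] || printf '%s\n' "$host"
  done
}
```

On a node, `upstream_hosts < images.txt | missing_hosts_dirs /etc/containerd/certs.d` should print nothing once Step 4 is complete.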
/etc/containerd/certs.d/us-docker.pkg.dev/hosts.toml
server = "https://us-docker.pkg.dev" # upstream fallback; unreachable in air-gap, kept for clarity
[host."https://registry.internal"]
capabilities = ["pull", "resolve"]
# If the mirror uses a private CA:
ca = "/etc/containerd/certs.d/us-docker.pkg.dev/mirror-ca.crt"
# Keep override_path off so containerd preserves the upstream path on the mirror
# (set to true only if your mirror flattens everything under one project)
# override_path = true
/etc/containerd/certs.d/docker.io/hosts.toml — note Docker Hub's real host
server = "https://registry-1.docker.io"
[host."https://registry.internal"]
capabilities = ["pull", "resolve"]
/etc/containerd/certs.d/gcr.io/hosts.toml
server = "https://gcr.io"
[host."https://registry.internal"]
capabilities = ["pull", "resolve"]
TLS posture
- Public CA — no ca field; node trust store already trusts it.
- Private CA (most prod air-gaps) — ca = "/etc/containerd/certs.d/<host>/mirror-ca.crt".
- mTLS (high-assurance) — adds client = [["client.crt","client.key"]]; client cert is the identity.
# mTLS example
[host."https://registry.internal"]
capabilities = ["pull", "resolve"]
ca = "/etc/containerd/certs.d/us-docker.pkg.dev/mirror-ca.crt"
client = [["/etc/containerd/certs.d/us-docker.pkg.dev/client.crt",
"/etc/containerd/certs.d/us-docker.pkg.dev/client.key"]]
Never set skip_verify = true.
It silently disables TLS verification — any host that wins the
IP race serves images. Fix the CA chain, don't bypass the
check. Block with a node-image lint.
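A node-image lint for this can be very small. The following sketch scans a certs.d tree for any hosts.toml that enables the flag on an uncommented line; `lint_skip_verify` is a hypothetical name, and where it runs (image build, CI, a periodic node check) is left open.

```shell
#!/bin/sh
# lint_skip_verify DIR
# Returns 1 and lists the offending files if any hosts.toml under DIR sets
# skip_verify = true on an uncommented line. Returns 0 otherwise.
lint_skip_verify() {
  dir=$1
  hits=$(find "$dir" -type f -name hosts.toml \
    -exec grep -lE '^[[:space:]]*skip_verify[[:space:]]*=[[:space:]]*true' {} +)
  if [ -n "$hits" ]; then
    printf 'skip_verify enabled in:\n%s\n' "$hits"
    return 1
  fi
  return 0
}
```

In an image build this would be called as `lint_skip_verify /etc/containerd/certs.d || exit 1`, so a bypass never ships.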
CA rotation: only mirror-ca.crt changes on the node — re-read per pull, no containerd restart. Alert at 30 days before expiry.
Registry auth — no plaintext on nodes
# /etc/containerd/config.toml — MUST be 0600 root:root
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.internal".auth]
username = "robot$airgap-puller"
password = "<token>"
chown root:root,
chmod 0600. On RHEL-family, verify the SELinux
label with ls -Z. Default 0644 leaks
the token to every UID on the node, forever.
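The permission requirement is easy to assert mechanically. The sketch below checks only the octal mode and assumes GNU coreutils `stat -c`; `check_mode` is a hypothetical helper, and ownership or SELinux labels would need their own checks.

```shell
#!/bin/sh
# check_mode FILE EXPECTED_MODE
# Succeeds only if FILE's octal permission bits equal EXPECTED_MODE
# ("0600" and "600" are treated the same). Assumes GNU coreutils stat.
check_mode() {
  expected=${2#0}
  actual=$(stat -c '%a' "$1") || return 2
  [ "$actual" = "$expected" ]
}
```

For example, `check_mode /etc/containerd/config.toml 0600 || echo "config.toml is too permissive"`.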
Inject the token via one of:
cloud-init from a KMS-encrypted blob (decrypt with node
instance identity, rotate by re-encrypting) ·
Vault Agent on the node (short-lived leases, native
rotation, renders config.toml from template) ·
sops + age key baked into the node image (rotate by
redeploying the image) ·
sealed-secrets / ESO + privileged DS (cluster-up only,
not a Day-0 path). The biggest regression we see in the field
is "we set it up once and the token never rotated" — pick a
pattern that has rotation in the loop.
Step 5 Choose a delivery mechanism for the node files
● Decides how the Step 3 + Step 4 files reach every node
Steps 3 and 4 describe what sits on each node. This step
is about how it gets there and how it survives node
replacement. This is the deep-dive for the baked
hosts.toml path (Option 1 in the delivery
matrix above). If you picked Option 2 or 3, the distro /
bootstrap layer fans the same files out for you and you can
skip this step.
In order of preference for production durability:
- Bake into the node image — Packer / Ignition / cloud-init at AMI build time. Survives node replacement; no Day-2 reconciliation needed.
- MachineConfig (OpenShift) or KubeadmConfigTemplate / KubeletConfiguration (Cluster API) — declarative, survives node replacement.
- Privileged DaemonSet writing to /etc/containerd/certs.d via hostPath — works on any cluster, fastest to deploy, but the files are lost the moment a node is replaced and the DS hasn't reconciled. Acceptable as a Day-0 bootstrap; not a long-term source of truth.
Hardening the privileged DaemonSet (if you use it)
A privileged: true + hostPID: true pod
mounting the containerd socket can do anything to anything on
the node. Same applies to the upgrade-time pre-pull DS later in
this article. Constrain it:
- One namespace, PodSecurity privileged profile — give the bootstrap a dedicated namespace (e.g. airgap-bootstrap) labelled pod-security.kubernetes.io/enforce=privileged. Don't let any other workload land there.
- Job-per-node, not a perpetual DS — use a kind: Job with a node affinity per node and a ttlSecondsAfterFinished so the pod evaporates once it's done. Sweep with a CronJob if you want a reconciliation loop.
- Pinned digest + signed image — never busybox:latest; pin to registry.internal/utils/busybox@sha256:... and verify the signature at admission (see the Admission-time verification step below). If a CVE lands in crictl or busybox, an unsigned :latest can pick up the compromise on the next pod restart.
- NetworkPolicy on the namespace allowing only the mirror egress — the bootstrap pod should not be able to reach the cluster API, the cloud metadata service, or anywhere else.
- RBAC for the ServiceAccount: nothing beyond what the bootstrap actually needs (no cluster-admin, no system:masters).
A perpetual privileged DaemonSet carrying
crictl turns the bootstrap path into a node-takeover
primitive. Prefer a Job that deletes itself.
Step 6 Admission-time signature verification
● Cluster-side — admission controller (Kyverno / Sigstore policy-controller / Connaisseur)
Step 2 verified signatures on the way into the mirror.
Step 6 verifies signatures on the way out — at pod
admission time. Without this, the mirror is a dumb cache and
anyone with push rights to registry.internal can
silently replace an image. With it, the gateway has two
independent checks: was the image signed by Solo when we
mirrored it (Step 2) and was it signed by Solo
when we ran it (Step 6).
Option A — Kyverno verifyImages
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-solo-images
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-solo-cosign
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces:
                - agentgateway-system
                - kgateway-system
                - istio-system
                - kagent-system
      verifyImages:
        - imageReferences:
            - "registry.internal/solo-public/*"
          attestors:
            - count: 1
              entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <Solo's cosign.pub here>
                      -----END PUBLIC KEY-----
          mutateDigest: true   # rewrite tag→digest so pods can't drift
          verifyDigest: true
          required: true
Option B — Sigstore policy-controller (ClusterImagePolicy)
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: verify-solo-images
spec:
  images:
    - glob: "registry.internal/solo-public/**"
  authorities:
    - key:
        data: |-
          -----BEGIN PUBLIC KEY-----
          <Solo's cosign.pub here>
          -----END PUBLIC KEY-----
Option C — Connaisseur
Connaisseur is the third widely-used option, especially in Notation / TUF shops. Configuration shape is similar — declare the image glob and the trusted public key; failure mode is "admission webhook rejects the pod".
Match the mirror glob
(registry.internal/solo-public/**), not the upstream
glob. The mirror is the path images actually arrive on; matching
the upstream glob lets an attacker bypass the policy by pushing
to an unmatched repo.
Tradeoff: admission webhooks add latency to pod
creation (typically < 100 ms with caching). For Kyverno,
failurePolicy: Fail is the right setting — better to
block deploys than to fail open. Combine with a
webhookTimeoutSeconds high enough to absorb a slow
sigstore lookup, but not so high that a stuck webhook stalls
every Pod create.
Step 7 Egress controls — prove the upstream is unreachable
● Network layer — NetworkPolicy / node firewall / cluster egress gateway
A passing journalctl grep for "no us-docker.pkg.dev"
proves absence-of-evidence, not evidence-of-absence. Enforce at
the network layer that nodes cannot reach public
registries, then actively test that the enforcement holds.
Layer 1 — node-level egress firewall
Most production air-gaps already have this at the perimeter,
but it's worth confirming. Node SGs / VPC firewall rules /
on-prem allow-list should permit egress to
registry.internal only (plus DNS, NTP, OS-update
mirror).
Layer 2 — Kubernetes NetworkPolicy (cluster-internal egress)
# Default-deny egress for the namespaces that host bootstrap / pre-pull pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress
  namespace: airgap-bootstrap
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress: []
---
# Then explicitly allow the mirror + DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-mirror-egress
  namespace: airgap-bootstrap
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - ipBlock:
            cidr: 10.50.0.0/24   # registry.internal subnet
      ports:
        - protocol: TCP
          port: 443
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP   # DNS falls back to TCP for large responses
          port: 53
Layer 3 — active bypass test
Periodically run a probe that tries to reach a public registry from inside the cluster. The test should fail, and the failure mode (NXDOMAIN, connection-refused, route blackhole) tells you which control is actually doing the work:
# Schedule as a CronJob — alerts if the probe ever succeeds
apiVersion: batch/v1
kind: CronJob
metadata:
  name: airgap-bypass-probe
  namespace: airgap-bootstrap
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: probe
              image: registry.internal/utils/curl:8.5
              command: ["/bin/sh","-c"]
              args:
                - |
                  # If ANY of these succeed, the air-gap is leaking.
                  # No -f here: public registries answer /v2/ with 401, and any
                  # HTTP response at all means the host is reachable.
                  for h in us-docker.pkg.dev gcr.io registry-1.docker.io quay.io; do
                    if curl --max-time 5 -sS -o /dev/null "https://$h/v2/" 2>/dev/null; then
                      echo "LEAK: $h reachable from cluster"
                      exit 2
                    fi
                  done
                  echo "ok — no public registries reachable"
Wire the exit 2 condition to your alerting pipeline.
A leak is a Sev-1: it means an attacker on a compromised
workload can pull from anywhere.
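To record which control blocked each request, the probe can translate curl's exit status into a failure mode. This is a sketch only: `classify_probe` is a hypothetical helper, and the mapping from exit code to control is an interpretation that should be checked against your own network design.

```shell
#!/bin/sh
# classify_probe CURL_EXIT_CODE
# Maps a curl exit status to the most plausible reason the request did or did
# not get through. Codes follow curl's documented exit statuses.
classify_probe() {
  case "$1" in
    0)     echo "LEAK: HTTP response received" ;;
    22)    echo "LEAK: HTTP error status received, so the host is reachable" ;;
    6)     echo "blocked: DNS did not resolve the host" ;;
    7)     echo "blocked: connection refused or rejected" ;;
    28)    echo "blocked: timed out (packets dropped or route blackholed)" ;;
    35|60) echo "suspect: TLS failed, so something answered on 443" ;;
    *)     echo "unknown: curl exit $1" ;;
  esac
}
```

Inside the probe loop this would look like `curl --max-time 5 -sS -o /dev/null "https://$h/v2/"; classify_probe $?`, with the output shipped to the same alerting pipeline.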
Upgrades and image lifecycle
The hard case is the rolling upgrade — both v1 and
v2 images get pulled from the mirror concurrently
for hours to days. Three rules:
- Serve both N and N-1 for the full rollout. Push v2, immediately GC v1, and any restart / reschedule / rollback on a not-yet-drained node ImagePullBackOffs. The most common air-gap upgrade failure.
- Surge upgrades amplify the overlap window (maxSurge > 0 or rolling-replacement node pools — new node up before old node drains, both versions pulling at once).
- Istio sidecar mode needs its own retention rule — see below.
Istio sidecar retention
Sidecar-mode Istio: existing app pods keep running the
v1 sidecar until the application pod is
restarted. Any v1-sidecar pod that gets rescheduled,
OOM-killed, evicted or drained pulls v1 from the
mirror — if you've already GC'd v1, it
ImagePullBackOffs.
Mirror v1 until every workload has restarted
on v2. Probe:
# Pods still running v1 — must return empty before GC'ing v1
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{range .spec.containers[?(@.name=="istio-proxy")]}{.image}{"\n"}{end}{end}' \
| grep ':1.25.0$'
Ambient (ztunnel) avoids this — ztunnel is per-node and rolls with the node, so the retention window matches the node rollout instead of the application-restart cadence.
Retention policy
- N (current): always.
- N-1: always during rollout; never delete during an active rollout (rollback depends on it).
- N-2: through the soak period (typically 2–4 weeks after N is fully rolled).
- Older: archive to cold storage rather than delete — storage is cheap, audit trail + CVE forensics are valuable.
Harbor and Quay both enforce retention by tag pattern + count + age. Configure once and stop hand-managing.
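Before encoding the policy in registry rules, it can help to dry-run it against the tags actually present. The sketch below labels a list of version tags by the N / N-1 / N-2 / older scheme. It assumes tags are plain version strings, that N is simply the highest version in the list (which may not be the one deployed), and that `sort -V` is available; `retention_plan` is a hypothetical name.

```shell
#!/bin/sh
# retention_plan: read version tags on stdin (one per line) and label each one
# according to the N / N-1 / N-2 / older policy, newest first.
retention_plan() {
  sort -Vru | awk '
    NR == 1 { print $0 "\tkeep (N)"; next }
    NR == 2 { print $0 "\tkeep (N-1)"; next }
    NR == 3 { print $0 "\tkeep through soak (N-2)"; next }
            { print $0 "\tarchive" }'
}
```

Feeding it the tag list for one repository gives a reviewable plan; nothing here deletes or moves anything.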
Pre-flight: enumerate the new version
helm pull oci://us-docker.pkg.dev/solo-public/.../enterprise-agentgateway \
  --version "$NEW_VER" --untar -d ./new
helm template ./new/enterprise-agentgateway -f values.yaml \
  | yq -r '.. | .image? // empty' | sort -u > images-new.txt
# Diff by digest, not by repo:tag — a tag re-pushed with new content keeps the
# same repo:tag but is a different artefact, and a repo:tag diff would miss it.
resolve_digests () {
  while read -r ref; do
    digest=$(skopeo inspect --no-tags --format '{{.Digest}}' "docker://$ref")
    printf '%s@%s\n' "${ref%:*}" "$digest"
  done
}
resolve_digests < images-new.txt | sort -u > images-new-digests.txt
resolve_digests < images-current.txt | sort -u > images-current-digests.txt
comm -23 images-new-digests.txt images-current-digests.txt > images-to-mirror.txt
Mirror images-to-mirror.txt before
the Helm release. The Helm upgrade is then a no-network
operation.
Pre-pull onto nodes (optional but worth it)
For large clusters or fragile mirror links: warm every node's
local content store before rolling, so the rollout itself
doesn't depend on registry availability. A privileged Job-per-node
with ttlSecondsAfterFinished is cleaner than a
perpetual DaemonSet — same effect, no lingering primitive (see
Step 5 hardening).
# Per-node Job — runs once, deletes itself
apiVersion: batch/v1
kind: Job
metadata: { name: prepull-$NODE, namespace: airgap-bootstrap }
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      nodeName: $NODE
      containers:
        - name: prepull
          image: registry.internal/utils/crictl:1.30   # signed; pin by digest in production
          command: ["/bin/sh","-c","for i in $IMAGES; do crictl pull $i; done"]
          env: [{ name: IMAGES, value: "registry.internal/...:$NEW_VER ..." }]
          securityContext: { privileged: true }
          volumeMounts: [{ name: crisock, mountPath: /run/containerd/containerd.sock }]
      volumes: [{ name: crisock, hostPath: { path: /run/containerd/containerd.sock } }]
What does NOT change at upgrade time
The mirror config on each node (hosts.toml / IDMS /
registries.yaml) maps upstream→mirror, not
image→mirror — only the mirror's contents
change at upgrade. The exception is a new upstream registry
(e.g. Solo adds a new GAR project) — that is a node
config change, push it via the same delivery mechanism you used
for the initial setup.
Digest pinning (high-assurance only) & rollback
Tags can be retagged; digests can't. If immutability matters,
resolve every tag to @sha256: at mirror time and pin
Helm/manifest references by digest — accept noisy upgrade diffs
as the cost of supply-chain integrity.
Rollback works iff N-1 is still in the mirror —
helm rollback recovers because v1 images
remain in registry.internal and in most
nodes' local content stores (containerd doesn't GC recently-run
images).
HA and DR for the mirror
Every node depends on the mirror for every pull. "One per air-gap network is enough" is right for steady state, wrong for the failure modes that page someone.
- Topology: single instance (lab only) · active-passive with replicated blob store (most production air-gaps) · active-active behind LB on shared object storage (multi-cluster). Layer the multi-mirror trick in hosts.toml on top — list registry-primary.internal and registry-secondary.internal in declared order; containerd tries them in order.
- Backup both stores: blob store (S3-compatible — versioning + lifecycle) and metadata DB (Harbor Postgres / Zot embed / Artifactory). Restore in lockstep.
- RTO/RPO targets in writing: RTO-read typically 5–15 min on hot standby; RTO-write hours-tolerable; RPO zero with sync replication, daily snapshots typical for object-storage layouts.
- Behaviour when unreachable: image already on the node → runs (local cache); new pull → fails fast (the mirror is configured to fail closed in air-gap — no upstream fallback attempt); flaky mirror + imagePullPolicy: Always → tail-latency balloon.
- The actual saving grace: node-local pre-pull (Pattern B / DaemonSet warmer) means the mirror only matters for new images and scale-out. Time the pre-pull before any production rollout.
Observability of the mirror itself
The verification step uses journalctl grep. That's
fine on day one for one node. On a 200-node cluster you want this
running continuously, with alerts.
Instrument: mirror hit/miss ratio (registry metrics) ·
pull failure rate by image (registry 4xx/5xx) ·
pull latency p50/p95/p99 by node (probe + access logs) ·
containerd attempts not hitting the mirror
(journalctl / Falco / eBPF) ·
mirror disk utilisation · CA expiry probe ·
admission verification failures (Kyverno / policy-controller).
Minimum alerts (Sev-1 paging unless noted): p99 pull latency > 5 s for 5 min · pull failure rate > 1 % for any image (ticket) · any non-mirror outbound 443 from a cluster node · mirror disk > 80 % · CA expiry < 30 days (ticket) · admission policy rejection of a Solo image.
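To make the per-image failure-rate alert concrete, here is a rough sketch that computes it from registry access logs. The input format (one "image status-code" pair per line) and the function name failing_images are assumptions; real Harbor, Zot or Artifactory logs need their own parsing step first.

```shell
# Read "image status" lines on stdin and print every image whose share of
# 4xx/5xx responses exceeds the threshold (percent, default 1).
failing_images() {
  awk -v max="${1:-1}" '
    { total[$1]++; if ($2 >= 400) bad[$1]++ }
    END {
      for (img in total) {
        rate = 100 * bad[img] / total[img]
        if (rate > max) printf "%s %.1f\n", img, rate
      }
    }'
}
```

In practice the same ratio would normally be expressed as an alert rule on the registry's own metrics; the shell form is only useful for spot checks against a log extract.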
Verification
A green pull via crictl plus pods Running with their
upstream image references is the success signal:
the runtime-layer mirror is doing its job and the application
layer is unchanged.
Getting onto a node without SSH
Most platform teams don't SSH to nodes any more. Use
kubectl debug node/... to drop a privileged pod onto
the node and chroot /host into the node filesystem:
# Spawn a debug pod scheduled on a specific node
kubectl debug node/<node-name> -it --image=registry.internal/utils/busybox:1.36 -- chroot /host
# Once inside the node fs, run the steps below — crictl, journalctl,
# /etc/containerd/certs.d are all available as if you SSH'd in.
# Alternative: nsenter into the host's namespaces (PID 1) via a privileged pod.
# The container must be privileged, otherwise nsenter is denied.
kubectl run nsenter --rm -it --restart=Never \
--image=registry.internal/utils/busybox:1.36 \
--overrides='{"spec":{"hostPID":true,"containers":[{"name":"x","image":"registry.internal/utils/busybox:1.36","stdin":true,"tty":true,"securityContext":{"privileged":true},"command":["nsenter","--target","1","--mount","--uts","--ipc","--net","--pid","--","sh"]}]}}'
Run the debug pod in a namespace that permits the
Pod Security privileged level (pass -n with that namespace). By default,
kubectl debug node/... creates the pod in the current
context's namespace, usually default, which on a hardened cluster will be
rejected by admission.
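One way to set that up is sketched below. The namespace name node-debug and the run/debug_node helpers are hypothetical, and the sketch assumes your cluster enforces Pod Security Admission through namespace labels; clusters using Kyverno or another policy engine need the equivalent exemption there instead.

```shell
# NS is a hypothetical namespace name; pick one that fits your conventions.
NS=${NS:-node-debug}

# Print the commands when DRY_RUN is set, run them otherwise.
run() { if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi; }

debug_node() {
  dn_node=$1
  # Create the namespace (fails harmlessly if it already exists)
  run kubectl create namespace "$NS"
  # Allow privileged pods in this namespace only
  run kubectl label namespace "$NS" \
    pod-security.kubernetes.io/enforce=privileged --overwrite
  # Start the node debug pod in that namespace instead of default
  run kubectl debug "node/$dn_node" -n "$NS" -it \
    --image=registry.internal/utils/busybox:1.36 -- chroot /host
}
```

Running DRY_RUN=1 debug_node <node-name> prints the three commands for review before anything touches the cluster.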
# 1. Confirm hosts.toml is in place on a representative node
ls /etc/containerd/certs.d/
cat /etc/containerd/certs.d/us-docker.pkg.dev/hosts.toml
# 2. Force a pull through containerd (bypasses kubelet caching)
sudo crictl pull us-docker.pkg.dev/solo-public/enterprise-agentgateway/agentgateway-enterprise:2026.5.0
# 3. Confirm it actually hit the mirror, not the upstream
sudo journalctl -u containerd --since "5 minutes ago" \
| grep -E 'registry\.internal|us-docker\.pkg\.dev'
# Expect lines referencing registry.internal; no outbound 443 attempts to us-docker.pkg.dev
# 4. Install Solo charts with their stock values — no image overrides
helm install enterprise-agentgateway-crds \
oci://us-docker.pkg.dev/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway-crds \
--version $VER -n agentgateway-system --create-namespace
helm install enterprise-agentgateway \
oci://us-docker.pkg.dev/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway \
--version $VER -n agentgateway-system
# NOTE: helm does NOT use containerd's hosts.toml for OCI pulls.
# helm uses oras-go and reads ~/.config/helm/registry/config.json
# + the system trust store. Configure helm registries independently,
# or — simpler — point helm install directly at the mirror:
#
# helm install enterprise-agentgateway \
# oci://registry.internal/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway \
# --version $VER -n agentgateway-system
# 5. Verify pods come up with their original image references intact —
# they should, because the mirror is invisible at the manifest layer
kubectl get pods -n agentgateway-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{"\n"}{end}{end}'
# Expect images like us-docker.pkg.dev/... — and pods Running, not ImagePullBackOff
Pre-upgrade verification
Before bumping a Helm release, confirm both the current and the new
image tags resolve through the mirror — on the same node, in
sequence. This catches the most common upgrade failure (new images
not mirrored yet) before it surfaces as
ImagePullBackOff on a half-drained node.
# Current version still resolves
sudo crictl pull us-docker.pkg.dev/solo-public/.../agentgateway-enterprise:$CURRENT_VER
# New version resolves (proves mirror push succeeded)
sudo crictl pull us-docker.pkg.dev/solo-public/.../agentgateway-enterprise:$NEW_VER
# Optionally: every new image from images-to-mirror.txt
xargs -a images-to-mirror.txt -I{} sudo crictl pull {}
Only proceed with helm upgrade after every line in
images-to-mirror.txt pulls clean.
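The xargs one-liner stops short of telling you which images failed. A slightly longer sketch that reports each failure and returns a usable exit status follows; the function name check_images and the PULL variable are assumptions, and the file is assumed to hold one image reference per line, with blank lines and # comments ignored.

```shell
# Pull command; override for a dry run, e.g. PULL=echo
PULL=${PULL:-sudo crictl pull}

# Pull every image listed in the file; report failures on stderr and
# return non-zero if any pull failed.
check_images() {
  ci_list=$1
  ci_failed=0
  while IFS= read -r ci_image || [ -n "$ci_image" ]; do
    case $ci_image in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    # stdin is redirected so the pull command cannot consume the list
    if ! $PULL "$ci_image" </dev/null >/dev/null 2>&1; then
      echo "FAILED: $ci_image" >&2
      ci_failed=$((ci_failed + 1))
    fi
  done < "$ci_list"
  [ "$ci_failed" -eq 0 ]
}
```

Gating the upgrade on check_images images-to-mirror.txt means helm upgrade only starts when the exit status is zero.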
Related
- Upgrade Solo Enterprise for Agentgateway: apply this mirror config first, then upgrade with stock Helm values.
- containerd registry configuration reference: full hosts.toml field reference.
- OpenShift ImageDigestMirrorSet / ImageTagMirrorSet: config.openshift.io/v1 reference. OpenShift only; the Machine Config Operator applies it. No equivalent CRD or controller in upstream Kubernetes.
- skopeo copy: multi-arch and digest-preserving image transfer.
Treat this article as a reference shape — the runtime-layer mirror pattern is sound, and the Solo-specific image enumeration step in Step 1 is the one to validate against your install before promoting to production.