Air-gapped image management with containerd registry mirrors, by Tom O'Rourke

⚠️

Validate the image enumeration against your target Solo version. The container-runtime mirror pattern itself is established and in production at many customers. The Solo-specific element that needs lab confirmation per release is the image set — Step 1 should always be rerun against your exact target version of kgateway, agentgateway, Solo Enterprise for Istio and kagent, because transitive images (Redis, ext-auth, rate-limiter, model runners) change between releases. Treat the mechanics below as solid; treat the image list as version-specific.

The vendor airgap docs typically tell you to override every image's registry/repository/tag field via per-product CRs or Helm values. That works for a single product on a small cluster. As soon as you're installing the full Solo stack — gateway + mesh + agentgateway + kagent, often with Gloo Operator managing Istio Ambient and east-west waypoints — the override surface multiplies, and any miss is an ImagePullBackOff. The runtime-layer redirect is the fewer-footguns alternative; the rest of this page is the mechanics for that.

Why not just override the Helm values?

Both approaches end up pulling from registry.internal. The difference is how many surfaces you have to touch and how many places a miss can hide.

Mechanics either way: skopeo copy --all docker://upstream/path:tag docker://registry.internal/path:tag preserves digest, multi-arch manifest list and tag — only the hostname changes. From there:

Helm-values / per-CR overrides — surfaces you have to touch

kgateway Helm values — controller + data-plane images
agentgateway Helm values and EnterpriseAgentgatewayParameters CR — the controller injects the data-plane image from a separate field
Solo Enterprise for Istio: istiod chart + istiod's injection templates (sidecar, ztunnel, waypoint)
Gloo Operator — the operator that automates Istio Ambient including east-west waypoint provisioning has its own image fields for ztunnel + waypoint, separate from istiod's
kagent Helm values + CR fields for model-runner sidecars
Transitive deps (Redis, ext-auth, rate-limiter) — sometimes parameterised, sometimes not
Anything a controller generates at runtime that isn't surfaced as a parameter

Runtime-layer redirect — surfaces you have to touch

One per-node config (containerd hosts.toml, or CAPI files: block, or OpenShift IDMS object — pick one in the next section)

Every pull, from every chart, from every controller, lands on the mirror automatically. Charts and CRs install with stock values. Image references in the deployed YAML stay canonical (us-docker.pkg.dev/...) — SBOMs, Cosign signatures and audit logs all line up with what Solo published.
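As a minimal sketch of that mapping: the redirect amounts to swapping the registry hostname while leaving path, tag and digest untouched. The function name mirror_ref, the MIRROR variable and the example image path are illustrative assumptions, not part of any Solo or containerd tooling.

```shell
# Illustrative only: mirror_ref and MIRROR are hypothetical names.
# Assumes the reference is fully qualified (registry hostname first).
MIRROR="${MIRROR:-registry.internal}"

mirror_ref() {
  # Drop everything up to the first "/" (the upstream hostname), keep the rest.
  printf '%s/%s\n' "$MIRROR" "${1#*/}"
}

# Hypothetical image path, for illustration:
mirror_ref "us-docker.pkg.dev/example-project/example-image:1.0"
# -> registry.internal/example-project/example-image:1.0
```

The pod spec keeps the left-hand form; only the node's pull request is rewritten, which is why signatures and SBOMs keyed on the canonical name still line up.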

Rule of thumb: single Solo product, small cluster, no operator-managed dataplane — Helm values are fine. Multi-product install (especially with Gloo Operator managing Ambient + waypoints) — runtime redirect, because every operator-generated proxy / sidecar / waypoint image becomes another override surface you have to find and chase across upgrades.

What this pattern actually does

The cluster is already in a tightly controlled air-gapped environment: no public-internet path, vetted ingress, audited change control. The mirror pattern doesn't replace any of that; it's a routing control that makes the rest of the install behave correctly inside that environment. The summary below separates what the mirror itself gives you from what the later steps add on top.

What the mirror does: redirects every image pull from us-docker.pkg.dev / gcr.io / docker.io to registry.internal, transparently. Charts and CRs install unmodified; references stay canonical so SBOMs, Cosign signatures and audit logs all line up with what Solo published.
What the rest of the article adds on top:
- Provenance — Cosign verify on the pull path (Step 2), platform-signed transfer manifest (Step 2), admission-time verify (Step 6).
- Air-gap enforcement — egress firewall + bypass probe so public registries are demonstrably unreachable, not just assumed unreachable (Step 7).
- Credential handling — TLS / mTLS / private CA, file permissions, secret-injection patterns; the goal is no long-lived plaintext token on any node (Step 4).
- Bootstrap hygiene — Job-per-node with TTL instead of a perpetual privileged DaemonSet, PSA + NetworkPolicy on the bootstrap namespace (Step 5).
- Availability — mirror HA, node-local pre-pull so a brief mirror outage during a rollout doesn't break running pods (HA & DR).

One thing worth highlighting up front — admission-time signature verification (Step 6) is the control most often missed. Without it, anyone with push rights to registry.internal can swap an image and the cluster will run it. Pull-time Cosign (Step 2) closes the public → mirror gap; admission-time Cosign closes the mirror → workload gap. Both, not one.

Where the registry lives — and the catch-22

The mirror registry must be reachable before the cluster needs to pull its first image. Two patterns — pick the first one unless an explicit constraint forces otherwise.

Pattern A — external to the cluster

Dedicated Harbor / Quay / Zot / Artifactory host on the air-gap network. No catch-22: the registry is up before any cluster node boots, so the kubelet's first pull (pause, CNI, kube-proxy, the rest) goes straight through the mirror. This is how every production air-gap install I've seen is built.

Pattern B — pre-loaded into the node template

Push every image into containerd's content store at node-image build time — no in-cluster registry, no external registry on the pull path at runtime. Pro: faster standup, doesn't rely on an external registry being reachable when the cluster boots. Con: disk space — every node carries the full image set, and the node image grows with every Solo release.

What lives where

Two clean tiers: a private registry outside the cluster, and a small set of files on every cluster node that point containerd at it. Nothing else changes.

Component	Location	Notes
Private registry (Harbor / Quay / Zot / Artifactory)	External to the cluster, on the air-gap network	One per air-gap network is enough; HA optional
All mirrored container images	External registry	Solo images + every transitive dep (Redis, ext-auth, rate-limiter, future additions)
Mirrored Helm OCI charts	External registry	`helm pull` then `helm push oci://...`
Signing / SBOM artifacts	External registry	Cosign sigs, attestations — kept alongside images
`containerd` `config.toml` change	On every node	One line: `config_path = "/etc/containerd/certs.d"`
/etc/containerd/certs.d/<host>/hosts.toml	On every node	One file per upstream registry being mirrored
Mirror TLS CA cert	On every node	Only if the registry uses a private CA
Pre-loaded bootstrap images (`pause`, kubelet sidecars)	On every node	Only if nodes can't reach the registry until kubelet is up

Pick your delivery path

Three mechanisms put the same mirror configuration onto every node. Option 1 is the underlying containerd config the whole article documents; Options 2 and 3 are convenience layers that generate it automatically for CAPI and OpenShift clusters respectively. The image-set and registry mechanics in Steps 1 and 2 below are identical in every path — only the delivery layer changes.

Option	Use	API	What ends up on the node
1 — Containerd configuration (foundation)	Hand-written `config.toml` + `hosts.toml` files	n/a — Packer / Ignition / cloud-init / MachineConfig	The actual `/etc/containerd/certs.d/` layout this article documents in Steps 1–7
2 — Cluster API (vanilla kubeadm under CAPI)	`KubeadmConfigTemplate` with `files:` blocks	Cluster API	Bootstrap provider writes `hosts.toml` at node init
3 — OpenShift only	`ImageDigestMirrorSet` + `ImageTagMirrorSet`	`config.openshift.io/v1` — not upstream Kubernetes	Machine Config Operator rolls nodes, writes runtime config. No upstream equivalent on vanilla kubeadm.

All three options end with an equivalent mirror configuration on the node: containerd hosts.toml files for Options 1 and 2, and CRI-O's registries.conf, written by the Machine Config Operator, for Option 3. The rest of this article focuses on Option 1.

What to do — Containerd configuration

Enumerate the image set, mirror it with Cosign verification (Steps 1 and 2 below).
Enable the config_path hosts.d layout in /etc/containerd/config.toml (Step 3) — restart containerd once; everything after this is hot-reloaded.
Drop one hosts.toml per upstream registry under /etc/containerd/certs.d/<host>/ (Step 4), with TLS / mTLS posture and auth handled per Step 4.
Pick how the files reach every node (Step 5) — Packer / Ignition / cloud-init at node-image build (preferred), MachineConfig / CAPI, or a Job-per-node for Day-0 only.
Layer admission verification + egress lockdown on top (Steps 6 and 7).

This is the centre of gravity of the article: Steps 1–7 below are this option. Options 2 and 3 are convenience layers that generate the same on-node config automatically; pick one of those if you happen to be on a platform that provides it.

What to do — Cluster API

Add a KubeadmConfigTemplate with one files: entry per /etc/containerd/certs.d/<host>/hosts.toml.
Reference the CA bundle (and any auth token) via contentFrom.secret — don't inline credentials in the template body.
Add preKubeadmCommands: ["systemctl restart containerd"] so the runtime picks up config_path on first boot.
Bind the template to the worker MachineDeployment (or MachinePool). Every new machine gets the same Option-1 files at bootstrap; node replacement is self-healing.

apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata: { name: solo-airgap-workers, namespace: default }
spec:
  template:
    spec:
      files:
      - path: /etc/containerd/certs.d/us-docker.pkg.dev/hosts.toml
        owner: root:root
        permissions: "0644"
        content: |
          server = "https://us-docker.pkg.dev"
          [host."https://registry.internal"]
            capabilities = ["pull", "resolve"]
            ca = "/etc/containerd/certs.d/mirror-ca.crt"
      - path: /etc/containerd/certs.d/mirror-ca.crt
        owner: root:root
        permissions: "0644"
        contentFrom:
          secret: { name: mirror-ca-bundle, key: ca.crt }
      preKubeadmCommands:
      - "systemctl restart containerd"

Convenience layer over Option 1 for CAPI-managed kubeadm clusters. contentFrom.secret keeps secrets out of the template body.

What to do — OpenShift IDMS + ITMS

kubectl apply an ImageDigestMirrorSet and an ImageTagMirrorSet — most Solo Helm charts reference images by tag, so you need both.
Set mirrorSourcePolicy: NeverContactSource on every entry. This is the single most important air-gap flag — fails closed if the mirror is unreachable, no 30 s upstream timeout.
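The Machine Config Operator rolls each node automatically and writes the equivalent mirror entries into the node's container runtime config (CRI-O's registries.conf on OpenShift, the counterpart of the containerd files in Option 1). No file edits.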
Validate: crictl pull on a representative node after the MCO finishes the roll.

# Digest pulls — safe by default, immutable identity
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata: { name: solo-airgap-digest }
spec:
  imageDigestMirrors:
  - source:  us-docker.pkg.dev/solo-public/enterprise-agentgateway
    mirrors: [registry.internal/solo-public/enterprise-agentgateway]
    mirrorSourcePolicy: NeverContactSource
  - source:  us-docker.pkg.dev/solo-public/istio
    mirrors: [registry.internal/solo-public/istio]
    mirrorSourcePolicy: NeverContactSource
  - source:  docker.io/library
    mirrors: [registry.internal/dockerhub-library]
    mirrorSourcePolicy: NeverContactSource
---
# Tag pulls — needed for Solo Helm charts
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata: { name: solo-airgap-tag }
spec:
  imageTagMirrors:
  - source:  us-docker.pkg.dev/solo-public/enterprise-agentgateway
    mirrors: [registry.internal/solo-public/enterprise-agentgateway]
    mirrorSourcePolicy: NeverContactSource

OpenShift only. IDMS / ITMS live in config.openshift.io/v1 and are applied by the Machine Config Operator. There is no upstream Kubernetes equivalent CRD or controller: on vanilla kubeadm, kubectl apply rejects these manifests with a "no matches for kind" error, and nothing reaches the nodes. If you're not on OpenShift, use Option 1 or 2.

Step 1 Enumerate every image (list of images provided by Solo)

● External · connected admin host

Start from the image list Solo publishes for your version, then extract the real list from the charts you actually intend to install — that catches transitive images and image references the controller generates at runtime. Nothing touches the cluster nodes in this step — output is a plain text file (images.txt) on the admin host.

export VER=2026.5.0

# Pull every chart you intend to install
helm pull oci://us-docker.pkg.dev/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway \
  --version $VER --untar
helm pull oci://us-docker.pkg.dev/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway-crds \
  --version $VER --untar
# Repeat for kgateway, istio, kagent, gateway-api CRDs, etc.

# Render with realistic values and extract every image reference
helm template enterprise-agentgateway ./enterprise-agentgateway \
  -f values.yaml \
  | yq -r '.. | .image? // empty' \
  | grep -v '^$' | sort -u > images.txt

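Some images only appear in the controller's emitted proxy template (data-plane proxies, shared extensions). Capture those by reading the controller's defaults ConfigMap from a connected dev cluster, keeping only the value after image: or repository: so that images.txt stays a list of bare references (a repository: value has no tag, so complete it by hand):

kubectl get configmap -n agentgateway-system -o yaml \
  | sed -nE 's/^[[:space:]-]*(image|repository):[[:space:]]*"?([^"[:space:]]+)"?.*$/\2/p' \
  | sort -u >> images.txt
sort -u -o images.txt images.txt   # de-duplicate against the Helm-rendered list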
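Rendered charts sometimes carry Docker short names (for example a bare redis:7), which have no hostname to map onto a hosts.toml directory. A small normalisation pass, sketched below, applies Docker's short-name convention so every entry names its registry. The helper name normalize_ref and the output file images.normalized.txt are assumptions for illustration.

```shell
# Illustrative helper: normalize_ref is a hypothetical name, not a Helm or skopeo feature.
# Applies Docker's short-name convention so every reference carries a registry host.
normalize_ref() {
  ref="$1"
  first="${ref%%/*}"
  case "$ref" in
    */*)
      case "$first" in
        *.*|*:*|localhost) printf '%s\n' "$ref" ;;            # already has a registry host
        *)                 printf 'docker.io/%s\n' "$ref" ;;   # user/repo on Docker Hub
      esac ;;
    *) printf 'docker.io/library/%s\n' "$ref" ;;               # bare official image
  esac
}

# Write a normalised, de-duplicated copy (skipped if images.txt is absent).
if [ -f images.txt ]; then
  while IFS= read -r img; do
    if [ -n "$img" ]; then normalize_ref "$img"; fi
  done < images.txt | sort -u > images.normalized.txt
fi
```

Review the normalised file by eye before mirroring; the point is to make surprises visible, not to hide them.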

Step 2 Mirror images to the private registry

● External · connected / transfer host → external registry

Use skopeo from a connected host (or a transfer host with one-way connectivity to both sides). skopeo copy --all preserves the manifest list so amd64 and arm64 both work, and copies by digest so you can later pin by @sha256:. Still no node-side change — the result lives in the external registry (registry.internal).

# Direct copy if the connected host can reach both upstream and the air-gap registry
while IFS= read -r img; do
  src="docker://$img"
  dst="docker://registry.internal/${img#*/}"   # drop the upstream hostname, keep the path
  skopeo copy --all "$src" "$dst"
done < images.txt

# Two-step copy if the connected host has no path to the air-gap network
# 1. On the connected side: copy to a directory
skopeo copy --all docker://us-docker.pkg.dev/solo-public/.../agentgateway-enterprise:$VER \
  dir:/transfer/agentgateway-enterprise-$VER
# 2. Sneakernet the directory across the gap
# 3. On the air-gap side: push from the directory to the registry
skopeo copy --all dir:/transfer/agentgateway-enterprise-$VER \
  docker://registry.internal/solo-public/.../agentgateway-enterprise:$VER

Mirror the Helm OCI charts as well, either with oras copy or with helm pull followed by helm push oci://registry.internal/... .

Verify Cosign signatures before the copy

The mirror is your boundary of trust — verify here, then again at admission (Step 6). Don't let unsigned bits land in registry.internal.

# Verify, then copy. Fail closed.
while IFS= read -r img; do
  dst="registry.internal/${img#*/}"                    # same path, mirror hostname
  cosign verify --key https://solo.io/cosign.pub "$img" > /dev/null \
    || { echo "REJECTED: $img"; exit 1; }
  skopeo copy --all "docker://$img" "docker://$dst"
  cosign copy   "$img" "$dst"                          # carry the signature
done < images.txt

Sign the transfer manifest

Capture every digest at mirror time and sign the manifest with the platform-team key. Closes the "did I copy what I think I copied" gap.

# Build the digest inventory, sign it
while IFS= read -r img; do
  digest=$(skopeo inspect --no-tags --format '{{.Digest}}' "docker://registry.internal/${img#*/}")
  printf '%s@%s\n' "${img%:*}" "$digest"
done < images.txt | sort -u > transfer-manifest-$VER.txt

cosign sign-blob --key platform-team.key transfer-manifest-$VER.txt > transfer-manifest-$VER.sig

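# On the air-gap side, before any helm install.
# Assumption to validate per cosign version: cosign 2.x checks the Rekor transparency log
# by default, which an air-gapped host cannot reach. You may need --insecure-ignore-tlog=true
# here (with --tlog-upload=false at signing time), or a bundle produced at signing time.
cosign verify-blob --key platform-team.pub \
  --signature transfer-manifest-$VER.sig transfer-manifest-$VER.txt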
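Verifying the signature proves the manifest is the one the platform team signed; it does not yet prove the mirror holds those digests. One way to close that loop, sketched here, is to rebuild the digest inventory on the air-gap side and compare it with the signed file. The helper name manifest_drift and the file observed.txt are assumptions for illustration.

```shell
# Illustrative helper: manifest_drift is a hypothetical name.
# $1 = signed transfer manifest, $2 = inventory rebuilt from the mirror on the air-gap side.
# Prints every signed entry that is missing or carries a different digest; returns 1 on drift.
manifest_drift() {
  # -F fixed strings, -x whole-line match, -v invert: signed lines with no exact twin in $2.
  drift=$(grep -Fxv -f "$2" "$1")
  if [ -z "$drift" ]; then
    return 0
  fi
  printf '%s\n' "$drift" | sed 's/^/DRIFT: /'
  return 1
}

# Usage, assuming observed.txt was produced by rerunning the digest loop against registry.internal:
#   manifest_drift transfer-manifest-$VER.txt observed.txt || exit 1
```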

Auditable CI/CD pipeline

Laptop skopeo doesn't scale and leaves no audit trail. Minimum pipeline: pull → cosign verify → scan (Trivy/Grype) → human approval gate → push + cosign copy + sign manifest → log who/when/digest. The approval is the audit artefact, not the copy.

Compliance audits (FedRAMP, IL5, NIS2) want to see who put a digest into the mirror, when, and on the strength of which scan/approval. A pipeline gives you that for free; a laptop doesn't.

Step 3 Enable the containerd hosts directory

● On every node — /etc/containerd/config.toml + containerd restart

First node-side change. Edit config.toml on every node to enable the certs.d directory, then restart containerd once. After this one-time restart, all subsequent mirror changes are hot-reloaded per pull.

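Version floor: config_path and the hosts.d layout went GA in containerd 1.5. On anything older you're on the deprecated mirrors block inside config.toml. That works, but it requires a daemon restart per change and doesn't support per-host capabilities, so plan an upgrade rather than build on it. The snippet below uses the containerd 1.x config schema (version = 2); containerd 2.x moves the same key to [plugins."io.containerd.cri.v1.images".registry] under config version 3.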

version = 2

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"

Restart containerd once after this change:

sudo systemctl restart containerd

After the directory is enabled, subsequent hosts.toml edits do not require a containerd restart — containerd re-reads them per pull. That is the main operational payoff of this layout.
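A preflight check along these lines can confirm a node already has the setting before you rely on hot-reload. This is a sketch: the helper name has_config_path and the CONFIG variable are assumptions, and the grep only inspects the file on disk, not the running daemon.

```shell
# Illustrative preflight: has_config_path is a hypothetical helper name.
# Succeeds if the file contains an uncommented config_path pointing at certs.d.
has_config_path() {
  grep -Eq '^[[:space:]]*config_path[[:space:]]*=[[:space:]]*"/etc/containerd/certs\.d"' "$1"
}

CONFIG="${CONFIG:-/etc/containerd/config.toml}"
if [ -r "$CONFIG" ]; then
  if has_config_path "$CONFIG"; then
    echo "ok: $CONFIG enables the certs.d layout"
  else
    echo "missing: add config_path to $CONFIG, then restart containerd once"
  fi
fi
```

To see what the running daemon actually loaded, containerd config dump prints the merged configuration.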

Step 4 Drop a hosts.toml for every upstream registry

● On every node — files under /etc/containerd/certs.d/

One directory per source hostname (including non-default ports). The agentgateway image set typically needs three: GAR, GCR, and Docker Hub. No containerd restart needed — these files are picked up on the next pull because Step 3 enabled the hosts.d directory.

/etc/containerd/certs.d/
├── us-docker.pkg.dev/hosts.toml
├── gcr.io/hosts.toml
└── docker.io/hosts.toml

/etc/containerd/certs.d/us-docker.pkg.dev/hosts.toml

server = "https://us-docker.pkg.dev"   # upstream fallback; unreachable in air-gap, kept for clarity

[host."https://registry.internal"]
  capabilities = ["pull", "resolve"]
  # If the mirror uses a private CA:
  ca = "/etc/containerd/certs.d/us-docker.pkg.dev/mirror-ca.crt"
  # Leave override_path unset so containerd appends the standard /v2/<upstream path> to the mirror URL.
  # Set it to true only when the host URL above already contains the API root,
  # e.g. a mirror that nests everything under one project path.
  # override_path = true

/etc/containerd/certs.d/docker.io/hosts.toml — note Docker Hub's real host

server = "https://registry-1.docker.io"

[host."https://registry.internal"]
  capabilities = ["pull", "resolve"]

/etc/containerd/certs.d/gcr.io/hosts.toml

server = "https://gcr.io"

[host."https://registry.internal"]
  capabilities = ["pull", "resolve"]
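Because the three files differ only in the server line, they can be generated rather than hand-edited. The sketch below prints them to stdout; render_hosts_toml is a hypothetical helper name, and writing the output into place is left to whatever delivery mechanism you pick in Step 5.

```shell
# Illustrative generator: render_hosts_toml is a hypothetical helper name.
# $1 = upstream server URL, $2 = mirror URL, $3 = optional CA bundle path.
render_hosts_toml() {
  printf 'server = "%s"\n\n' "$1"
  printf '[host."%s"]\n' "$2"
  printf '  capabilities = ["pull", "resolve"]\n'
  if [ -n "${3:-}" ]; then
    printf '  ca = "%s"\n' "$3"
  fi
}

# Print the three files used in this step.
for pair in \
  "us-docker.pkg.dev=https://us-docker.pkg.dev" \
  "gcr.io=https://gcr.io" \
  "docker.io=https://registry-1.docker.io"
do
  host="${pair%%=*}"
  server="${pair#*=}"
  echo "# --- /etc/containerd/certs.d/$host/hosts.toml"
  render_hosts_toml "$server" "https://registry.internal"
done
```

Note the docker.io entry: the directory is named after the hostname in the image reference, while server points at Docker Hub's real registry endpoint.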

TLS posture

Public CA — no ca field; node trust store already trusts it.
Private CA (most prod air-gaps) — ca = "/etc/containerd/certs.d/<host>/mirror-ca.crt".
mTLS (high-assurance) — adds client = [["client.crt","client.key"]]; client cert is the identity.

# mTLS example
[host."https://registry.internal"]
  capabilities = ["pull", "resolve"]
  ca     = "/etc/containerd/certs.d/us-docker.pkg.dev/mirror-ca.crt"
  client = [["/etc/containerd/certs.d/us-docker.pkg.dev/client.crt",
             "/etc/containerd/certs.d/us-docker.pkg.dev/client.key"]]

🚫

Never set skip_verify = true. Silently disables TLS verification — any host that wins the IP race serves images. Fix the CA chain, don't bypass the check. Block with a node-image lint.

CA rotation: only mirror-ca.crt changes on the node — re-read per pull, no containerd restart. Alert at 30 days before expiry.
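The 30-day alert can be driven by openssl's -checkend option. The following is a sketch under assumptions: cert_expires_within is a hypothetical helper name, the CA path matches the us-docker.pkg.dev example above, and only the first certificate in a bundle is inspected.

```shell
# Illustrative check: cert_expires_within is a hypothetical helper name.
# Returns 0 (true) if the certificate expires within the given number of days.
# An unreadable or unparsable file is also reported as expiring, so the check fails closed.
cert_expires_within() {
  cert="$1"
  days="$2"
  # openssl -checkend exits 0 when the certificate is still valid N seconds from now.
  if openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

CA="${CA:-/etc/containerd/certs.d/us-docker.pkg.dev/mirror-ca.crt}"
if [ -r "$CA" ] && cert_expires_within "$CA" 30; then
  echo "WARN: $CA expires within 30 days"
fi
```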

Registry auth — no plaintext on nodes

# /etc/containerd/config.toml — MUST be 0600 root:root
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.internal".auth]
  username = "robot$airgap-puller"
  password = "<token>"

🔒

Permissions: chown root:root, chmod 0600. On RHEL-family, verify the SELinux label with ls -Z. Default 0644 leaks the token to every UID on the node, forever.

Inject the token via one of: cloud-init from a KMS-encrypted blob (decrypt with node instance identity, rotate by re-encrypting) · Vault Agent on the node (short-lived leases, native rotation, renders config.toml from template) · sops + age key baked into the node image (rotate by redeploying the image) · sealed-secrets / ESO + privileged DS (cluster-up only, not a Day-0 path). The biggest regression we see in the field is "we set it up once and the token never rotated" — pick a pattern that has rotation in the loop.
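Whichever injection pattern you choose, a lint in the node-image build or a periodic node check can catch a config.toml that drifted back to a permissive mode. A minimal sketch, assuming GNU coreutils stat on Linux nodes; check_secret_file is a hypothetical helper name.

```shell
# Illustrative lint: check_secret_file is a hypothetical helper name.
# $1 = file to check, $2 = expected owner:group (defaults to root:root).
# Assumes GNU coreutils stat (Linux).
check_secret_file() {
  want_owner="${2:-root:root}"
  mode=$(stat -c '%a' "$1") || return 2
  owner=$(stat -c '%U:%G' "$1") || return 2
  if [ "$mode" != "600" ]; then
    echo "FAIL: $1 has mode $mode, expected 600"
    return 1
  fi
  if [ "$owner" != "$want_owner" ]; then
    echo "FAIL: $1 is owned by $owner, expected $want_owner"
    return 1
  fi
  echo "ok: $1"
}

# Usage on a node:
#   check_secret_file /etc/containerd/config.toml
```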

Step 5 Choose a delivery mechanism for the node files

● Decides how the Step 3 + Step 4 files reach every node

Steps 3 and 4 describe what sits on each node. This step is about how it gets there and how it survives node replacement. This is the deep-dive for the hand-written hosts.toml path (Option 1 in the delivery matrix above). If you picked Option 2 or 3, the distro / bootstrap layer fans the same files out for you and you can skip this step.

In order of preference for production durability:

Bake into the node image — Packer / Ignition / cloud-init at AMI build time. Survives node replacement; no Day-2 reconciliation needed.
MachineConfig (OpenShift) or KubeadmConfigTemplate (Cluster API): declarative, survives node replacement.
Privileged DaemonSet writing to /etc/containerd/certs.d via hostPath — works on any cluster, fastest to deploy, but the files are lost the moment a node is replaced and the DS hasn't reconciled. Acceptable as a Day-0 bootstrap; not a long-term source of truth.

Hardening the privileged DaemonSet (if you use it)

A privileged: true + hostPID: true pod mounting the containerd socket can do anything to anything on the node. Same applies to the upgrade-time pre-pull DS later in this article. Constrain it:

One namespace, PodSecurity privileged profile — give the bootstrap a dedicated namespace (e.g. airgap-bootstrap) labelled pod-security.kubernetes.io/enforce=privileged. Don't let any other workload land there.
Job-per-node, not a perpetual DS: use a kind: Job with a node affinity per node and a ttlSecondsAfterFinished so the pod evaporates once it's done. Sweep with a CronJob if you want a reconciliation loop.
Pinned digest + signed image — never busybox:latest; pin to registry.internal/utils/busybox@sha256:... and verify the signature at admission (see the Admission-time verification step below). If a CVE lands in crictl or busybox, an unsigned :latest can pick up the compromise on the next pod restart.
NetworkPolicy on the namespace allowing only the mirror egress — the bootstrap pod should not be able to reach the cluster API, the cloud metadata service, or anywhere else.
RBAC for the ServiceAccount: nothing beyond what the bootstrap actually needs (no cluster-admin, no system:masters).

A privileged-DS-that-lingers is the single most common air-gap bootstrap mistake. The DS goes in to fix Day-0, nobody remembers to remove it, and six months later a CVE in crictl turns the bootstrap path into a node-takeover primitive. Prefer a Job that deletes itself.

Step 6 Admission-time signature verification

● Cluster-side — admission controller (Kyverno / Sigstore policy-controller / Connaisseur)

Step 2 verified signatures on the way into the mirror. Step 6 verifies signatures on the way out, at pod admission time. Without this, the mirror is a dumb cache and anyone with push rights to registry.internal can silently replace an image. With it, the cluster has two independent checks: was the image signed by Solo when we mirrored it (Step 2) and was it signed by Solo when we ran it (Step 6).

Option A — Kyverno `verifyImages`

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-solo-images
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
  - name: verify-solo-cosign
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces:
          - agentgateway-system
          - kgateway-system
          - istio-system
          - kagent-system
    verifyImages:
    - imageReferences:
      - "registry.internal/solo-public/*"
      attestors:
      - count: 1
        entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              <Solo's cosign.pub here>
              -----END PUBLIC KEY-----
      mutateDigest: true   # rewrite tag→digest so pods can't drift
      verifyDigest: true
      required: true

Option B — Sigstore policy-controller (`ClusterImagePolicy`)

apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: verify-solo-images
spec:
  images:
  - glob: "registry.internal/solo-public/**"
  authorities:
  - key:
      data: |-
        -----BEGIN PUBLIC KEY-----
        <Solo's cosign.pub here>
        -----END PUBLIC KEY-----

Option C — Connaisseur

Connaisseur is the third widely-used option, especially in Notation / TUF shops. Configuration shape is similar — declare the image glob and the trusted public key; failure mode is "admission webhook rejects the pod".

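Whichever you pick, scope the policy to the image reference that actually appears in the Pod spec, because that string is all an admission controller sees. The examples above match registry.internal/solo-public/**, which fits manifests that name the mirror directly. With the runtime-layer redirect, Pod specs keep the canonical us-docker.pkg.dev/... reference and the admission controller does not read containerd's hosts.toml, so the glob has to cover the canonical name and the verifier needs its own route to the signatures stored in the mirror; confirm this in the lab for your controller. Either way, keep the glob wide enough that an attacker cannot bypass the policy by pushing to an unmatched repo.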

Tradeoff: admission webhooks add latency to pod creation (typically < 100 ms with caching). For Kyverno, failurePolicy: Fail is the right setting: better to block deploys than to fail open. Combine it with a webhookTimeoutSeconds high enough to absorb a slow signature lookup, but not so high that a stuck webhook stalls every Pod create.

Step 7 Egress controls — prove the upstream is unreachable

● Network layer — NetworkPolicy / node firewall / cluster egress gateway

A journalctl grep that finds no us-docker.pkg.dev entries is absence of evidence, not evidence of absence. Enforce at the network layer that nodes cannot reach public registries, then actively test that the enforcement holds.

Layer 1 — node-level egress firewall

Most production air-gaps already have this at the perimeter, but it's worth confirming. Node SGs / VPC firewall rules / on-prem allow-list should permit egress to registry.internal only (plus DNS, NTP, OS-update mirror).

Layer 2 — Kubernetes `NetworkPolicy` (cluster-internal egress)

# Default-deny egress for the namespaces that host bootstrap / pre-pull pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress
  namespace: airgap-bootstrap
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress: []

---
# Then explicitly allow the mirror + DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-mirror-egress
  namespace: airgap-bootstrap
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
  - to:
    - ipBlock:
        cidr: 10.50.0.0/24   # registry.internal subnet
    ports:
    - protocol: TCP
      port: 443
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP      # DNS falls back to TCP for large responses
      port: 53

Layer 3 — active bypass test

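Periodically run a probe that tries to reach a public registry from inside the cluster. The test should fail, and the failure mode (NXDOMAIN, connection-refused, route blackhole) tells you which control is actually doing the work. A probe running in airgap-bootstrap is already covered by the default-deny NetworkPolicy from Layer 2, so also schedule a copy from a namespace without that policy (or with hostNetwork: true) if you want it to exercise the node firewall: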

# Schedule as a CronJob — alerts if the probe ever succeeds
apiVersion: batch/v1
kind: CronJob
metadata:
  name: airgap-bypass-probe
  namespace: airgap-bootstrap
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: probe
            image: registry.internal/utils/curl:8.5
            command: ["/bin/sh","-c"]
            args:
            - |
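              # If ANY of these answer, the air-gap is leaking. Registries reply 401 on /v2/
              # without credentials, so any HTTP response counts as reachable (hence no -f).
              for h in us-docker.pkg.dev gcr.io registry-1.docker.io quay.io; do
                if curl --max-time 5 -s -o /dev/null "https://$h/v2/" 2>/dev/null; then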
                  echo "LEAK: $h reachable from cluster"
                  exit 2
                fi
              done
              echo "ok — no public registries reachable"

Wire the exit 2 condition to your alerting pipeline. A leak is a Sev-1: it means an attacker on a compromised workload can pull from anywhere.

Why three layers? Defense in depth, and each layer fails for different reasons. Node firewall covers the node; NetworkPolicy covers in-cluster workloads; the probe catches both when a CNI update or an egress-gateway misconfig silently opens a path that nothing else flagged.
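To turn the probe's failure mode into a statement about which control blocked it, curl's exit status can be mapped to a cause. A sketch follows; classify_probe is a hypothetical helper name, and the interpretations are rules of thumb rather than proof of which device dropped the traffic.

```shell
# Illustrative mapping: classify_probe is a hypothetical helper name.
# $1 = exit status of a curl request against https://<registry>/v2/ with --max-time set.
classify_probe() {
  case "$1" in
    0)     echo "LEAK: host answered over HTTPS" ;;
    22)    echo "LEAK: host answered with an HTTP error (curl -f)" ;;
    6)     echo "blocked: DNS did not resolve the host" ;;
    7)     echo "blocked: connection refused or no route" ;;
    28)    echo "blocked: timed out, packets dropped or blackholed" ;;
    35|60) echo "intercepted: TLS handshake or certificate failure" ;;
    *)     echo "blocked: curl exit $1" ;;
  esac
}

# Usage inside the probe loop:
#   curl --max-time 5 -s -o /dev/null "https://$h/v2/"; classify_probe "$?"
```

A result that changes category between runs (say, from DNS failure to timeout) is worth an alert of its own, because it means a different layer is now carrying the enforcement.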

Upgrades and image lifecycle

The hard case is the rolling upgrade — both v1 and v2 images get pulled from the mirror concurrently for hours to days. Three rules:

Serve both N and N-1 for the full rollout. Push v2, immediately GC v1, and any restart / reschedule / rollback on a not-yet-drained node ImagePullBackOffs. The most common air-gap upgrade failure.
Surge upgrades amplify the overlap window (maxSurge > 0 or rolling-replacement node pools — new node up before old node drains, both versions pulling at once).
Istio sidecar mode needs its own retention rule — see below.

Istio sidecar retention

Sidecar-mode Istio: existing app pods keep running the v1 sidecar until the application pod is restarted. Any v1-sidecar pod that gets rescheduled, OOM-killed, evicted or drained pulls v1 from the mirror — if you've already GC'd v1, it ImagePullBackOffs. Mirror v1 until every workload has restarted on v2. Probe:

# Pods still running v1: must return empty before GC'ing v1.
# One line per pod; native-sidecar installs keep istio-proxy under .spec.initContainers, so check there too.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="istio-proxy")].image}{"\n"}{end}' \
  | grep ':1\.25\.0$'

Ambient (ztunnel) avoids this — ztunnel is per-node and rolls with the node, so the retention window matches the node rollout instead of the application-restart cadence.

Retention policy

N (current): always.
N-1: always during rollout; never delete during an active rollout (rollback depends on it).
N-2: through the soak period (typically 2–4 weeks after N is fully rolled).
Older: archive to cold storage rather than delete — storage is cheap, audit trail + CVE forensics are valuable.

Harbor and Quay both enforce retention by tag pattern + count + age. Configure once and stop hand-managing.

Pre-flight: enumerate the new version

Pattern B (node-template) caveat — disk doubles up. The new node template needs N plus N-1 bundled in during a rollout. Rough number for a full Solo stack (multi-arch): ~10–15 GB per generation per node → ~20–30 GB per node carrying N + N-1, with a transient spike near 30 GB at cutover. On a 40 GB root disk this is a hard blocker — switch the upgrade window to Pattern A, or size the node template accordingly.

helm pull oci://us-docker.pkg.dev/solo-public/.../enterprise-agentgateway \
  --version $NEW_VER --untar -d ./new
helm template ./new/enterprise-agentgateway -f values.yaml \
  | yq -r '.. | .image? // empty' | sort -u > images-new.txt

# Diff by digest, not by repo:tag: a tag that was re-pushed with new content
# keeps the same repo:tag, so a repo:tag diff would miss it.
resolve_digests () {
  while IFS= read -r ref; do
    digest=$(skopeo inspect --no-tags --format '{{.Digest}}' "docker://$ref")
    printf '%s@%s\n' "${ref%:*}" "$digest"
  done
}
resolve_digests < images-new.txt     | sort -u > images-new-digests.txt
resolve_digests < images-current.txt | sort -u > images-current-digests.txt
comm -23 images-new-digests.txt images-current-digests.txt > images-to-mirror.txt

Mirror images-to-mirror.txt before the Helm release. The Helm upgrade is then a no-network operation.

Pre-pull onto nodes (optional but worth it)

For large clusters or fragile mirror links: warm every node's local content store before rolling, so the rollout itself doesn't depend on registry availability. A privileged Job-per-node with ttlSecondsAfterFinished is cleaner than a perpetual DaemonSet — same effect, no lingering primitive (see Step 5 hardening).

# Per-node Job — runs once, deletes itself
apiVersion: batch/v1
kind: Job
metadata: { name: prepull-$NODE, namespace: airgap-bootstrap }
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      nodeName: $NODE
      containers:
      - name: prepull
        image: registry.internal/utils/crictl:1.30   # signed; in production pin by digest rather than tag
        command: ["/bin/sh","-c","for i in $IMAGES; do crictl pull $i || exit 1; done"]
        env: [{ name: IMAGES, value: "registry.internal/...:$NEW_VER ..." }]
        securityContext: { privileged: true }
        volumeMounts: [{ name: crisock, mountPath: /run/containerd/containerd.sock }]
      volumes: [{ name: crisock, hostPath: { path: /run/containerd/containerd.sock } }]
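The manifest above is a template with a $NODE placeholder, so something has to stamp out one copy per node. A minimal sketch, assuming the template is saved as a hypothetical prepull-job.yaml: it substitutes only $NODE with sed, because a general tool such as envsubst would also expand $IMAGES inside the container command, which must stay literal. The function name is illustrative, and $NEW_VER in the env value still has to be filled in separately.

```shell
# Read node names on stdin; emit one Job document per node from the template.
# Only the literal string $NODE is replaced; $IMAGES is left for the container shell.
render_prepull_jobs () {
  template=$1
  while IFS= read -r node; do
    [ -n "$node" ] || continue
    sed "s|[\$]NODE|$node|g" "$template"
    printf '%s\n' '---'
  done
}

# Usage (not run here):
#   kubectl get nodes -o name | sed 's|^node/||' \
#     | render_prepull_jobs prepull-job.yaml | kubectl apply -f -
```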

What does NOT change at upgrade time

The mirror config on each node (hosts.toml / IDMS / registries.yaml) maps upstream→mirror, not image→mirror, so only the mirror's contents change at upgrade. The exception is a new upstream registry (e.g. Solo adds a new GAR project): that is a node config change, so push it via the same delivery mechanism you used for the initial setup.
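One way to catch that exception before it bites is to compare the registry hosts in the new image list against the per-registry directories on a node. The sketch below covers the containerd layout only (a hosts.toml under each host's directory) and assumes every reference in the list is fully qualified with a registry host. The function name is illustrative.

```shell
# Print registry hosts that appear in the image list but have no
# <certs_dir>/<host>/hosts.toml on this node.
missing_mirror_config () {
  images=$1
  certs_dir=$2
  cut -d/ -f1 "$images" | sort -u | while IFS= read -r host; do
    [ -n "$host" ] || continue
    [ -f "$certs_dir/$host/hosts.toml" ] || printf '%s\n' "$host"
  done
}

# Usage (not run here):
#   missing_mirror_config images-new.txt /etc/containerd/certs.d
```

Any host it prints needs a mirror entry rolled out to the nodes before the upgrade; empty output means the existing node config already covers the new version.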

Digest pinning (high-assurance only) & rollback

Tags can be retagged; digests can't. If immutability matters, resolve every tag to @sha256: at mirror time and pin Helm/manifest references by digest — accept noisy upgrade diffs as the cost of supply-chain integrity.

Rollback works only if N-1 is still in the mirror: helm rollback recovers because the N-1 images remain in registry.internal and in most nodes' local content stores (the kubelet only garbage-collects unused images once disk usage crosses its threshold, least recently used first).

HA and DR for the mirror

Every node depends on the mirror for every pull. "One per air-gap network is enough" is right for steady state, wrong for the failure modes that page someone.

Topology: single instance (lab only) · active-passive with replicated blob store (most production air-gaps) · active-active behind LB on shared object storage (multi-cluster). Layer the multi-mirror trick in hosts.toml on top — list registry-primary.internal and registry-secondary.internal in declared order; containerd tries them in order.
Backup both stores: blob store (S3-compatible — versioning + lifecycle) and metadata DB (Harbor Postgres / Zot embed / Artifactory). Restore in lockstep.
RTO/RPO targets in writing: RTO-read typically 5–15 min on hot standby; RTO-write hours-tolerable; RPO zero with sync replication, daily snapshots typical for object-storage layouts.
Behaviour when unreachable: image already on the node → runs (local cache); new pull → fails fast (the mirror is configured to fail closed in air-gap — no upstream fallback attempt); flaky mirror + imagePullPolicy: Always → tail-latency balloon.
The actual saving grace: node-local pre-pull (Pattern B / per-node pre-pull Job) means the mirror only matters for new images and scale-out. Schedule the pre-pull to finish before any production rollout starts.
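The multi-mirror ordering from the topology item can be sketched as a small generator for hosts.toml. The [host."..."] tables and the capabilities key are standard containerd hosts.toml fields, and hosts are tried in the order they appear. The function name is illustrative, and the value passed for server is an assumption: it depends on how your setup enforces fail-closed, so keep whatever your initial configuration uses.

```shell
# Emit a hosts.toml body: first argument is the server value,
# remaining arguments are mirror hostnames in priority order.
render_hosts_toml () {
  server=$1
  shift
  printf 'server = "%s"\n' "$server"
  for mirror in "$@"; do
    printf '\n[host."https://%s"]\n  capabilities = ["pull", "resolve"]\n' "$mirror"
  done
}

# Usage (not run here):
#   render_hosts_toml "$SERVER" registry-primary.internal registry-secondary.internal \
#     > /etc/containerd/certs.d/us-docker.pkg.dev/hosts.toml
```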

Observability of the mirror itself

The verification step uses journalctl grep. That's fine on day one for one node. On a 200-node cluster you want this running continuously, with alerts.

Instrument: mirror hit/miss ratio (registry metrics) · pull failure rate by image (registry 4xx/5xx) · pull latency p50/p95/p99 by node (probe + access logs) · containerd attempts not hitting the mirror (journalctl / Falco / eBPF) · mirror disk utilisation · CA expiry probe · admission verification failures (Kyverno / policy-controller).

Minimum alerts (Sev-1 paging unless noted): p99 pull latency > 5 s for 5 min · pull failure rate > 1 % for any image (ticket) · any non-mirror outbound 443 from a cluster node · mirror disk > 80 % · CA expiry < 30 days (ticket) · admission policy rejection of a Solo image.
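For the latency alert, a real deployment would normally take percentiles from the registry's metrics. As a rough sketch for a probe-based setup, assuming a hypothetical log file with one pull duration in seconds per line, a nearest-rank percentile can be computed with sort and awk. The function name and the input format are assumptions for illustration.

```shell
# Nearest-rank percentile: read one number per line on stdin,
# print the value at rank ceil(p/100 * N). p is an integer 1..100.
percentile () {
  p=$1
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      k = int((p * NR + 99) / 100)   # integer ceiling of p*NR/100
      if (k < 1) k = 1
      print v[k]
    }'
}

# Usage (not run here): compare against the 5 s threshold
#   p99=$(percentile 99 < pull-seconds.txt)
#   awk -v v="$p99" 'BEGIN { exit !(v > 5) }' && echo "p99 pull latency above 5 s"
```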

Verification

A green pull via crictl plus pods Running with their upstream image references is the success signal: the runtime-layer mirror is doing its job and the application layer is unchanged.

Getting onto a node without SSH

Most platform teams don't SSH to nodes any more. Use kubectl debug node/... to drop a privileged pod onto the node and chroot /host into the node filesystem:

# Spawn a debug pod scheduled on a specific node
kubectl debug node/<node-name> -it --image=registry.internal/utils/busybox:1.36 -- chroot /host

# Once inside the node fs, run the steps below — crictl, journalctl,
# /etc/containerd/certs.d are all available as if you SSH'd in.

# Alternative: nsenter into the host's namespaces (PID 1) via a privileged pod
kubectl run nsenter --rm -it --restart=Never \
  --image=registry.internal/utils/busybox:1.36 \
  --overrides='{"spec":{"hostPID":true,"containers":[{"name":"x","image":"registry.internal/utils/busybox:1.36","stdin":true,"tty":true,"securityContext":{"privileged":true},"command":["nsenter","--target","1","--mount","--uts","--ipc","--net","--pid","--","sh"]}]}}'

Run the debug pod in a namespace that allows the Pod Security "privileged" level (pass -n <namespace>). By default, kubectl debug node/... creates the pod in the current namespace, usually default, where a hardened cluster's admission policy will reject it.

# 1. Confirm hosts.toml is in place on a representative node
ls /etc/containerd/certs.d/
cat /etc/containerd/certs.d/us-docker.pkg.dev/hosts.toml

# 2. Force a pull through containerd (bypasses kubelet caching)
sudo crictl pull us-docker.pkg.dev/solo-public/enterprise-agentgateway/agentgateway-enterprise:2026.5.0

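# 3. Confirm it actually hit the mirror, not the upstream
sudo journalctl -u containerd --since "5 minutes ago" \
  | grep -E 'registry\.internal|us-docker\.pkg\.dev'
# Expect lines referencing registry.internal; no outbound 443 attempts to us-docker.pkg.dev.
# If the grep comes back empty, containerd may not be logging per-host requests at its
# current log level; cross-check against the mirror's access log.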

# 4. Install Solo charts with their stock values — no image overrides
helm install enterprise-agentgateway-crds \
  oci://us-docker.pkg.dev/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway-crds \
  --version $VER -n agentgateway-system --create-namespace

helm install enterprise-agentgateway \
  oci://us-docker.pkg.dev/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway \
  --version $VER -n agentgateway-system

# NOTE: helm does NOT use containerd's hosts.toml for OCI pulls.
# helm uses oras-go and reads ~/.config/helm/registry/config.json
# + the system trust store. Configure helm registries independently,
# or — simpler — point helm install directly at the mirror:
#
#   helm install enterprise-agentgateway \
#     oci://registry.internal/solo-public/enterprise-agentgateway/charts/enterprise-agentgateway \
#     --version $VER -n agentgateway-system

# 5. Verify pods come up with their original image references intact —
#    they should, because the mirror is invisible at the manifest layer
kubectl get pods -n agentgateway-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{"\n"}{end}{end}'
# Expect images like us-docker.pkg.dev/... — and pods Running, not ImagePullBackOff

Pre-upgrade verification

Before bumping a Helm release, confirm both the current and the new image tags resolve through the mirror — on the same node, in sequence. This catches the most common upgrade failure (new images not mirrored yet) before it surfaces as ImagePullBackOff on a half-drained node.

# Current version still resolves
sudo crictl pull us-docker.pkg.dev/solo-public/.../agentgateway-enterprise:$CURRENT_VER

# New version resolves (proves mirror push succeeded)
sudo crictl pull us-docker.pkg.dev/solo-public/.../agentgateway-enterprise:$NEW_VER

# Every new image from images-to-mirror.txt (the gate below depends on this)
xargs -a images-to-mirror.txt -I{} sudo crictl pull {}

Only proceed with helm upgrade after every line in images-to-mirror.txt pulls clean.
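That gate can be sketched as a loop that pulls every entry, reports each failure, and returns non-zero if anything failed, so it can sit in front of helm upgrade in a script. The function name and the PULL_CMD override are hypothetical; the override exists only so the loop can be dry-run without a container runtime.

```shell
# Pull every reference listed in the file; succeed only if all pulls succeed.
# PULL_CMD is a hypothetical knob, defaulting to "crictl pull".
verify_pulls () {
  list=$1
  failed=0
  while IFS= read -r ref; do
    [ -n "$ref" ] || continue
    # stdin is redirected so the pull command cannot consume the list
    if ! ${PULL_CMD:-crictl pull} "$ref" </dev/null >/dev/null; then
      echo "FAILED: $ref" >&2
      failed=$((failed + 1))
    fi
  done < "$list"
  [ "$failed" -eq 0 ]
}

# Usage (not run here, as root on the node):
#   verify_pulls images-to-mirror.txt && helm upgrade ...
```

Unlike a bare xargs run, this names each reference that failed, which is usually what you need when a single transitive image was missed.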

Upgrade Solo Enterprise for Agentgateway — apply this mirror config first, then upgrade with stock Helm values.
containerd registry configuration reference — full hosts.toml field reference.
OpenShift ImageDigestMirrorSet / ImageTagMirrorSet — config.openshift.io/v1 reference. OpenShift only; the Machine Config Operator applies it. No equivalent CRD or controller in upstream Kubernetes.
skopeo copy — multi-arch and digest-preserving image transfer.

Treat this article as a reference shape — the runtime-layer mirror pattern is sound, and the Solo-specific image enumeration step in Step 1 is the one to validate against your install before promoting to production.