Solo's product was renamed in 2.12 — what used to be called Gloo Mesh is now
Solo Enterprise for Istio. The management plane on top
(gloo-platform Helm chart, gloo-mesh namespace, meshctl
CLI) keeps those original artifact names — that's still what shows up in your cluster — but
the product name used in prose throughout this page is the new one.
Either path delivers the same wire protocol — Solo Istio Ambient pods talking HBONE on port 15008 to peer east-west gateways, identity in SPIFFE, mTLS everywhere. The choice is about who runs the trust pipeline, who carries the credential material, and who provides the audit trail. The questions that matter aren't about throughput; they're about Day 90 and the compliance audit that follows.
Why this matters
A working multi-cluster mesh is the start of the operator's problem, not the end. Stand-up is a one-time event; Day 2 is forever. The seven production realities below are where the manual path stops scaling — each one is a side-by-side comparison further down the page.
1. Root CA rotation. Rare but high-stakes: driven by compromise, cryptographic obsolescence (e.g. moving from RSA-2048 to RSA-4096), organisational change (M&A, trust-domain restructure), or aged key material. Root certs typically live 10–25 years; the rotation needs to happen without breaking in-flight mTLS across every cluster in the mesh.
2. Intermediate & workload-cert rotation. Intermediates typically rotate every 1–3 years; workload SPIFFE certs every ~24h by default. On the workload tier that is every cluster × every pod × every day, so at scale automation isn't optional.
3. Adding a cluster on Day 90. A new region, a new acquisition, a new tenant, all without touching the existing N clusters.
4. Audit. "Who allowed cluster X to join the mesh on date Y, with what permissions?" should be answered in seconds, not days.
5. RBAC consistency across N clusters. One policy in mgmt → enforced everywhere, no per-cluster drift.
6. Federation across trust domains. The M&A case: a bank acquires a fintech, both already on Istio, both with their own root of trust.
7. Observability. A single pane of glass across the mesh, versus N Grafanas, N Kialis, N sets of dashboards that drift.
Side-by-side scenarios
For each scenario, the amber column is what the operator actually types on the manual path; the green column is what the operator types when the Solo Enterprise for Istio management plane is in front of the mesh. Same data plane underneath — different operator experience on top.
Bootstrap a new cluster into an existing 3-cluster mesh
You already have a working three-cluster ambient mesh. A new region (call it
prod-eu-west-3) gets approved. The fourth kubeconfig lands on your laptop. What
happens between "kubeconfig arrives" and "first workload accepts cross-cluster traffic"?
About — what's the real cost?
The honest cost on the manual path: roughly 20–30 minutes per new cluster in a 3-existing-cluster mesh, assuming everything works first time. The dominant cost isn't the typing — it's the private key material being copied over scp (root CA private key) and the N×(N−1) peer exchange: every existing cluster needs the new cluster's peer bundle, and the new cluster needs everyone else's. That copying is the operation a compliance auditor will ask you about in twelve months.
The honest cost on the management-plane path: roughly 60 seconds, because cluster registration is one CR (`KubernetesCluster` in the mgmt namespace) plus a service-account token bound to it. The mgmt plane already holds the trust anchor; it issues a per-cluster intermediate and pushes it via the agent running on the new cluster. No private key crosses your laptop.
What the operator actually types (manual path)
Generate a new intermediate from the shared root CA, copy root-ca.crt and
root-ca.key to the new host, install the mesh, expose the east-west GW, and
cross-apply remote-secrets 2×N times (every existing cluster ↔ the new
cluster).
# 1. On a trusted ops host: mint a new intermediate from the shared root
./gen-intermediate.sh prod-eu-west-3 \
--root-ca-cert ~/safe/root-ca.crt \
--root-ca-key ~/safe/root-ca.key \
--out ./prod-eu-west-3-cacerts
# 2. scp the cacerts Secret material onto the new cluster
kubectl --context prod-eu-west-3 create ns istio-system
kubectl --context prod-eu-west-3 -n istio-system create secret generic cacerts \
--from-file=./prod-eu-west-3-cacerts/
# 3. Stand up Istio Ambient + east-west GW + peering chart on the new cluster
helm install istio-base ... ; helm install istiod-gloo ... ;
helm install istio-cni ... ; helm install ztunnel ... ;
helm install peering ... # exposes the istio-eastwest GW
# 4. Cross-apply remote-secrets — N*(N-1) operations
for SRC in cluster-1 cluster-2 cluster-3 prod-eu-west-3; do
for DST in cluster-1 cluster-2 cluster-3 prod-eu-west-3; do
[ "$SRC" = "$DST" ] && continue
istioctl create-remote-secret --context $SRC --name $SRC \
| kubectl --context $DST apply -f -
done
done
# 5. Restart istiod everywhere so it re-reads the new remote-secrets
for C in cluster-1 cluster-2 cluster-3 prod-eu-west-3; do
kubectl --context $C -n istio-system rollout restart deploy/istiod-gloo
done
As written, step 4 runs 12 apply operations (4 × 3 ordered pairs) for a 4-cluster final state, plus a rolling istiod restart that briefly stalls xDS for in-flight pods.
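The loop in step 4 walks every ordered pair, including pairs between clusters that were already peered. A minimal sketch, assuming the same context names as above, of enumerating only the pairs that involve the new cluster; the function name `new_cluster_pairs` is illustrative and not part of any Solo or Istio tool, and the loop only prints the commands rather than running them:

```shell
#!/usr/bin/env bash
# Print "SRC DST" for every ordered pair that involves the new cluster.
# Usage: new_cluster_pairs NEW EXISTING...
new_cluster_pairs() {
  local new="$1"; shift
  local peer
  for peer in "$@"; do
    echo "$new $peer"   # the new cluster's secret lands on an existing peer
    echo "$peer $new"   # the existing peer's secret lands on the new cluster
  done
}

# Dry run: print the commands instead of executing them.
new_cluster_pairs prod-eu-west-3 cluster-1 cluster-2 cluster-3 |
while read -r src dst; do
  echo "istioctl create-remote-secret --context $src --name $src | kubectl --context $dst apply -f -"
done
```

For three existing clusters this prints six commands, which matches the "2 per existing peer" count used elsewhere on this page.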
What the operator actually types (management-plane path)
Register the kubeconfig with the management plane. The mgmt agent on the new cluster handles intermediate-CA issuance and peer-secret distribution; the existing clusters are not touched.
# Single command on the ops host. The mgmt plane already trusts the
# root CA and issues a per-cluster intermediate. The agent installed
# on the new cluster pulls config from the mgmt plane.
meshctl cluster register prod-eu-west-3 \
--mgmt-context mgmt \
--remote-context prod-eu-west-3 \
--version 2.12.0
# Verify
kubectl --context mgmt -n gloo-mesh get kubernetescluster prod-eu-west-3
# NAME AGE AGENT-VERSION STATUS
# prod-eu-west-3 34s 2.12.0 Connected
Rotate the root CA
Root CAs live a long time (10–25 years is typical), so this isn't a routine cadence — but when it has to happen, it has to happen without breaking in-flight mTLS across every cluster in the mesh. Real triggers: compromise / breach response, cryptographic obsolescence (e.g. RSA-2048 → RSA-4096 or an algorithm change), M&A trust-domain restructure, or aged key material approaching its policy limit. The rotation runs through a dual-trust window where both old and new roots are accepted simultaneously, then the old one is retired.
About — what's the real cost?
Why this is the scariest scenario on the manual path: if you update one
cluster's cacerts Secret to use a new intermediate signed by the new root, but a
peer cluster hasn't received the new root in its root-cert.pem trust bundle yet, that peer
will reject the SVID it receives over HBONE, and you've just broken cross-cluster mTLS
mid-rotation. There is no rollback button that's fast enough; you have to manually re-sequence.
What the management plane gives you: the rotation is orchestrated.
The mgmt plane knows the topology, knows which clusters trust which roots, and rolls the change
through them in an order it has proven safe. It also integrates upstream — your RootTrustPolicy
can source the new root from HashiCorp Vault or AWS Private CA
rather than a file you generated by hand. That's the integration most enterprise PKI teams
actually need to sign off on the design.
What the operator actually types (manual path)
Generate a new root, regenerate every cluster's intermediate, distribute via cacerts
Secret update across all N clusters in lockstep (otherwise mTLS breaks
mid-rotation), restart istiod everywhere, hope nobody's mid-handshake.
# 1. Generate the new root CA (on a trusted ops host)
./gen-root-ca.sh --out ~/safe/root-ca-2026Q2
# 2. For each cluster: regenerate the intermediate signed by the NEW root,
# but keep the OLD root in the trust chain during the transition
for C in cluster-1 cluster-2 cluster-3; do
./gen-intermediate.sh $C \
--root-ca-cert ~/safe/root-ca-2026Q2/root-ca.crt \
--root-ca-key ~/safe/root-ca-2026Q2/root-ca.key \
--out ./tmp/$C
# Publish BOTH roots (old + new) in root-cert.pem, the trust-bundle key of
# the cacerts Secret. Leave ca-cert.pem alone: it must stay the intermediate
# certificate that matches ca-key.pem.
cat ~/safe/root-ca-2025Q1/root-ca.crt \
~/safe/root-ca-2026Q2/root-ca.crt \
> ./tmp/$C/root-cert.pem
kubectl --context $C -n istio-system create secret generic cacerts \
--from-file=./tmp/$C/ --dry-run=client -o yaml \
| kubectl --context $C -n istio-system apply -f -
done
# 3. Rolling istiod restart so it picks up the new intermediate
for C in cluster-1 cluster-2 cluster-3; do
kubectl --context $C -n istio-system rollout restart deploy/istiod-gloo
kubectl --context $C -n istio-system rollout status deploy/istiod-gloo
done
# 4. Wait for every workload pod to have re-issued an SVID from the new
# intermediate (default rotation period ~24h). Only then is it safe to
# remove the OLD root.
# 5. Drop the OLD root from root-cert.pem and re-apply cacerts everywhere
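Because the dual-trust window is the step that keeps cross-cluster mTLS alive, it can help to check the trust bundle before applying it. A minimal sketch, assuming the bundle is a plain PEM file with one block per root; the function names `count_certs` and `has_dual_trust` are hypothetical helpers, and the check only counts certificates, it does not validate them:

```shell
#!/usr/bin/env bash
# Count the PEM certificates in a bundle file.
count_certs() {
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}

# Succeed only if the bundle carries at least two roots (old + new).
has_dual_trust() {
  [ "$(count_certs "$1")" -ge 2 ]
}

# Example guard before patching a cluster (path is illustrative):
#   has_dual_trust "./tmp/$C/<trust-bundle>.pem" || { echo "refusing: single root"; exit 1; }
```

A count-based guard is deliberately crude. It catches the common mistake of overwriting the bundle with only the new root, but it cannot tell whether the two certificates are the intended old and new roots.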
What the operator actually types (management-plane path)
Rotate the root in the mgmt plane's RootTrustPolicy CR (or upstream in Vault /
AWS Private CA, which the mgmt plane reads as its source-of-truth). The mgmt plane
orchestrates rolling rotation across all registered clusters.
# With Vault as the upstream root of trust:
kubectl --context mgmt -n gloo-mesh apply -f - <<'EOF'
apiVersion: admin.gloo.solo.io/v2
kind: RootTrustPolicy
metadata:
  name: root-trust
  namespace: gloo-mesh
spec:
  config:
    mgmtServerCa:
      generated: {}
    intermediateCertOptions:
      secretRotationGracePeriodRatio: 0.5   # auto re-issue intermediates at 50% TTL
    autoRestartPods: true                   # gracefully roll workloads after rotation
    agentCa:
      vault:                                # source of truth = Vault
        caPath: pki_root/ca
        csrPath: pki_root/sign/intermediate
        server: https://vault.example.com:8200
        kubernetesAuth:
          role: gloo-mesh-mgmt
EOF
# When the auditor asks "was the rotation successful on cluster-3?"
kubectl --context mgmt -n gloo-mesh get kubernetescluster cluster-3 -o yaml \
| yq '.status.caStatus'
The RootTrustPolicy CR above is the long-standing Gloo-Mesh-managed PKI primitive. Field names and the exact upstream integrations (Vault, AWS Private CA, cert-manager) evolve across Solo Enterprise for Istio minor versions, so always check the current docs before writing the YAML you'll commit.
Rotate the remote-secret tokens
The kubeconfig Secrets that one cluster's istiod uses to discover the others — the
istio-remote-secret-* Secrets in istio-system — are bearer tokens.
Compliance frameworks treat them like any other API credential: rotate on a schedule, rotate on
personnel change, rotate on suspected compromise.
About — what's the real cost?
The hidden cost on the manual path is that nobody actually does this
rotation, because doing it correctly means N×(N−1) coordinated kubectl apply operations and
a per-token validity check. So in practice the tokens just live forever, which is exactly the
failure mode the rotation policy was supposed to prevent.
What the management plane changes: the mgmt plane re-issues the trust material via its agent-server pipeline. No long-lived bearer token sits in a kubeconfig in every cluster — the relay agents authenticate to the mgmt plane via mTLS using identities the mgmt plane itself issues.
What the operator actually types (manual path)
Per-cluster: regenerate the bound service-account token, build a new remote-secret YAML,
kubectl apply on every peer, verify istiod picked it up.
N×(N−1) cross-applies.
# For each "source" cluster, rebuild its istio-reader token,
# then apply the resulting kubeconfig Secret on every PEER cluster.
for SRC in cluster-1 cluster-2 cluster-3; do
# Re-create the bound token (the old one keeps working until TTL expires)
kubectl --context $SRC -n istio-system create token istio-reader-service-account \
--duration 8760h > /tmp/$SRC.token
# Rebuild the kubeconfig Secret YAML with the new token
./build-remote-secret.sh \
--context $SRC \
--token /tmp/$SRC.token \
--ca-cert /tmp/$SRC.ca.crt \
--server $(kubectl --context $SRC config view -o jsonpath="{.clusters[?(@.name=='$SRC')].cluster.server}") \
> /tmp/istio-remote-secret-$SRC.yaml
# Apply it on every PEER cluster
for DST in cluster-1 cluster-2 cluster-3; do
[ "$SRC" = "$DST" ] && continue
kubectl --context $DST apply -f /tmp/istio-remote-secret-$SRC.yaml
done
done
# For 3 clusters: 6 cross-applies. For 10 clusters: 90.
The script also leaves credential material on disk (files in /tmp with a bearer token in them). The
per-cluster verify-and-rollback story is on you.
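The per-token validity check mentioned above can be approximated by reading the `exp` claim out of the service-account token, which is a JWT. A minimal sketch, assuming GNU coreutils (`base64 -d`) and a token whose payload is plain JSON; the function names `jwt_exp` and `jwt_ttl_seconds` are hypothetical, and the code only decodes the token, it does not verify its signature:

```shell
#!/usr/bin/env bash
# Print the `exp` claim (Unix seconds) of the JWT passed as $1.
jwt_exp() {
  local payload
  # The payload is the second dot-separated field, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops padding; restore it so `base64 -d` accepts the input.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d |
    sed -n 's/.*"exp":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the seconds left until expiry (negative if already expired).
jwt_ttl_seconds() {
  echo $(( $(jwt_exp "$1") - $(date +%s) ))
}

# Example (file name taken from the loop above):
#   jwt_ttl_seconds "$(cat /tmp/cluster-1.token)"
```

Running this against each freshly minted token before the cross-apply gives a cheap sanity check that the requested `--duration` was actually honoured, since the API server may cap it.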
What the operator actually types (management-plane path)
Handled automatically — the management plane re-issues the trust material to its workload agents via its own pipeline. No human-touched secret material.
# Nothing.
#
# The relay agents on workload clusters authenticate to the mgmt
# server via mTLS — using identities the mgmt plane itself issues
# and rotates on the schedule you set in RootTrustPolicy.
#
# For visibility into the rotation cadence:
kubectl --context mgmt -n gloo-mesh get kubernetescluster -o wide
# NAME STATUS AGENT VERSION LAST HEARTBEAT
# cluster-1 Connected 2.12.0 2026-05-15T08:14:33Z
# cluster-2 Connected 2.12.0 2026-05-15T08:14:35Z
# cluster-3 Connected 2.12.0 2026-05-15T08:14:31Z
AccessPolicy across all clusters
"Only services in the payments namespace can call billing" — the same
rule needs to be in force on every cluster that runs either workload, and the rule must stay
consistent as the team adds and removes services.
About — what's the real cost?
The manual path is fine until your kustomize overlays drift. The honest
failure mode isn't writing the policy — it's that a junior engineer fixes a 503 in production
by adding "*" to an overlay on cluster-2 to "unblock the incident", then forgets to
propagate the rollback. Three months later, cluster-2 has a permissive policy and cluster-1 has
the strict one and nobody has noticed.
What the management plane changes: one CR in the mgmt namespace, translated
to per-cluster AuthorizationPolicy automatically. There is no overlay to drift.
The audit query is "show me every cluster where this policy is in force" and the answer is in
the mgmt plane's status.
What the operator commits to git (manual path)
Write AuthorizationPolicy once per cluster, maintain in sync via GitOps with N
kustomize overlays.
# overlays/cluster-1/billing-allow-payments.yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: billing-allow-payments
  namespace: billing
spec:
  selector: { matchLabels: { app: billing } }
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - "cluster.local/ns/payments/sa/payments"
# ... and the same file copied into overlays/cluster-2/...
# ... and overlays/cluster-3/...
# When the SA name changes, you change it in N places.
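Since the manual path relies on the same file being copied into every overlay, drift can be detected mechanically. A minimal sketch, assuming the `overlays/<cluster>/<file>` layout shown above and that the copies are meant to be byte-identical; `overlay_versions` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Print the number of distinct versions of FILE found under ROOT/*/FILE.
# 1 means every overlay carries identical bytes; more than 1 means drift.
overlay_versions() {
  local root="$1" file="$2" f
  for f in "$root"/*/"$file"; do
    [ -f "$f" ] && cksum < "$f"
  done | sort -u | wc -l | tr -d ' '
}

# Example, run from the repo root in CI:
#   [ "$(overlay_versions overlays billing-allow-payments.yaml)" = "1" ] \
#     || echo "drift detected in billing-allow-payments.yaml"
```

This only catches drift in git. A hot-fix applied straight to a cluster with kubectl never reaches the overlays, which is the failure mode described above.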
What the operator commits to git (management-plane path)
One CR in the mgmt plane, translated to per-cluster AuthorizationPolicy on
every workload cluster automatically. Single source of truth.
# One file — applied to the mgmt cluster.
# The mgmt plane translates this to per-cluster AuthorizationPolicy
# in every workload cluster where billing or payments exists.
apiVersion: security.policy.gloo.solo.io/v2
kind: AccessPolicy
metadata:
  name: billing-allow-payments
  namespace: gloo-mesh
spec:
  applyToDestinations:
  - selector:
      labels: { app: billing }
  config:
    authn:
      mtls: { required: true }
    authz:
      allowedClients:
      - serviceAccountSelector:
          labels: { app: payments }
The *.gloo.solo.io API group and the exact kind + apiVersion track the management-plane minor version. The agentic-AI side of the stack uses a different family (policy.kagent-enterprise.solo.io/AccessPolicy) for waypoint-attached runtime authz, separate from the mesh-wide AccessPolicy shown above. Read the current policies docs before writing the YAML you'll commit.
Federate a new Service on Day 90 — "make it global"
A new microservice needs to be reachable across the mesh. On Day 1 you knew which Services
needed the solo.io/service-scope=global label and you set it. On Day 90 a team adds
a new service and asks "how do we make this one global?"
About — what's the real cost?
The manual path works if you remember every cluster. Label the Service on every cluster that runs it. Verify istiod picked it up. Debug per-cluster if it didn't. The cost grows linearly with the number of clusters, and it is paid again by every team that needs to do this.
The management plane gives you one declarative intent. Solo 2.12
introduced the Segment CR — the Ambient-native global-aliasing primitive that
fronts a Service across every cluster registered with the mesh. Apply once at the mgmt layer;
the translation (the solo.io/service-scope=global label + any per-cluster
plumbing) happens everywhere the segment resolves.
Why we don't use VirtualService or VirtualDestination here. Both are
sidecar-era primitives that pre-date Ambient. They still exist for back-compat with Solo Mesh
2.x deployments running classic Istio sidecars, but Ambient's data plane (ztunnel + waypoints)
doesn't honour them. Segment is what 2.12 added specifically because Ambient needed
its own federation primitive.
What the operator types per-cluster
Label the Service on every cluster that runs it. Verify istiod programmed the synthetic global hostname. Debug per-cluster if it didn't.
# Per-cluster: label the Service, then verify
for C in cluster-1 cluster-2 cluster-3; do
kubectl --context $C -n bookinfo label svc productpage \
solo.io/service-scope=global --overwrite
# Did istiod pick it up?
istioctl --context $C multicluster check | grep "Shared Services"
done
# If one cluster shows zero shared services, you're debugging per-cluster:
# - is istiod's clusterID right?
# - is the network label on the namespace?
# - is the multicluster license actually present (lt: ent, not lt: trial)?
What the operator commits once
Declare the federation intent in one Segment CR on the mgmt plane. The mgmt
plane handles per-cluster translation and surfaces the result in its status.
# One file, applied once on mgmt — the Ambient-native global-aliasing CR.
apiVersion: networking.solo.io/v2alpha1
kind: Segment
metadata:
  name: productpage-global
  namespace: gloo-mesh
spec:
  # Workloads anywhere in the mesh dial this hostname; istiod programs a
  # synthetic VIP (from 240.240.0.0/16) with endpoints from every cluster
  # that has a matching Service.
  hosts:
  - productpage.bookinfo.mesh.internal
  selector:
    namespace: bookinfo
    labels: { app: productpage }
  ports:
  - number: 9080
    protocol: HTTP
# Verify which clusters translated it (which workload clusters now have
# the Service labelled global and istiod programmed the synthetic VIP):
kubectl --context mgmt -n gloo-mesh get segment productpage-global \
-o jsonpath='{.status.clusters}'
Segment shipped in Solo Enterprise for Istio 2.12 specifically as the Ambient global-aliasing primitive. The exact field names above (selector, hosts, ports) follow the documented schema, but Solo iterates on these CRs at field level between minor releases, so check docs.solo.io for the version you're running before committing YAML to git. The concept is stable; the field names may shift within a patch release or two.
Audit — who registered cluster X with the mesh?
The compliance auditor's exact question, recorded verbatim from a real engagement: "Show me, for each cluster currently in your mesh, the date it was added, the person who added it, and the trust-domain it was added under." You have 24 hours to produce the answer.
About — what's the real cost?
The manual path is "dig through git history". If your install scripts are in git, the answer is reconstructible — when was the peer-bundle for cluster-X first applied on each peer? Who pushed that commit? The reconstruction is possible but it's a forensic exercise, not a query.
The management plane gives you a CR with a creation timestamp. The
`KubernetesCluster` CR in the mgmt namespace was created the moment cluster-X was registered.
Standard Kubernetes audit-logging on the mgmt cluster captures the identity that did it. The
auditor's question reduces to kubectl get.
What the operator does (manual path)
Dig through git history of the peer-bundle apply YAML, hope someone kept ChatOps records, reconcile with the cert serial numbers visible in the intermediate cert chain.
# Best-case reconstruction:
git -C ops/gitops log --diff-filter=A -- "remote-secrets/cluster-prod-eu-west-3*.yaml"
# commit abc123 Author: Jane Doe Date: 2026-02-12
# commit def456 Author: Jane Doe Date: 2026-02-12
# What trust-domain was used? Inspect the intermediate cert SAN in the
# secret material — assuming it's still in the bucket:
openssl x509 -in ./tmp/prod-eu-west-3/ca-cert.pem -text -noout \
| grep -A1 "Subject Alternative Name"
# What permissions were granted? The remote-secret is bound to a SA — go
# read that SA's RoleBindings on the source cluster.
What the operator does (management-plane path)
Query the mgmt plane's KubernetesCluster CRs. When, by whom, with what trust
domain, current health — all in one place.
# The CR exists from the moment of registration.
kubectl --context mgmt -n gloo-mesh get kubernetescluster -o wide
# NAME CREATED AGENT VERSION TRUST DOMAIN STATUS
# cluster-1 2025-08-12 2.12.0 cluster.local Connected
# cluster-2 2025-08-12 2.12.0 cluster.local Connected
# cluster-3 2025-08-12 2.12.0 cluster.local Connected
# prod-eu-west-3 2026-02-12 2.12.0 cluster.local Connected
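# Who did it? Events on the mgmt cluster give a quick answer, but they are
# short-lived (the API server keeps them for 1h by default). The durable
# record is the Kubernetes audit log, provided audit logging is enabled there.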
kubectl --context mgmt -n gloo-mesh get events \
--field-selector involvedObject.name=prod-eu-west-3
# 2026-02-12T10:14Z Registered by jane.doe@bank.example.com
Observability — one pane vs N panes
A user complains that productpage is slow. The call traverses three clusters. Where
is the latency?
About — what's the real cost?
The manual path's failure mode is correlation. Each cluster has its own Prometheus, its own Grafana, its own Jaeger. When the call crosses clusters, the trace ID is preserved (Istio is good about this) but joining the segments requires either federated metrics storage (Thanos / Cortex / Mimir) that you set up yourself, or per-cluster tab-switching while your pager is going off.
The Gloo UI on the management plane consolidates this: service topology across the mesh, aggregated metrics, cluster-health view, and the cross-cluster trace joined for you. It's the same Prometheus data underneath — the difference is who does the federation.
What the operator does at 2am (manual path)
Open three Grafana tabs. Open three Jaeger tabs. Join the trace IDs by hand. Reproduce the call. Eyeball.
# Per-cluster setup that the operator built and now maintains:
# - Prometheus + node-exporter + kube-state-metrics ×N clusters
# - Grafana + dashboards ×N clusters
# - Jaeger / Tempo ×N clusters
# - (optionally) Thanos / Mimir to federate metrics — built by you
#
# Investigation flow:
# 1. Open Grafana on cluster-1, find the failing request
# 2. Note the trace ID
# 3. Open Jaeger on cluster-1, find the segment that exits the cluster
# 4. Switch to Jaeger on cluster-2, find the segment that enters
# 5. Repeat for cluster-3
# 6. Reconstruct the timeline in your head
What the operator does at 2am (management-plane path)
Open the Gloo UI on the mgmt cluster. Look at the service-topology view. The cross-cluster trace is already joined; per-hop latency is on the edge.
# One UI, fed by the mgmt plane's telemetry pipeline.
# Service topology across the whole mesh, with per-edge latency.
# Cluster-health view: which agents are connected, which aren't.
# Aggregated metrics keyed by service, namespace, AND cluster.
# Same Prometheus data underneath — the difference is that the
# mgmt plane federates it for you.
kubectl --context mgmt -n gloo-mesh port-forward svc/gloo-mesh-ui 8090:8090
# Open http://localhost:8090
Architecture comparison
The visual story behind why the manual path doesn't scale: every edge in the manual diagram is
a cross-cluster kubectl apply the operator has to do (and re-do, and audit). In the
managed diagram every edge is automated.
Manual — complete graph: N×(N−1) trust + secret exchange edges
┌──────────────┐ ┌──────────────┐
│ cluster-1 │ ◄────────────► │ cluster-2 │
│ istiod-gloo │ remote-sec │ istiod-gloo │
│ east-west │ peer-bundle │ east-west │
└──────┬───────┘ └──────┬───────┘
│ │
│ remote-sec, peer-bundle │
▼ ▼
┌──────────────────────────────────────────────┐
│ cluster-3 │
│ istiod-gloo, east-west GW, cacerts Secret │
└──────────────────────────────────────────────┘
Each ◄────► edge is TWO kubectl applies + the cert material to back them.
For N clusters the operator owns N*(N-1) such edges.
Adding cluster-4: 6 NEW edges (2 per existing peer).
Rotating root CA: every cacerts Secret on every cluster, in lockstep.
Solo Enterprise for Istio — star: N workload clusters register with one mgmt plane
┌────────────────────────────────────────────────────┐
│ Solo Enterprise for Istio management plane │
│ │
│ KubernetesCluster CRs │
│ RootTrustPolicy │
│ AccessPolicy / Workspace │
│ Segment (Ambient federation) │
│ Gloo UI (federated metrics) │
└──────┬───────────────┬───────────────┬─────────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ cluster-1 │ │ cluster-2 │ │ cluster-3 │
│ gloo-mesh- │ │ gloo-mesh- │ │ gloo-mesh- │
│ agent │ │ agent │ │ agent │
│ istiod-gloo │ │ istiod-gloo │ │ istiod-gloo │
│ east-west │ │ east-west │ │ east-west │
└─────────────┘ └─────────────┘ └─────────────┘
Each line is one mTLS pipe the mgmt plane owns end-to-end.
Adding cluster-4: ONE new edge (mgmt → cluster-4).
Rotating root CA: ONE RootTrustPolicy change, mgmt plane rolls it.
Data-plane (cluster-to-cluster HBONE) still flows direct — same data plane,
the mgmt plane is for trust + policy + audit, not for proxying traffic.
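The edge counts quoted in both diagrams follow from two small formulas: N×(N−1) directed edges for the full mesh, N for the star. A minimal sketch of the arithmetic; the function names are illustrative only:

```shell
#!/usr/bin/env bash
# Directed trust/secret edges in a full mesh of N clusters: N*(N-1).
mesh_edges() { echo $(( $1 * ($1 - 1) )); }

# Edges in a star: one per workload cluster registered with the mgmt plane.
star_edges() { echo "$1"; }

for n in 3 4 10; do
  echo "N=$n manual=$(mesh_edges "$n") managed=$(star_edges "$n")"
done
# N=3 manual=6 managed=3
# N=4 manual=12 managed=4
# N=10 manual=90 managed=10
```

Going from three to four clusters adds 12 − 6 = 6 edges on the manual path and one on the managed path, which is the "6 NEW edges" versus "ONE new edge" contrast drawn in the diagrams.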
When to use which
The manual path is genuinely fine for some real use-cases — the management plane isn't always the right answer. Below is the decision matrix: pick the column whose row matches your situation.
| Use manual when… | Use Solo Enterprise for Istio when… |
|---|---|
| Single-cluster mesh (peering doesn't apply; no Day-2 multi-cluster surface to manage) | More than 2 clusters in the same mesh — the operational cost crosses over fast |
| POC, learning, internal demo where the value is "see the plumbing" | Production with compliance / audit requirements — SOC 2, PCI, DORA, HIPAA. The audit trail is the feature |
| Two clusters that won't change for the lifetime of the project | Clusters added / removed on a normal Day-2 cadence (regions, M&A, tenants) |
| You have a strong PKI team running cert-manager + Vault and can write the controllers that distribute intermediates yourself | You want certificate lifecycle managed for you — rotation, revocation, Vault / AWS PCA integration, and rotation reporting |
| Your goal is "show my SREs how the *.mesh.internal hostname works" and stop there | You want one policy (AccessPolicy, Workspace) to govern N clusters from one source of truth |
| You're rebuilding the cluster every Friday because it's a dev environment and forever is "until 5pm" | You need to answer "who registered cluster X, when, with what trust domain" in one kubectl get |
- If your team's questions are "how do we rotate the root CA without an outage?", "who can register a new cluster?", "how do we audit policy changes across clusters?", start with the management plane. That's the production-readiness story it's built for.
- If they're "how does HBONE actually work?", "what's in ztunnel's xDS?", "what does an east-west gateway look like on the wire?", stand up the manual version first so the plumbing is in your head. Layer the management plane on top later.
How you'd actually install it
This section walks through the install as it ships on Solo Enterprise for Istio 2.12.4. The shape is the same in production (on cloud Kubernetes or on-prem); local-dev with kind has one extra step around east-west GW exposure, called out explicitly below.
About — what this does & why
What: A two-step install — first the management plane goes onto its own cluster (typically a dedicated, low-traffic cluster, since it's a control plane in its own right), then each workload cluster registers with it via its relay agent connecting back to the mgmt-server.
Why: The split keeps the blast radius of management-plane changes off your workload clusters. The relay-agent on each workload cluster is the only mgmt-plane component that lives next to your apps.
# 1. CRDs on EVERY cluster (mgmt + workload). installEnterpriseCrds=true is
# what brings in AccessPolicy / Workspace / WorkspaceSettings — the CRs the
# mgmt plane uses to translate central policy into per-cluster
# AuthorizationPolicy. The Solo documented Ambient pattern uses
# installEnterpriseCrds=false; do NOT use that pattern if you want
# centralised RBAC — flip to true and resolve any agentgateway-crds
# co-install conflict (the two charts collide on authconfigs +
# ratelimitconfigs; uninstall enterprise-agentgateway on the mgmt cluster
# or keep agentgateway off the mgmt-plane cluster entirely).
helm upgrade -i gloo-platform-crds gloo-platform/gloo-platform-crds \
--kube-context $MGMT -n gloo-mesh --create-namespace \
--version 2.12.4 \
--set installEnterpriseCrds=true \
--set featureGates.ConfigDistribution=true # mgmt only
helm upgrade -i gloo-platform-crds gloo-platform/gloo-platform-crds \
--kube-context $WORKLOAD -n gloo-mesh --create-namespace \
--version 2.12.4 \
--set installEnterpriseCrds=true
# 2. Management plane (mgmt-server + UI + Prometheus + relay + cert-gen job)
# on the mgmt cluster. The chart generates the relay-*-tls-secret family
# on first install; never partial-rollout these (an out-of-sync chain
# across relay-root / relay-server / relay-client breaks the agent
# handshake — clean uninstall + reinstall is the production-safe recovery).
helm upgrade -i gloo-platform gloo-platform/gloo-platform \
--kube-context $MGMT -n gloo-mesh \
--version 2.12.4 -f mgmt-values.yaml \
--set licensing.glooMeshLicenseKey=$GLOO_MESH_LICENSE_KEY
# 3. Each workload cluster: cross-apply the relay TLS secrets from the mgmt
# cluster, then install the agent. The agent dials the mgmt-server over
# mTLS using those secrets — its serverAddress points at the mgmt-server
# Service IP (LoadBalancer recommended for cross-host reach).
kubectl --context $MGMT -n gloo-mesh get secret relay-root-tls-secret \
-o yaml | kubectl --context $WORKLOAD -n gloo-mesh apply -f -
# (repeat for relay-identity-token-secret; for relay-client-tls-secret, read the
# per-cluster caveat in the install details below before copying it verbatim)
helm upgrade -i gloo-platform gloo-platform/gloo-platform \
--kube-context $WORKLOAD -n gloo-mesh \
--version 2.12.4 -f agent-values.yaml \
--set licensing.glooMeshLicenseKey=$GLOO_MESH_LICENSE_KEY
# 4. Register the workload cluster on the mgmt plane (one CR).
kubectl --context $MGMT apply -f - <<'EOF'
apiVersion: admin.gloo.solo.io/v2
kind: KubernetesCluster
metadata: { name: prod-eu-west-3, namespace: gloo-mesh }
spec: { clusterDomain: cluster.local }
EOF
# 5. Verify on the mgmt cluster:
kubectl --context $MGMT -n gloo-mesh get kubernetescluster
# NAME STATUS
# mgmt ACCEPTED
# prod-eu-west-3 ACCEPTED
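The verification in step 5 can be scripted so that a pipeline fails when a cluster has not been accepted. A minimal sketch, assuming the table layout shown above with STATUS as the last column; `clusters_not_in_status` is a hypothetical helper, and the status value to expect is whatever your version of the mgmt plane reports:

```shell
#!/usr/bin/env bash
# Read `kubectl get kubernetescluster` table output on stdin and print the
# names of clusters whose last column differs from the expected status.
clusters_not_in_status() {
  local want="$1"
  awk -v want="$want" 'NR > 1 && $NF != want { print $1 }'
}

# Example:
#   kubectl --context "$MGMT" -n gloo-mesh get kubernetescluster \
#     | clusters_not_in_status ACCEPTED
```

An empty result means every listed cluster reports the expected status. Parsing table output is fragile across versions, so treat this as a convenience check rather than a contract.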
About — four install details worth knowing up front
The shape of the install is straightforward. Four details are worth knowing before you start, because they cross multiple layers of the stack and aren't obvious from any one README:
- Prefer Helm directly over meshctl install --profile gloo-mesh-mgmt for the mgmt+agent topology. The combined profile is intended for single-cluster demos; for a production multi-cluster install with explicit values, helm install gloo-platform (as shown above) gives you the same result with fewer layers and a values file you can commit to git. This is also the path that super-quick.sh in this repo codifies.
- Set the mgmt-server Service to LoadBalancer when the agent runs in a different cluster. The chart's default Service type is ClusterIP, which is sufficient when the agent runs in the mgmt cluster. The moment a workload cluster needs to reach the relay across a network boundary, the agent needs a routable target: LoadBalancer (MetalLB on kind, a real cloud LB in production) or an Ingress in front of port 9900.
- The relay-*-tls-secret family must stay in lockstep. relay-root-tls-secret, relay-server-tls-secret, relay-client-tls-secret, relay-tls-signing-secret, and relay-identity-token-secret are generated by the chart's cert-gen Job on first install and form one cryptographic chain. The mgmt-server reads them by name via command-line flags rather than by volume mount, so a kubectl rollout restart won't re-pick them up if one drifts. If the agent handshake starts failing with "authentication handshake failed: EOF", the production-safe recovery is helm uninstall gloo-platform followed by a fresh helm install so the cert-gen Job re-runs and produces the whole family atomically.
- Mint a per-cluster relay-client-tls-secret; don't copy the mgmt cluster's. The chart's cert-gen Job bakes the mgmt cluster's name into the certificate as CN=<mgmt-cluster> with a matching SAN. If you cross-apply that secret verbatim to a workload cluster, the mgmt-server's mTLS handshake sees the wrong CN and bins that workload's inventory under the mgmt cluster's name: the Workspace ends up with numSelectedClusters: 1, the AccessPolicy translation never lands on the workload, and you'll spend a long afternoon chasing it. The mgmt-server's relay-tls-signing-secret is the CA that should mint per-cluster client certs; that's what meshctl cluster register does internally. super-quick.sh performs the equivalent step inline with openssl so the workload cluster's relay-client-tls-secret carries CN=<workload-cluster> as it should.
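The per-cluster minting step in the last bullet can be illustrated with plain openssl. This is a sketch only, not the contents of super-quick.sh: it generates a throwaway CA in place of the real relay-tls-signing-secret so it runs anywhere, and the function name mint_client_cert, the file names, and the DNS-type SAN are all assumptions made for illustration.

```shell
# Sketch: mint a relay client cert for one workload cluster.
#   $1 = workload cluster name, $2 = output directory
# A throwaway CA is generated here so the example is self-contained;
# in a real install the signing key pair would come from the mgmt
# cluster's relay-tls-signing-secret instead.
mint_client_cert() {
  cluster=$1 out=$2
  mkdir -p "$out"

  # Stand-in signing CA (replace with the real signing CA in practice).
  openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=relay-signing-ca" \
    -keyout "$out/ca.key" -out "$out/ca.crt" 2>/dev/null || return 1

  # Private key and CSR carrying CN=<workload-cluster>.
  openssl req -new -newkey rsa:2048 -nodes \
    -subj "/CN=$cluster" \
    -keyout "$out/client.key" -out "$out/client.csr" 2>/dev/null || return 1

  # Sign the CSR and add a SAN matching the CN (DNS type is an assumption).
  printf 'subjectAltName=DNS:%s\n' "$cluster" > "$out/san.ext"
  openssl x509 -req -days 1 -in "$out/client.csr" \
    -CA "$out/ca.crt" -CAkey "$out/ca.key" -CAcreateserial \
    -extfile "$out/san.ext" -out "$out/client.crt" 2>/dev/null || return 1

  # Check the chain, then print the subject for a visual CN check.
  openssl verify -CAfile "$out/ca.crt" "$out/client.crt" >/dev/null || return 1
  openssl x509 -in "$out/client.crt" -noout -subject
}

mint_client_cert prod-eu-west-3 "$(mktemp -d)"
```

The printed subject should name the workload cluster, not the mgmt cluster; that is the property the mgmt-server relies on when it files the agent's inventory.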
Cross-cluster peering on Ambient: auto vs manual
Once both clusters are registered with the mgmt plane, they need data-plane peering —
the istio-remote Gateway in each cluster that points at the other's east-west GW
so HBONE traffic can flow. In Solo Enterprise for Istio 2.12 there are two paths to wire this.
Both ride on the same shared root CA + per-cluster intermediates (in the cacerts
Secret) that the manual lab on the left already establishes — Ambient's trust pipeline is
bootstrapped outside the mgmt plane, which is also why the older
RootTrustPolicy CR is documented as superseded in the Ambient install guide.
One feature gate, mgmt plane wires it
Set featureGates.ConfigDistribution=true on the mgmt-cluster
gloo-platform release and PEERING_AUTOMATIC_LOCAL_GATEWAY=true on
every istiod. The mgmt server picks up each cluster's local east-west Gateway, replicates
a matching istio-remote peer Gateway into every other registered cluster, and
keeps them in sync as clusters come and go.
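# helm upgrade resets values you don't pass again; add --reuse-values
# or -f <your values file> to keep the settings from the original install.
helm upgrade gloo-platform gloo-platform/gloo-platform \
  --kube-context $MGMT -n gloo-mesh \
  --set featureGates.ConfigDistribution=true
# Repeat for every cluster context $C:
kubectl --context $C -n istio-system patch deploy istiod-gloo --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-",
       "value":{"name":"PEERING_AUTOMATIC_LOCAL_GATEWAY","value":"true"}}]'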
The documentation for this path (multi-cluster/automatic-peering-ambient-multicluster.md) states: "This feature is
beta. Do not use it in production." Treat it as a preview that will
become the default. It is useful for greenfield clouds where the east-west Gateway's
LoadBalancer EXTERNAL-IP is routable from every peer cluster, which is the assumption it bakes in.
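Because the istiod setting has to land on every cluster, a small loop helps. The sketch below swaps in kubectl set env instead of the JSON patch, since it can be re-run without appending a second copy of the variable. The helper names run and enable_auto_peering, and the context names in the example call, are placeholders; with DRY_RUN left at its default of 1 the commands are only printed.

```shell
# Print the command when DRY_RUN=1 (the default here), otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Set PEERING_AUTOMATIC_LOCAL_GATEWAY on istiod in each kube context given.
enable_auto_peering() {
  for ctx in "$@"; do
    run kubectl --context "$ctx" -n istio-system set env deploy/istiod-gloo \
      PEERING_AUTOMATIC_LOCAL_GATEWAY=true
    run kubectl --context "$ctx" -n istio-system rollout status deploy/istiod-gloo
  done
}

# Example call with two placeholder context names.
enable_auto_peering west east
```

Set DRY_RUN=0 once the printed commands look right for your contexts.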
One Gateway per peer, explicit addresses
Apply an istio-remote Gateway in each cluster's
istio-eastwest namespace for each peer, with the peer's east-west GW address
hard-coded. This is the GA pattern (multi-cluster/east-west-gateway-peering.md)
and the one that works regardless of whether peer addresses are LoadBalancer IPs, NodePort,
or — in our kind-on-two-Macs case — a LAN IP republished by socat.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: manual-peer-west-mini
  namespace: istio-eastwest
  annotations:
    # MUST match the peer's east-west GW SA; SPIFFE SAN verification uses this
    gateway.istio.io/service-account: istio-eastwest
    gateway.istio.io/trust-domain: cluster.local
  labels:
    topology.istio.io/cluster: west-mini
    topology.istio.io/network: west-mini
spec:
  gatewayClassName: istio-remote
  addresses:
  - { type: IPAddress, value: 192.168.1.18 }   # peer's east-west GW LAN-routable IP
  listeners:
  - { name: cross-network, port: 15008, protocol: HBONE, tls: { mode: Passthrough }, allowedRoutes: { namespaces: { from: Same } } }
  - { name: xds-tls, port: 15012, protocol: TLS, tls: { mode: Passthrough }, allowedRoutes: { namespaces: { from: Same } } }
Without gateway.istio.io/service-account: istio-eastwest, the ztunnel client expects
the SPIFFE SAN to derive from the Gateway's name, not the peer's actual
east-west GW ServiceAccount, and mTLS fails with "peer did not present the expected SAN".
It's a one-line workaround for what is otherwise the cleanest path.
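To see which identity the annotations pin, it helps to spell out the SPIFFE ID that Istio derives from a trust domain, namespace, and ServiceAccount. A minimal sketch follows; the helper name spiffe_id is illustrative, and the example assumes the peer's east-west gateway runs in the istio-eastwest namespace under the istio-eastwest ServiceAccount.

```shell
# Build the SPIFFE ID Istio assigns to a ServiceAccount:
#   spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>
spiffe_id() {
  printf 'spiffe://%s/ns/%s/sa/%s\n' "$1" "$2" "$3"
}

# Trust domain and ServiceAccount from the annotations above; the
# namespace is assumed to match on the peer cluster.
spiffe_id cluster.local istio-eastwest istio-eastwest
# prints spiffe://cluster.local/ns/istio-eastwest/sa/istio-eastwest
```

If the peer's gateway uses a different ServiceAccount or trust domain, the annotation values on the istio-remote Gateway need to change to match.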
The east-west Gateway's EXTERNAL-IP is read from .status.loadBalancer.ingress[0]. On kind that address is a Docker-bridge IP
(e.g. 172.18.255.100), which is not routable from another physical host. For local-dev
scenarios where you want to mirror the production topology on two laptops, the repo includes
helpers: scripts/expose-ew-on-host.sh republishes the east-west GW on the host's
LAN IP via a socat container, and scripts/super-quick.sh chains that helper
with the manual peer-Gateway pattern (LAN IPs + the
gateway.istio.io/service-account annotation) to stand the whole topology up in
one command:
./scripts/super-quick.sh # prompts for SSH user / host
./scripts/super-quick.sh --user <user> --host <host> # non-interactive
In a cloud environment (AKS, EKS, GKE) this step is unnecessary — the cloud LoadBalancer EXTERNAL-IP is already cross-cluster routable, so either the manual peering pattern (with the LB IP) or the auto-peering preview works directly.
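Whether the peer address is a LAN IP or a cloud LB IP, the manual pattern comes down to one manifest per peer. The sketch below templates that manifest from a peer name and address. The function name render_peer_gateway is hypothetical, and the sketch assumes every peer uses the istio-eastwest ServiceAccount, the cluster.local trust domain, and a network name equal to its cluster name, as in the west-mini example.

```shell
# Sketch: print one istio-remote peer Gateway manifest.
#   $1 = peer cluster name, $2 = peer east-west GW address
render_peer_gateway() {
  peer=$1 addr=$2
  cat <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: manual-peer-${peer}
  namespace: istio-eastwest
  annotations:
    gateway.istio.io/service-account: istio-eastwest
    gateway.istio.io/trust-domain: cluster.local
  labels:
    topology.istio.io/cluster: ${peer}
    topology.istio.io/network: ${peer}
spec:
  gatewayClassName: istio-remote
  addresses:
  - { type: IPAddress, value: ${addr} }
  listeners:
  - { name: cross-network, port: 15008, protocol: HBONE, tls: { mode: Passthrough }, allowedRoutes: { namespaces: { from: Same } } }
  - { name: xds-tls, port: 15012, protocol: TLS, tls: { mode: Passthrough }, allowedRoutes: { namespaces: { from: Same } } }
EOF
}

# Render the manifest for the west-mini peer used earlier.
render_peer_gateway west-mini 192.168.1.18
```

Piping the output into kubectl --context $C apply -f - on each of the other clusters applies it; looping over a list of peer/address pairs covers a larger mesh.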
gloo-platform and gloo-platform-crds in the gloo-mesh
namespace.
Where to next
- Solo AgentGateway Ambient Multicluster — Standup — the manual standup, so you can see the plumbing on kind before you decide
- Cloud Connectivity Lab — failover, waypoint, egress — the application labs that work the same way under both regimes
- Agentic / MCP Lab — MCP federation, JWT RBAC, SPIFFE authz under the manual standup
- docs.solo.io/gloo-mesh-enterprise — current install + CRD reference for the management plane
- docs.solo.io — manual multicluster ambient install — the canonical manual reference