MastertheMesh
Solo · Cloud Connectivity · Decision Document
Reference · No clusters required

Manual multi-cluster vs Solo Enterprise for Istio management plane

Tom O'Rourke
EMEA Field CTO · Solo.io

Both approaches give you the same data plane — Solo Istio Ambient with ztunnel and HBONE east-west peering. The operator experience and the production-safety story are completely different. This page walks through the seven Day-2 realities of running a multi-cluster mesh on each path, so you can pick the right one for your environment.

cert rotation · cluster registration · RBAC · audit · Day 2 ops · Solo Enterprise for Istio

Solo's product was renamed in 2.12 — what used to be called Gloo Mesh is now Solo Enterprise for Istio. The management plane on top (gloo-platform Helm chart, gloo-mesh namespace, meshctl CLI) keeps those original artifact names — that's still what shows up in your cluster — but the product name used in prose throughout this page is the new one.

Either path delivers the same wire protocol — Solo Istio Ambient pods talking HBONE on port 15008 to peer east-west gateways, identity in SPIFFE, mTLS everywhere. The choice is about who runs the trust pipeline, who carries the credential material, and who provides the audit trail. The questions that matter aren't about throughput; they're about Day 90 and the compliance audit that follows.

Why this matters

A working multi-cluster mesh is the start of the operator's problem, not the end. Stand-up is a one-time event; Day 2 is forever. The seven production realities below are where the manual path stops scaling — each one is a side-by-side comparison further down the page.

  1. Root CA rotation. Rare but high-stakes: driven by compromise, cryptographic obsolescence (e.g. moving from RSA-2048 to RSA-4096), organisational change (M&A, trust-domain restructure), or aged key material. Root certs typically live 10–25 years; the rotation needs to happen without breaking in-flight mTLS across every cluster in the mesh.
  2. Intermediate & workload-cert rotation. Intermediates typically rotate every 1–3 years; workload SPIFFE certs every ~24h by default. Every cluster × every pod × every day on the workload tier; at scale, automation isn't optional.
  3. Adding a cluster on Day 90. A new region, a new acquisition, a new tenant, all without touching the existing N clusters.
  4. Audit. "Who allowed cluster X to join the mesh on date Y, with what permissions?", answered in seconds, not days.
  5. RBAC consistency across N clusters. One policy in mgmt → enforced everywhere, no per-cluster drift.
  6. Federation across trust domains. M&A: a bank acquires a fintech, both already on Istio, both with their own root of trust.
  7. Observability. A single pane of glass across the mesh, vs N Grafanas, N Kialis, N sets of dashboards that drift.
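
To put a rough number on "every cluster × every pod × every day", here is a back-of-envelope sketch. The function name and the sample fleet size are illustrative assumptions, and it assumes exactly one re-issue per pod per 24h, in line with the default cadence quoted in item 2.

```shell
# Hypothetical helper: estimates workload-cert issuances per day, assuming
# each pod's SVID is re-issued once every 24h.
daily_issuances() {
  local clusters=$1 pods_per_cluster=$2
  echo $(( clusters * pods_per_cluster ))
}

daily_issuances 4 500    # 4 clusters x 500 pods each, prints 2000
```

Even a modest fleet lands in the thousands of issuances per day, which is why the workload tier is never rotated by hand.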

Side-by-side scenarios

For each scenario, the amber column is what the operator actually types on the manual path; the green column is what the operator types when the Solo Enterprise for Istio management plane is in front of the mesh. Same data plane underneath — different operator experience on top.

SCENARIO A

Bootstrap a new cluster into an existing 3-cluster mesh

You already have a working three-cluster ambient mesh. A new region (call it prod-eu-west-3) gets approved. The fourth kubeconfig lands on your laptop. What happens between "kubeconfig arrives" and "first workload accepts cross-cluster traffic"?

About — what's the real cost?

The honest cost on the manual path: roughly 20–30 minutes per new cluster in a 3-existing-cluster mesh, assuming everything works first time. The dominant cost isn't the typing — it's the private key material being copied over scp (root CA private key) and the N×(N−1) peer exchange: every existing cluster needs the new cluster's peer bundle, and the new cluster needs everyone else's. That copying is the operation a compliance auditor will ask you about in twelve months.

The honest cost on the management-plane path: roughly 60 seconds, because cluster registration is one CR (`KubernetesCluster` in the mgmt namespace) plus a service-account token bound to it. The mgmt plane already holds the trust anchor; it issues a per-cluster intermediate and pushes it via the agent running on the new cluster. No private key crosses your laptop.

Manual

What the operator actually types

Generate a new intermediate from the shared root CA, copy root-ca.crt and root-ca.key to the new host, install the mesh, expose the east-west GW, and cross-apply remote-secrets 2×N times (every existing cluster ↔ the new cluster).

# 1. On a trusted ops host: mint a new intermediate from the shared root
./gen-intermediate.sh prod-eu-west-3 \
  --root-ca-cert ~/safe/root-ca.crt \
  --root-ca-key  ~/safe/root-ca.key \
  --out ./prod-eu-west-3-cacerts

# 2. Create the cacerts Secret on the new cluster (scp the material first
#    if kubectl runs on a different host from the one that minted it)
kubectl --context prod-eu-west-3 create ns istio-system
kubectl --context prod-eu-west-3 -n istio-system create secret generic cacerts \
  --from-file=./prod-eu-west-3-cacerts/

# 3. Stand up Istio Ambient + east-west GW + peering chart on the new cluster
helm install istio-base ... ; helm install istiod-gloo ... ;
helm install istio-cni ... ; helm install ztunnel ... ;
helm install peering ...   # exposes the istio-eastwest GW

# 4. Cross-apply remote-secrets. Only the 6 pairs that involve the new
#    cluster are new; this loop re-applies all 4 x 3 = 12 (apply is idempotent).
for SRC in cluster-1 cluster-2 cluster-3 prod-eu-west-3; do
  for DST in cluster-1 cluster-2 cluster-3 prod-eu-west-3; do
    [ "$SRC" = "$DST" ] && continue
    istioctl create-remote-secret --context "$SRC" --name "$SRC" \
      | kubectl --context "$DST" apply -f -
  done
done

# 5. Restart istiod everywhere so it re-reads the new remote-secrets
for C in cluster-1 cluster-2 cluster-3 prod-eu-west-3; do
  kubectl --context $C -n istio-system rollout restart deploy/istiod-gloo
done
Real cost: ~20–30 min per cluster, root-CA private key copied over scp, 12 cross-cluster apply operations for a 4-cluster final state, plus a rolling istiod restart that briefly stalls xDS for in-flight pods.
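
As a sketch of the "2×N" claim, the dry run below prints (without executing) only the cross-applies that are new when one cluster joins. The `plan_join` function is a hypothetical helper; the printed commands mirror the `istioctl create-remote-secret` pipeline from the listing above.

```shell
# Dry run: print the cross-applies needed when ONE cluster joins an existing
# mesh. Each existing peer needs the new cluster's remote-secret, and the new
# cluster needs each peer's: 2 per existing peer.
plan_join() {
  local new=$1; shift
  local peer
  for peer in "$@"; do
    echo "istioctl create-remote-secret --context $new --name $new | kubectl --context $peer apply -f -"
    echo "istioctl create-remote-secret --context $peer --name $peer | kubectl --context $new apply -f -"
  done
}

plan_join prod-eu-west-3 cluster-1 cluster-2 cluster-3   # prints 6 commands
```

Reviewing the printed plan before running anything is one way to keep a record of exactly which secrets crossed which cluster boundary.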
Solo Enterprise for Istio

What the operator actually types

Register the kubeconfig with the management plane. The mgmt agent on the new cluster handles intermediate-CA issuance and peer-secret distribution; the existing clusters are not touched.

# Single command on the ops host. The mgmt plane already trusts the
# root CA and issues a per-cluster intermediate. The agent installed
# on the new cluster pulls config from the mgmt plane.
meshctl cluster register prod-eu-west-3 \
  --mgmt-context    mgmt \
  --remote-context  prod-eu-west-3 \
  --version         2.12.0

# Verify (output below is illustrative; the columns shown depend on the CRD version)
kubectl --context mgmt -n gloo-mesh get kubernetescluster prod-eu-west-3
# NAME              AGE   AGENT-VERSION   STATUS
# prod-eu-west-3    34s   2.12.0          Connected

Real cost: ~60 seconds, no private key leaves the mgmt cluster, no operations on the existing N clusters. The new cluster appears as a `KubernetesCluster` CR with a creation timestamp and an audit trail of who registered it.

SCENARIO B

Rotate the root CA

Root CAs live a long time (10–25 years is typical), so this isn't a routine cadence — but when it has to happen, it has to happen without breaking in-flight mTLS across every cluster in the mesh. Real triggers: compromise / breach response, cryptographic obsolescence (e.g. RSA-2048 → RSA-4096 or an algorithm change), M&A trust-domain restructure, or aged key material approaching its policy limit. The rotation runs through a dual-trust window where both old and new roots are accepted simultaneously, then the old one is retired.

About — what's the real cost?

Why this is the scariest scenario on the manual path: if you update one cluster's cacerts Secret to use a new intermediate signed by the new root, but a peer cluster hasn't yet received the new root in its root-cert.pem trust bundle, that peer will reject the SVID it receives over HBONE, and you've just broken cross-cluster mTLS mid-rotation. There is no rollback button that's fast enough; you have to manually re-sequence.

What the management plane gives you: the rotation is orchestrated. The mgmt plane knows the topology, knows which clusters trust which roots, and rolls the change through them in an order it has proven safe. It also integrates upstream: your RootTrustPolicy can source the new root from HashiCorp Vault or AWS Private CA rather than a file you generated by hand. That's the integration most enterprise PKI teams actually need to sign off on the design.

Manual

What the operator actually types

Generate a new root, regenerate every cluster's intermediate, distribute via cacerts Secret update across all N clusters in lockstep (otherwise mTLS breaks mid-rotation), restart istiod everywhere, hope nobody's mid-handshake.

# 1. Generate the new root CA (on a trusted ops host)
./gen-root-ca.sh --out ~/safe/root-ca-2026Q2

# 2. For each cluster: regenerate the intermediate signed by the NEW root,
#    but keep the OLD root in the trust chain during the transition
for C in cluster-1 cluster-2 cluster-3; do
  ./gen-intermediate.sh $C \
    --root-ca-cert ~/safe/root-ca-2026Q2/root-ca.crt \
    --root-ca-key  ~/safe/root-ca-2026Q2/root-ca.key \
    --out ./tmp/$C

  # Patch the cacerts Secret with BOTH roots in root-cert.pem (old + new).
  # ca-cert.pem must stay the intermediate's own certificate, so leave it alone.
  cat ~/safe/root-ca-2025Q1/root-ca.crt \
      ~/safe/root-ca-2026Q2/root-ca.crt \
      > ./tmp/$C/root-cert.pem

  kubectl --context $C -n istio-system create secret generic cacerts \
    --from-file=./tmp/$C/ --dry-run=client -o yaml \
    | kubectl --context $C -n istio-system apply -f -
done

# 3. Rolling istiod restart so it picks up the new intermediate
for C in cluster-1 cluster-2 cluster-3; do
  kubectl --context $C -n istio-system rollout restart deploy/istiod-gloo
  kubectl --context $C -n istio-system rollout status  deploy/istiod-gloo
done

# 4. Wait for every workload pod to have re-issued an SVID from the new
#    intermediate (default rotation period ~24h). Only then is it safe to
#    remove the OLD root.

# 5. Drop the OLD root from root-cert.pem and re-apply cacerts everywhere
Real cost: a multi-day change-window in regulated environments. Lockstep ordering across clusters is on the operator. The "drop the OLD root" step at the end is the one that bites — most teams leave it in forever to be safe, which defeats the rotation.
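
The lockstep ordering can be pictured as two gates. The sketch below checks both against a hypothetical inventory format (one line per cluster: name, the roots it trusts, the root its SVIDs currently chain to). The inventory format and function names are assumptions for illustration; collecting that state from real clusters is not shown.

```shell
# Hypothetical inventory on stdin, one line per cluster:
#   <cluster> <comma-separated roots it trusts> <root its SVIDs chain to>

# Gate 1: intermediates may move to the new root only once EVERY cluster
# already trusts it; otherwise a peer rejects the first new SVID it sees.
safe_to_switch() {
  awk '{ if (("," $2 ",") !~ /,new,/) bad = 1 } END { exit bad + 0 }'
}

# Gate 2: the old root may be dropped only once EVERY cluster's SVIDs
# chain to the new root.
safe_to_drop_old() {
  awk '$3 != "new" { bad = 1 } END { exit bad + 0 }'
}

printf '%s\n' 'cluster-1 old,new old' 'cluster-2 old old' \
  | safe_to_switch || echo "hold: a peer does not trust the new root yet"
```

Gate 2 is the check that tends to get skipped, which is how the old root ends up staying in the bundle indefinitely.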
Solo Enterprise for Istio

What the operator actually types

Rotate the root in the mgmt plane's RootTrustPolicy CR (or upstream in Vault / AWS Private CA, which the mgmt plane reads as its source-of-truth). The mgmt plane orchestrates rolling rotation across all registered clusters.

# With Vault as the upstream root of trust:
kubectl --context mgmt -n gloo-mesh apply -f - <<'EOF'
apiVersion: admin.gloo.solo.io/v2
kind: RootTrustPolicy
metadata:
  name: root-trust
  namespace: gloo-mesh
spec:
  config:
    mgmtServerCa:
      generated: {}
    intermediateCertOptions:
      secretRotationGracePeriodRatio: 0.5    # auto re-issue intermediates at 50% TTL
    autoRestartPods: true                    # gracefully roll workloads after rotation
    agentCa:
      vault:                                  # source of truth = Vault
        caPath: pki_root/ca
        csrPath: pki_root/sign/intermediate
        server: https://vault.example.com:8200
        kubernetesAuth:
          role: gloo-mesh-mgmt
EOF

# When the auditor asks "was the rotation successful on cluster-3?"
kubectl --context mgmt -n gloo-mesh get kubernetescluster cluster-3 -o yaml \
  | yq '.status.caStatus'
Real cost: change-window measured in minutes, not days. Vault or AWS Private CA owns the root key material — the operator never touches it. Lockstep ordering is the mgmt plane's responsibility, not yours. The audit answer is a `kubectl get` away.
Pin the exact CRD names to your version. The RootTrustPolicy CR above is the long-standing Gloo-Mesh-managed PKI primitive. Field names and the exact upstream integrations (Vault, AWS Private CA, cert-manager) evolve across Solo Enterprise for Istio minor versions — always check the current docs before writing the YAML you'll commit.
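
To make the `secretRotationGracePeriodRatio: 0.5` comment concrete, here is a small arithmetic sketch. It assumes the ratio means "re-issue once the remaining lifetime drops to ratio × TTL"; that reading, the helper name, and the one-year TTL are assumptions to confirm against the docs for your version.

```shell
# reissue_after_hours <ttl_hours> <grace_ratio_percent>
# Assumption: re-issue starts once the remaining lifetime drops to
# ratio x TTL, i.e. (1 - ratio) x TTL hours after issuance.
reissue_after_hours() {
  local ttl=$1 ratio_pct=$2
  echo $(( ttl * (100 - ratio_pct) / 100 ))
}

reissue_after_hours 8760 50   # 1-year intermediate, ratio 0.5, prints 4380
```

The ratio is passed as a percentage only because shell arithmetic is integer-only.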
SCENARIO C

Rotate the remote-secret tokens

The kubeconfig Secrets that one cluster's istiod uses to discover the others (the istio-remote-secret-* Secrets in istio-system) embed bearer tokens. Compliance frameworks treat them like any other API credential: rotate on a schedule, rotate on personnel change, rotate on suspected compromise.

About — what's the real cost?

The hidden cost on the manual path is that nobody actually does this rotation, because doing it correctly means N×(N−1) coordinated kubectl applies and a per-token validity check. So in practice the tokens just live forever, which is exactly the failure mode the rotation policy was supposed to prevent.

What the management plane changes: the mgmt plane re-issues the trust material via its agent-server pipeline. No long-lived bearer token sits in a kubeconfig in every cluster — the relay agents authenticate to the mgmt plane via mTLS using identities the mgmt plane itself issues.

Manual

What the operator actually types

Per-cluster: regenerate the bound service-account token, build a new remote-secret YAML, kubectl apply on every peer, verify istiod picked it up. N×(N−1) cross-applies.

# For each "source" cluster, rebuild its istio-reader token,
# then apply the resulting kubeconfig Secret on every PEER cluster.
for SRC in cluster-1 cluster-2 cluster-3; do
  # Re-create the bound token (the old one keeps working until TTL expires)
  kubectl --context $SRC -n istio-system create token istio-reader-service-account \
    --duration 8760h > /tmp/$SRC.token

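  # Rebuild the kubeconfig Secret YAML with the new token.
  # /tmp/$SRC.ca.crt is the source cluster's API-server CA, exported beforehand
  # (not shown). The jsonpath assumes the kubeconfig cluster entry has the
  # same name as the context.
  ./build-remote-secret.sh \
    --context  $SRC \
    --token    /tmp/$SRC.token \
    --ca-cert  /tmp/$SRC.ca.crt \
    --server   $(kubectl --context $SRC config view -o jsonpath="{.clusters[?(@.name=='$SRC')].cluster.server}") \
    > /tmp/istio-remote-secret-$SRC.yaml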

  # Apply it on every PEER cluster
  for DST in cluster-1 cluster-2 cluster-3; do
    [ "$SRC" = "$DST" ] && continue
    kubectl --context $DST apply -f /tmp/istio-remote-secret-$SRC.yaml
  done
done

# For 3 clusters: 6 cross-applies. For 10 clusters: 90.
Real cost: the operation that gets skipped. Token material has to live on the ops host while you do it (file in /tmp with a bearer token in it). The per-cluster verify-and-rollback story is on you.
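
The "per-token validity check" can be sketched by reading the `exp` claim out of the token itself, since bound service-account tokens are JWTs. The helper names are hypothetical, and the sketch assumes GNU coreutils `base64`. It does not verify the signature; it only reads the expiry.

```shell
# token_exp <jwt>: print the `exp` claim (Unix seconds) of a service-account
# token. A JWT is header.payload.signature, each part base64url-encoded
# without padding.
token_exp() {
  local payload
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url drops, so `base64 -d` accepts it.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d \
    | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# token_is_live <jwt>: succeed only if the token has not expired yet.
token_is_live() {
  local exp
  exp=$(token_exp "$1")
  [ -n "$exp" ] && [ "$exp" -gt "$(date +%s)" ]
}
```

Run against each `/tmp/$SRC.token` before applying it, this catches a token that was minted with the wrong duration before it reaches a peer cluster.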
Solo Enterprise for Istio

What the operator actually types

Handled automatically — the management plane re-issues the trust material to its workload agents via its own pipeline. No human-touched secret material.

# Nothing.
#
# The relay agents on workload clusters authenticate to the mgmt
# server via mTLS — using identities the mgmt plane itself issues
# and rotates on the schedule you set in RootTrustPolicy.
#
# For visibility into the rotation cadence:
kubectl --context mgmt -n gloo-mesh get kubernetescluster -o wide
# NAME         STATUS      AGENT VERSION   LAST HEARTBEAT
# cluster-1    Connected   2.12.0          2026-05-15T08:14:33Z
# cluster-2    Connected   2.12.0          2026-05-15T08:14:35Z
# cluster-3    Connected   2.12.0          2026-05-15T08:14:31Z
Real cost: zero operator action. The credential is a mesh-internal mTLS identity, not a kubeconfig with a bearer token, so the threat model is the same as the rest of the mesh.

SCENARIO D

AccessPolicy across all clusters

"Only services in the payments namespace can call billing" — the same rule needs to be in force on every cluster that runs either workload, and the rule must stay consistent as the team adds and removes services.

About — what's the real cost?

The manual path is fine until your kustomize overlays drift. The honest failure mode isn't writing the policy — it's that a junior engineer fixes a 503 in production by adding "*" to an overlay on cluster-2 to "unblock the incident", then forgets to propagate the rollback. Three months later, cluster-2 has a permissive policy and cluster-1 has the strict one and nobody has noticed.

What the management plane changes: one CR in the mgmt namespace, translated to per-cluster AuthorizationPolicy automatically. There is no overlay to drift. The audit query is "show me every cluster where this policy is in force" and the answer is in the mgmt plane's status.

Manual

What the operator commits to git

Write AuthorizationPolicy once per cluster, maintain in sync via GitOps with N kustomize overlays.

# overlays/cluster-1/billing-allow-payments.yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: billing-allow-payments
  namespace: billing
spec:
  selector: { matchLabels: { app: billing } }
  action: ALLOW
  rules:
    - from:
      - source:
          principals:
          - "cluster.local/ns/payments/sa/payments"

# ... and the same file copied into overlays/cluster-2/...
# ... and overlays/cluster-3/...
# When the SA name changes, you change it in N places.
Real cost: the per-cluster overlay is the drift vector. Identity matching works (SPIFFE survives the federation), but governance does not.
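
One way to catch overlay drift before an auditor does is to count how many distinct versions of the same policy file exist across the overlays. This is a minimal sketch; the `count_versions` helper is hypothetical and the directory layout is taken from the comments in the listing above.

```shell
# count_versions <overlays-dir> <file-name>
# Prints the number of distinct versions of <file-name> found across
# <overlays-dir>/*/ ; anything above 1 means the overlays have drifted.
count_versions() {
  local dir=$1 name=$2 f
  for f in "$dir"/*/"$name"; do
    [ -f "$f" ] && cksum < "$f"
  done | sort -u | wc -l | tr -d ' '
}

# Example, e.g. as a CI step:
#   [ "$(count_versions overlays billing-allow-payments.yaml)" -le 1 ] \
#     || echo "drift: overlays disagree on billing-allow-payments.yaml"
```

A checksum comparison only works when the overlays are meant to be byte-identical; overlays that legitimately differ per cluster need a semantic diff instead.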
Solo Enterprise for Istio

What the operator commits to git

One CR in the mgmt plane, translated to per-cluster AuthorizationPolicy on every workload cluster automatically. Single source of truth.

# One file — applied to the mgmt cluster.
# The mgmt plane translates this to per-cluster AuthorizationPolicy
# in every workload cluster where billing or payments exists.
apiVersion: security.policy.gloo.solo.io/v2
kind: AccessPolicy
metadata:
  name: billing-allow-payments
  namespace: gloo-mesh
spec:
  applyToDestinations:
    - selector:
        labels: { app: billing }
  config:
    authn:
      mtls: { required: true }
    authz:
      allowedClients:
        - serviceAccountSelector:
            labels: { app: payments }
Real cost: one CR, one git file, one source of truth. The mgmt plane's status surface tells you which clusters translated it and which didn't. Audit is single-pane.

Pin the CRD apiVersion to your install. Solo's policy CRDs live in the *.gloo.solo.io group and the exact kind + apiVersion track the management-plane minor version. The agentic-AI side of the stack uses a different family (policy.kagent-enterprise.solo.io/AccessPolicy) for waypoint-attached runtime authz, separate from the mesh-wide AccessPolicy shown above. Read the current policies docs before writing the YAML you'll commit.

SCENARIO E

Federate a new Service on Day 90 — "make it global"

A new microservice needs to be reachable across the mesh. On Day 1 you knew which Services needed the solo.io/service-scope=global label and you set it. On Day 90 a team adds a new service and asks "how do we make this one global?"

About — what's the real cost?

The manual path works if you remember every cluster. Label the Service on every cluster that runs it. Verify istiod picked it up. Debug per-cluster if it didn't. The cost grows linearly with the number of clusters, and every team that wants a service made global repeats the whole loop.

The management plane gives you one declarative intent. Solo 2.12 introduced the Segment CR — the Ambient-native global-aliasing primitive that fronts a Service across every cluster registered with the mesh. Apply once at the mgmt layer; the translation (the solo.io/service-scope=global label + any per-cluster plumbing) happens everywhere the segment resolves.

Why we don't use VirtualService or VirtualDestination here. Both are sidecar-era primitives that pre-date Ambient. They still exist for back-compat with Gloo Mesh 2.x deployments running classic Istio sidecars, but Ambient's data plane (ztunnel + waypoints) doesn't honour them. Segment is what 2.12 added specifically because Ambient needed its own federation primitive.

Manual

What the operator types per-cluster

Label the Service on every cluster that runs it. Verify istiod programmed the synthetic global hostname. Debug per-cluster if it didn't.

# Per-cluster: label the Service, then verify
for C in cluster-1 cluster-2 cluster-3; do
  kubectl --context $C -n bookinfo label svc productpage \
    solo.io/service-scope=global --overwrite

  # Did istiod pick it up?
  istioctl --context $C multicluster check | grep "Shared Services"
done

# If one cluster shows zero shared services, you're debugging per-cluster:
#   - is istiod's clusterID right?
#   - is the network label on the namespace?
#   - is the multicluster license actually present (lt: ent, not lt: trial)?
Real cost: easy when it works, hard to debug when it doesn't, and the diagnosis loop is per-cluster.

Solo Enterprise for Istio

What the operator commits once

Declare the federation intent in one Segment CR on the mgmt plane. The mgmt plane handles per-cluster translation and surfaces the result in its status.

# One file, applied once on mgmt — the Ambient-native global-aliasing CR.
apiVersion: networking.solo.io/v2alpha1
kind: Segment
metadata:
  name: productpage-global
  namespace: gloo-mesh
spec:
  # Workloads anywhere in the mesh dial this hostname; istiod programs the
  # synthetic VIP (from 240.240.0.0/16) with endpoints from every cluster that
  # has a matching Service.
  hosts:
    - productpage.bookinfo.mesh.internal
  selector:
    namespace: bookinfo
    labels: { app: productpage }
  ports:
    - number: 9080
      protocol: HTTP

# Verify which clusters translated it (which workload clusters now have
# the Service labelled global and istiod programmed the synthetic VIP):
kubectl --context mgmt -n gloo-mesh get segment productpage-global \
  -o jsonpath='{.status.clusters}'
Real cost: one CR. A newly-registered cluster picks it up automatically the moment it joins the mesh. Status answers "which clusters is this in force on?" in one query — no per-cluster grep.
Verify the CR shape before you ship. Segment shipped in Solo Enterprise for Istio 2.12 specifically as the Ambient global-aliasing primitive. The exact field names above (selector, hosts, ports) follow the documented schema, but Solo iterates on these CRs at field level between releases, so check docs.solo.io for the version you're running before committing YAML to git. The concept is stable; the field names may shift from one release to the next.
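
For scripts that need the global hostname, a tiny helper keeps the naming in one place. The `<service>.<namespace>.mesh.internal` shape is inferred from the `productpage.bookinfo.mesh.internal` example above, not from a documented contract, and `global_host` is a hypothetical name.

```shell
# global_host <service> <namespace>
# Builds the mesh-wide hostname in the <service>.<namespace>.mesh.internal
# shape used by the Segment example (an inferred pattern, an assumption).
global_host() {
  printf '%s.%s.mesh.internal\n' "$1" "$2"
}

global_host productpage bookinfo   # prints productpage.bookinfo.mesh.internal
```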
SCENARIO F

Audit — who registered cluster X with the mesh?

The compliance auditor's exact question, recorded verbatim from a real engagement: "Show me, for each cluster currently in your mesh, the date it was added, the person who added it, and the trust-domain it was added under." You have 24 hours to produce the answer.

About — what's the real cost?

The manual path is "dig through git history". If your install scripts are in git, the answer is reconstructible — when was the peer-bundle for cluster-X first applied on each peer? Who pushed that commit? The reconstruction is possible but it's a forensic exercise, not a query.

The management plane gives you a CR with a creation timestamp. The `KubernetesCluster` CR in the mgmt namespace was created the moment cluster-X was registered. Standard Kubernetes audit-logging on the mgmt cluster captures the identity that did it. The auditor's question reduces to kubectl get.

Manual

What the operator does

Dig through git history of the peer-bundle apply YAML, hope someone kept ChatOps records, reconcile with the cert serial numbers visible in the intermediate cert chain.

# Best-case reconstruction:
git -C ops/gitops log --diff-filter=A -- "remote-secrets/cluster-prod-eu-west-3*.yaml"
#   commit abc123  Author: Jane Doe  Date: 2026-02-12
#   commit def456  Author: Jane Doe  Date: 2026-02-12

# What trust-domain was used? Inspect the intermediate cert SAN in the
# secret material — assuming it's still in the bucket:
openssl x509 -in ./tmp/prod-eu-west-3/ca-cert.pem -text -noout \
  | grep -A1 "Subject Alternative Name"

# What permissions were granted? The remote-secret is bound to a SA — go
# read that SA's RoleBindings on the source cluster.
Real cost: a half-day forensic reconstruction per cluster, dependent on ops history being intact. The auditor's report writes itself as "control inadequate; no single source of truth for cluster onboarding".

Solo Enterprise for Istio

What the operator does

Query the mgmt plane's KubernetesCluster CRs. When, by whom, with what trust domain, current health — all in one place.

# The CR exists from the moment of registration.
kubectl --context mgmt -n gloo-mesh get kubernetescluster -o wide

# NAME              CREATED         AGENT VERSION   TRUST DOMAIN     STATUS
# cluster-1         2025-08-12      2.12.0          cluster.local    Connected
# cluster-2         2025-08-12      2.12.0          cluster.local    Connected
# cluster-3         2025-08-12      2.12.0          cluster.local    Connected
# prod-eu-west-3    2026-02-12      2.12.0          cluster.local    Connected

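# Who did it? Events are short-lived (the API server keeps them for 1h by
# default) and do not record the requesting user, so use them only to
# confirm that the registration happened:
kubectl --context mgmt -n gloo-mesh get events \
  --field-selector involvedObject.name=prod-eu-west-3
# The durable "who" is the API-server audit log on the mgmt cluster: the
# create entry for this KubernetesCluster carries user.username and
# requestReceivedTimestamp (e.g. jane.doe@bank.example.com, 2026-02-12T10:14Z).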
Real cost: two queries. The auditor's report writes itself as "control adequate; single source of truth with timestamped audit trail".
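
As a sketch of the audit-log side of that answer, the helper below scans a JSON-lines audit log for the request that created a given KubernetesCluster object. `who_registered` is a hypothetical helper. Where the log lives depends on how the API server's audit backend is configured (managed Kubernetes services usually expose it through their own logging product), and a JSON-aware tool such as jq would be more robust than grep and sed.

```shell
# who_registered <audit-log-file> <cluster-name>
# Scans a JSON-lines API-server audit log for the request that created the
# named KubernetesCluster object; prints "<timestamp> <username>".
# Relies on the usual field order (user before requestReceivedTimestamp).
who_registered() {
  grep '"verb":"create"' "$1" \
    | grep '"resource":"kubernetesclusters"' \
    | grep "\"name\":\"$2\"" \
    | sed -n 's/.*"username":"\([^"]*\)".*"requestReceivedTimestamp":"\([^"]*\)".*/\2 \1/p'
}

# Example (the log path is an assumption):
#   who_registered /var/log/kubernetes/audit.log prod-eu-west-3
```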
SCENARIO G

Observability — one pane vs N panes

A user complains that productpage is slow. The call traverses three clusters. Where is the latency?

About — what's the real cost?

The manual path's failure mode is correlation. Each cluster has its own Prometheus, its own Grafana, its own Jaeger. When the call crosses clusters, the trace ID is preserved (Istio is good about this) but joining the segments requires either federated metrics storage (Thanos / Cortex / Mimir) that you set up yourself, or per-cluster tab-switching while your pager is going off.

The Gloo UI on the management plane consolidates this: service topology across the mesh, aggregated metrics, cluster-health view, and the cross-cluster trace joined for you. It's the same Prometheus data underneath — the difference is who does the federation.

Manual

What the operator does at 2am

Open three Grafana tabs. Open three Jaeger tabs. Join the trace IDs by hand. Reproduce the call. Eyeball.

# Per-cluster setup that the operator built and now maintains:
#   - Prometheus + node-exporter + kube-state-metrics  ×N clusters
#   - Grafana + dashboards                              ×N clusters
#   - Jaeger / Tempo                                    ×N clusters
#   - (optionally) Thanos / Mimir to federate metrics  — built by you
#
# Investigation flow:
#   1. Open Grafana on cluster-1, find the failing request
#   2. Note the trace ID
#   3. Open Jaeger on cluster-1, find the segment that exits the cluster
#   4. Switch to Jaeger on cluster-2, find the segment that enters
#   5. Repeat for cluster-3
#   6. Reconstruct the timeline in your head
Real cost: three tabs and a notebook. Trace IDs do correlate (Istio preserves them) but the joining is on you, every time.
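
The "reconstruct the timeline in your head" step can be approximated in a few lines, assuming each cluster's spans have been exported to a flat text file. The file format, `timeline`, and `slowest_hop` are all hypothetical. The sketch also assumes the clusters' clocks are synchronised closely enough for start times to be comparable.

```shell
# Each span file is a hypothetical per-cluster export, one span per line:
#   <cluster> <trace-id> <start-ms> <duration-ms> <operation>

# timeline <trace-id> <span-file>...
# Prints the matching spans from all clusters, ordered by start time.
timeline() {
  local id=$1; shift
  cat "$@" | awk -v id="$id" '$2 == id' | sort -k3,3n
}

# slowest_hop <trace-id> <span-file>...
# Prints the single span with the longest duration for that trace.
slowest_hop() {
  timeline "$@" | sort -k4,4nr | head -n 1
}
```

This is the join the operator performs by hand across Jaeger tabs; the sketch only shows why a shared trace ID makes it mechanical once the spans sit in one place.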
Solo Enterprise for Istio

What the operator does at 2am

Open the Gloo UI on the mgmt cluster. Look at the service-topology view. The cross-cluster trace is already joined; per-hop latency is on the edge.

# One UI, fed by the mgmt plane's telemetry pipeline.
# Service topology across the whole mesh, with per-edge latency.
# Cluster-health view: which agents are connected, which aren't.
# Aggregated metrics keyed by service, namespace, AND cluster.

# Same Prometheus data underneath — the difference is that the
# mgmt plane federates it for you.

kubectl --context mgmt -n gloo-mesh port-forward svc/gloo-mesh-ui 8090:8090
# Open http://localhost:8090
Real cost: one UI to teach to L2 on-call. The Grafana / Prometheus stack is still there underneath if you want it; the mgmt UI sits on top.

Architecture comparison

The visual story behind why the manual path doesn't scale: every edge in the manual diagram is a cross-cluster kubectl apply the operator has to do (and re-do, and audit). In the managed diagram every edge is automated.

Manual — complete graph: N×(N−1) trust + secret exchange edges

            ┌──────────────┐                ┌──────────────┐
            │  cluster-1   │ ◄────────────► │  cluster-2   │
            │  istiod-gloo │   remote-sec   │  istiod-gloo │
            │  east-west   │   peer-bundle  │  east-west   │
            └──────┬───────┘                └──────┬───────┘
                   │                               │
                   │   remote-sec, peer-bundle     │
                   ▼                               ▼
            ┌──────────────────────────────────────────────┐
            │              cluster-3                       │
            │  istiod-gloo, east-west GW, cacerts Secret   │
            └──────────────────────────────────────────────┘

  Each ◄────► link is TWO directed edges: one kubectl apply in each direction,
  plus the cert material to back them.
  For N clusters the operator owns N*(N-1) directed edges.
  Adding cluster-4: 6 NEW directed edges (2 per existing peer).
  Rotating root CA: every cacerts Secret on every cluster, in lockstep.

Solo Enterprise for Istio — star: N workload clusters register with one mgmt plane

       ┌────────────────────────────────────────────────────┐
       │   Solo Enterprise for Istio management plane       │
       │                                                    │
       │   KubernetesCluster CRs                            │
       │   RootTrustPolicy                                  │
       │   AccessPolicy / Workspace                         │
       │   Segment  (Ambient federation)                    │
       │   Gloo UI  (federated metrics)                     │
       └──────┬───────────────┬───────────────┬─────────────┘
              │               │               │
              ▼               ▼               ▼
       ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
       │ cluster-1   │  │ cluster-2   │  │ cluster-3   │
       │ gloo-mesh-  │  │ gloo-mesh-  │  │ gloo-mesh-  │
       │ agent       │  │ agent       │  │ agent       │
       │ istiod-gloo │  │ istiod-gloo │  │ istiod-gloo │
       │ east-west   │  │ east-west   │  │ east-west   │
       └─────────────┘  └─────────────┘  └─────────────┘

  Each line is one mTLS pipe the mgmt plane owns end-to-end.
  Adding cluster-4: ONE new edge (mgmt → cluster-4).
  Rotating root CA: ONE RootTrustPolicy change, mgmt plane rolls it.
  Data-plane (cluster-to-cluster HBONE) still flows direct — same data plane,
  the mgmt plane is for trust + policy + audit, not for proxying traffic.
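
The two diagrams differ mainly in how the edge count grows. A quick sketch of the arithmetic (function names are illustrative):

```shell
# Directed trust/secret edges the operator maintains by hand: N*(N-1).
manual_edges() { echo $(( $1 * ($1 - 1) )); }
# Agent-to-mgmt connections in the star topology: N.
star_edges()   { echo "$1"; }

for n in 3 4 10; do
  echo "N=$n manual=$(manual_edges "$n") star=$(star_edges "$n")"
done
# N=3 manual=6 star=3
# N=4 manual=12 star=4
# N=10 manual=90 star=10
```

The gap between the two columns is the manual work that the management plane absorbs as the mesh grows.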

When to use which

The manual path is genuinely fine for some real use-cases; the management plane isn't always the right answer. Below is the decision matrix: find the rows that describe your situation and see which column they sit in.

| Use manual when… | Use Solo Enterprise for Istio when… |
| --- | --- |
| Single-cluster mesh (peering doesn't apply; no Day-2 multi-cluster surface to manage) | More than 2 clusters in the same mesh; the operational cost crosses over fast |
| POC, learning, internal demo where the value is "see the plumbing" | Production with compliance / audit requirements (SOC 2, PCI, DORA, HIPAA). The audit trail is the feature |
| Two clusters that won't change for the lifetime of the project | Clusters added / removed on a normal Day-2 cadence (regions, M&A, tenants) |
| You have a strong PKI team running cert-manager + Vault and can write the controllers that distribute intermediates yourself | You want certificate lifecycle managed for you: rotation, revocation, Vault / AWS PCA integration, and rotation reporting |
| Your goal is "show my SREs how the *.mesh.internal hostname works" and stop there | You want one policy (AccessPolicy, Workspace) to govern N clusters from one source of truth |
| You're rebuilding the cluster every Friday because it's a dev environment and forever is "until 5pm" | You need to answer "who registered cluster X, when, with what trust domain" in one kubectl get |
A simple way to decide. Think about which three questions you'd ask your platform team first about a multi-cluster mesh. Same data plane underneath either way. The choice is about operator experience and production safety, not about the mesh's runtime behaviour.

How you'd actually install it

This section walks through the install as it ships on Solo Enterprise for Istio 2.12.4. The shape is the same in production (on cloud Kubernetes or on-prem); local-dev with kind has one extra step around east-west GW exposure, called out explicitly below.

About — what this does & why

What: A two-step install — first the management plane goes onto its own cluster (typically a dedicated, low-traffic cluster, since it's a control plane in its own right), then each workload cluster registers with it via its relay agent connecting back to the mgmt-server.

Why: The split keeps the blast radius of management-plane changes off your workload clusters. The relay-agent on each workload cluster is the only mgmt-plane component that lives next to your apps.
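The commands in this section reference two kube-context variables, a license key, and a Helm repo alias that are never defined on this page. A minimal sketch of that setup follows. The context names are placeholders borrowed from the example output further down, and the repo URL is an assumption taken from Solo's public install docs, so confirm both before relying on them.

```shell
# Sketch only: context names are placeholders and the repo URL is an
# assumption; verify both against your environment and the Solo docs.
setup_install_env() {
  # Kube contexts used as $MGMT / $WORKLOAD by the install commands.
  export MGMT="${MGMT:-mgmt}"
  export WORKLOAD="${WORKLOAD:-prod-eu-west-3}"

  # Stop early if the license key has not been exported.
  : "${GLOO_MESH_LICENSE_KEY:?export GLOO_MESH_LICENSE_KEY before installing}"

  # Register the chart repo under the alias the helm commands expect.
  helm repo add gloo-platform https://storage.googleapis.com/gloo-platform/helm-charts
  helm repo update
}
```

Calling setup_install_env once per shell session is enough; it leaves any MGMT or WORKLOAD value you have already exported untouched.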

# 1. CRDs on EVERY cluster (mgmt + workload). installEnterpriseCrds=true is
#    what brings in AccessPolicy / Workspace / WorkspaceSettings — the CRs the
#    mgmt plane uses to translate central policy into per-cluster
#    AuthorizationPolicy. The Solo documented Ambient pattern uses
#    installEnterpriseCrds=false; do NOT use that pattern if you want
#    centralised RBAC — flip to true and resolve any agentgateway-crds
#    co-install conflict (the two charts collide on authconfigs +
#    ratelimitconfigs; uninstall enterprise-agentgateway on the mgmt cluster
#    or keep agentgateway off the mgmt-plane cluster entirely).
helm upgrade -i gloo-platform-crds gloo-platform/gloo-platform-crds \
  --kube-context $MGMT -n gloo-mesh --create-namespace \
  --version 2.12.4 \
  --set installEnterpriseCrds=true \
  --set featureGates.ConfigDistribution=true     # mgmt only

helm upgrade -i gloo-platform-crds gloo-platform/gloo-platform-crds \
  --kube-context $WORKLOAD -n gloo-mesh --create-namespace \
  --version 2.12.4 \
  --set installEnterpriseCrds=true

# 2. Management plane (mgmt-server + UI + Prometheus + relay + cert-gen job)
#    on the mgmt cluster. The chart generates the relay-*-tls-secret family
#    on first install; never partial-rollout these (an out-of-sync chain
#    across relay-root / relay-server / relay-client breaks the agent
#    handshake — clean uninstall + reinstall is the production-safe recovery).
helm upgrade -i gloo-platform gloo-platform/gloo-platform \
  --kube-context $MGMT -n gloo-mesh \
  --version 2.12.4 -f mgmt-values.yaml \
  --set licensing.glooMeshLicenseKey=$GLOO_MESH_LICENSE_KEY

# 3. Each workload cluster: cross-apply the relay TLS secrets from the mgmt
#    cluster, then install the agent. The agent dials the mgmt-server over
#    mTLS using those secrets — its serverAddress points at the mgmt-server
#    Service IP (LoadBalancer recommended for cross-host reach).
kubectl --context $MGMT -n gloo-mesh get secret relay-root-tls-secret \
  -o yaml | kubectl --context $WORKLOAD -n gloo-mesh apply -f -
# (repeat for relay-identity-token-secret; do NOT copy relay-client-tls-secret
#  verbatim, mint one per workload cluster as described in install detail 4 below)

helm upgrade -i gloo-platform gloo-platform/gloo-platform \
  --kube-context $WORKLOAD -n gloo-mesh \
  --version 2.12.4 -f agent-values.yaml \
  --set licensing.glooMeshLicenseKey=$GLOO_MESH_LICENSE_KEY

# 4. Register the workload cluster on the mgmt plane (one CR).
kubectl --context $MGMT apply -f - <<'EOF'
apiVersion: admin.gloo.solo.io/v2
kind: KubernetesCluster
metadata: { name: prod-eu-west-3, namespace: gloo-mesh }
spec:    { clusterDomain: cluster.local }
EOF

# 5. Verify on the mgmt cluster:
kubectl --context $MGMT -n gloo-mesh get kubernetescluster
# NAME              STATUS
# mgmt              ACCEPTED
# prod-eu-west-3    ACCEPTED
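In automation, step 5 is usually a wait rather than a one-off look. The sketch below parses the table output shown above and succeeds only when every listed cluster reports ACCEPTED. It assumes the two-column NAME / STATUS layout from the example; if your version prints extra columns, adjust the field index.

```shell
# Reads `kubectl get kubernetescluster` table output on stdin. Succeeds only
# if at least one cluster is listed and every STATUS value is ACCEPTED.
# Assumes STATUS is the second column, as in the example output.
all_clusters_accepted() {
  awk 'NR > 1 { seen = 1; if ($2 != "ACCEPTED") bad = 1 }
       END    { if (seen && !bad) exit 0; exit 1 }'
}

# Polls the mgmt cluster every 5 seconds, giving up after about 2 minutes.
wait_for_clusters_accepted() {
  tries=0
  until kubectl --context "$MGMT" -n gloo-mesh get kubernetescluster | all_clusters_accepted; do
    tries=$((tries + 1))
    if [ "$tries" -ge 24 ]; then
      echo "clusters not ACCEPTED after about 2 minutes" >&2
      return 1
    fi
    sleep 5
  done
}
```

Parsing the human-readable table keeps the sketch tied to what this page actually shows; a status field read via jsonpath would be more robust, but its exact path is not documented here.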
About — four install details worth knowing up front

The shape of the install is straightforward. Four details are worth knowing before you start, because they cross multiple layers of the stack and aren't obvious from any one README:

  1. Prefer Helm directly over meshctl install --profile gloo-mesh-mgmt for the mgmt+agent topology. The combined profile is intended for single-cluster demos; for a production multi-cluster install with explicit values, helm install gloo-platform (as shown above) gives you the same result with fewer layers and a values file you can commit to git. This is also the path that super-quick.sh in this repo codifies.
  2. Set the mgmt-server Service to LoadBalancer when the agent runs in a different cluster. The chart's default Service type is ClusterIP, which is sufficient when the agent runs in the mgmt cluster. The moment a workload cluster needs to reach the relay across a network boundary, the agent needs a routable target — LoadBalancer (MetalLB on kind, a real cloud LB in production) or an Ingress in front of port 9900.
  3. The relay-*-tls-secret family must stay in lockstep. relay-root-tls-secret, relay-server-tls-secret, relay-client-tls-secret, relay-tls-signing-secret, and relay-identity-token-secret are generated by the chart's cert-gen Job on first install and form one cryptographic chain. The mgmt-server reads them by name via command-line flags rather than by volume mount, so a kubectl rollout restart won't re-pick them up if one drifts. If the agent handshake starts failing with "authentication handshake failed: EOF", the production-safe recovery is helm uninstall gloo-platform followed by a fresh helm install so the cert-gen Job re-runs and produces the whole family atomically.
  4. Mint a per-cluster relay-client-tls-secret; don't copy the mgmt cluster's. The chart's cert-gen Job bakes the mgmt cluster's name into the certificate as CN=<mgmt-cluster> with a matching SAN. If you cross-apply that secret verbatim to a workload cluster, the mgmt-server's mTLS handshake sees the wrong CN and bins that workload's inventory under the mgmt cluster's name — the Workspace ends up with numSelectedClusters: 1, the AccessPolicy translation never lands on the workload, and you'll spend a long afternoon chasing it. The mgmt-server's relay-tls-signing-secret is the CA that should mint per-cluster client certs; that's what meshctl cluster register does internally. super-quick.sh performs the equivalent step inline with openssl so the workload cluster's relay-client-tls-secret carries CN=<workload-cluster> as it should.
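Detail 4 can be sketched with plain openssl. This is not the contents of super-quick.sh or of meshctl cluster register; it only illustrates the idea of signing a client certificate whose CN and SAN carry the workload cluster's name. The function name, file layout, key size, and validity period are all assumptions made for illustration.

```shell
# usage: mint_relay_client_cert <workload-cluster> <ca.crt> <ca.key> <out-dir>
# Writes <out-dir>/tls.key and <out-dir>/tls.crt with CN=<workload-cluster>
# and a matching DNS SAN, signed by the supplied CA pair.
# Hypothetical helper: 2048-bit RSA and 365 days are illustrative choices.
mint_relay_client_cert() {
  cluster="$1"; ca_crt="$2"; ca_key="$3"; out="$4"
  mkdir -p "$out" || return 1

  # Private key and CSR carrying the workload cluster's name as CN.
  openssl genrsa -out "$out/tls.key" 2048 2>/dev/null || return 1
  openssl req -new -key "$out/tls.key" -subj "/CN=${cluster}" \
    -out "$out/client.csr" || return 1

  # Extensions: SAN matching the CN, restricted to client authentication.
  printf 'subjectAltName=DNS:%s\nextendedKeyUsage=clientAuth\n' "$cluster" \
    > "$out/ext.cnf"

  # Sign the CSR with the CA; the serial file is kept inside <out-dir>.
  openssl x509 -req -in "$out/client.csr" -CA "$ca_crt" -CAkey "$ca_key" \
    -CAserial "$out/ca.srl" -CAcreateserial -days 365 -sha256 \
    -extfile "$out/ext.cnf" -out "$out/tls.crt" 2>/dev/null
}
```

In practice the CA pair would be extracted from relay-tls-signing-secret on the mgmt cluster and the result stored as relay-client-tls-secret on the workload cluster. The data key names inside those Secrets are not shown on this page, so inspect an existing Secret before wiring this up.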

Cross-cluster peering on Ambient: auto vs manual

Once both clusters are registered with the mgmt plane, they need data-plane peering — the istio-remote Gateway in each cluster that points at the other's east-west GW so HBONE traffic can flow. In Solo Enterprise for Istio 2.12 there are two paths to wire this. Both ride on the same shared root CA + per-cluster intermediates (in the cacerts Secret) that the manual lab on the left already establishes — Ambient's trust pipeline is bootstrapped outside the mgmt plane, which is also why the older RootTrustPolicy CR is documented as superseded in the Ambient install guide.

Auto-peering (preview)

One feature gate, mgmt plane wires it

Set featureGates.ConfigDistribution=true on the mgmt-cluster gloo-platform release and PEERING_AUTOMATIC_LOCAL_GATEWAY=true on every istiod. The mgmt server picks up each cluster's local east-west Gateway, replicates a matching istio-remote peer Gateway into every other registered cluster, and keeps them in sync as clusters come and go.

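# --reuse-values keeps the values from the original install (mgmt-values.yaml,
# license key); with --set alone, everything else falls back to chart defaults.
helm upgrade gloo-platform gloo-platform/gloo-platform \
  --kube-context $MGMT -n gloo-mesh \
  --version 2.12.4 --reuse-values \
  --set featureGates.ConfigDistribution=true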

kubectl --context $C -n istio-system patch deploy istiod-gloo --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-",
        "value":{"name":"PEERING_AUTOMATIC_LOCAL_GATEWAY","value":"true"}}]'

Quote from the Solo use-case (docs.solo.io, multi-cluster/automatic-peering-ambient-multicluster.md): "This feature is beta. Do not use it in production." Treat it as a preview that is expected to become the default. It suits greenfield clouds where every east-west Gateway's LoadBalancer EXTERNAL-IP is routable from every peer cluster, which is the assumption the feature bakes in.

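The JSON-patch form of the istiod change appends a new env entry each time it runs, so re-running it leaves duplicates. A minimal alternative sketch uses kubectl set env, which updates the variable in place, and loops over the contexts you pass in. The function name is hypothetical; the deployment name istiod-gloo is the one used in the patch command for this section.

```shell
# usage: enable_auto_peering <context> [<context> ...]
# Sets the flag on istiod in each context and waits for the rollout.
# Safe to re-run: `kubectl set env` overwrites an existing variable.
enable_auto_peering() {
  for ctx in "$@"; do
    kubectl --context "$ctx" -n istio-system set env deploy/istiod-gloo \
      PEERING_AUTOMATIC_LOCAL_GATEWAY=true || return 1
    kubectl --context "$ctx" -n istio-system rollout status deploy/istiod-gloo \
      --timeout=120s || return 1
  done
}
```

Either form edits the live Deployment, so a later Helm upgrade of istiod is likely to revert it. For anything longer-lived than a lab, carry the variable in the istiod install values instead.
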
Manual peer Gateway CRs (GA)

One Gateway per peer, explicit addresses

Apply an istio-remote Gateway in each cluster's istio-eastwest namespace for each peer, with the peer's east-west GW address hard-coded. This is the GA pattern (multi-cluster/east-west-gateway-peering.md) and the one that works regardless of whether peer addresses are LoadBalancer IPs, NodePort, or — in our kind-on-two-Macs case — a LAN IP republished by socat.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: manual-peer-west-mini
  namespace: istio-eastwest
  annotations:
    # service-account MUST match the peer's east-west GW ServiceAccount;
    # SPIFFE SAN verification uses this value.
    gateway.istio.io/service-account: istio-eastwest
    gateway.istio.io/trust-domain: cluster.local
  labels:
    topology.istio.io/cluster: west-mini
    topology.istio.io/network: west-mini
spec:
  gatewayClassName: istio-remote
  addresses:
    # Peer's east-west GW, LAN-routable IP.
    - { type: IPAddress, value: 192.168.1.18 }
  listeners:
    - { name: cross-network, port: 15008, protocol: HBONE, tls: { mode: Passthrough }, allowedRoutes: { namespaces: { from: Same } } }
    - { name: xds-tls,       port: 15012, protocol: TLS,   tls: { mode: Passthrough }, allowedRoutes: { namespaces: { from: Same } } }

The annotation is load-bearing. Without gateway.istio.io/service-account: istio-eastwest, the ztunnel client expects the SPIFFE SAN to derive from the Gateway's name rather than from the peer's actual east-west GW ServiceAccount, and mTLS fails with "peer did not present the expected SAN". It's a one-line workaround for what is otherwise the cleanest path.

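With more than one peer, hand-editing a copy of this manifest per cluster gets error-prone. The sketch below templates the same Gateway from a peer name and address. It assumes, as in the example, that the network label equals the cluster name and that every peer's east-west ServiceAccount is istio-eastwest; the function name is hypothetical.

```shell
# usage: render_peer_gateway <peer-cluster> <peer-eastwest-address> [trust-domain]
# Prints an istio-remote peer Gateway manifest on stdout.
render_peer_gateway() {
  peer="$1"; addr="$2"; td="${3:-cluster.local}"
  cat <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: manual-peer-${peer}
  namespace: istio-eastwest
  annotations:
    gateway.istio.io/service-account: istio-eastwest
    gateway.istio.io/trust-domain: ${td}
  labels:
    topology.istio.io/cluster: ${peer}
    topology.istio.io/network: ${peer}
spec:
  gatewayClassName: istio-remote
  addresses:
    - { type: IPAddress, value: ${addr} }
  listeners:
    - { name: cross-network, port: 15008, protocol: HBONE, tls: { mode: Passthrough }, allowedRoutes: { namespaces: { from: Same } } }
    - { name: xds-tls,       port: 15012, protocol: TLS,   tls: { mode: Passthrough }, allowedRoutes: { namespaces: { from: Same } } }
EOF
}
```

For example, render_peer_gateway west-mini 192.168.1.18 piped into kubectl --context $C apply -f - reproduces the manifest shown above in cluster $C.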
Local dev caveat: kind on two Macs. Both peering paths rely on each cluster's east-west Gateway being reachable from its peers via the address advertised in .status.loadBalancer.ingress[0]. On kind that address is a Docker-bridge IP (e.g. 172.18.255.100), which is not routable from another physical host. For local-dev scenarios where you want to mirror the production topology on two laptops, the repo includes helpers: scripts/expose-ew-on-host.sh republishes the east-west GW on the host's LAN IP via a socat container, and scripts/super-quick.sh chains that helper with the manual peer-Gateway pattern (LAN IPs + the gateway.istio.io/service-account annotation) to stand the whole topology up in one command:
./scripts/super-quick.sh                       # prompts for SSH user / host
./scripts/super-quick.sh --user <user> --host <host>   # non-interactive

In a cloud environment (AKS, EKS, GKE) this step is unnecessary — the cloud LoadBalancer EXTERNAL-IP is already cross-cluster routable, so either the manual peering pattern (with the LB IP) or the auto-peering preview works directly.
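Whichever environment you are in, it is worth confirming that the advertised address really is reachable from the peer's side before debugging peering itself. A rough sketch follows, assuming the east-west Service is named istio-eastwest in the istio-eastwest namespace (an assumption; adjust to your install) and that nc is available on the host you test from.

```shell
# usage: ew_gateway_address <context>
# Prints the IP or hostname advertised in .status.loadBalancer.ingress[0].
# The Service name istio-eastwest is an assumption.
ew_gateway_address() {
  kubectl --context "$1" -n istio-eastwest get svc istio-eastwest \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}'
}

# usage: check_ew_reachable <context>
# Run from a host on the PEER's network. Probes the HBONE (15008) and
# xDS (15012) ports that the peer Gateway listeners use.
check_ew_reachable() {
  addr="$(ew_gateway_address "$1")"
  if [ -z "$addr" ]; then
    echo "no LoadBalancer address advertised for $1" >&2
    return 1
  fi
  for port in 15008 15012; do
    if ! nc -z -w 3 "$addr" "$port"; then
      echo "$addr:$port unreachable" >&2
      return 1
    fi
  done
  echo "$addr reachable on 15008 and 15012"
}
```

A TCP probe from the host only shows that the address routes; it does not prove that pods in the peer cluster can reach it, and it says nothing about mTLS.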

Read the docs before you build. Always check docs.solo.io/gloo-mesh-enterprise for the current install steps: the install profiles, Helm chart names, CRD field shapes, and feature-gate maturity (auto-peering is expected to go GA, and the manual override may be deprecated) evolve across Solo Enterprise for Istio minor versions. As of 2.12 the management-plane charts are gloo-platform and gloo-platform-crds in the gloo-mesh namespace.

Where to next