How to collect these metrics
# Port-forward to the istiod metrics endpoint
kubectl -n istio-system port-forward svc/istiod-gloo 15014:15014 &
curl -s http://localhost:15014/metrics
kill %1
Everything below comes from http://istiod-gloo.istio-system.svc:15014/metrics.
For anything beyond ad-hoc inspection, drop a ServiceMonitor on port 15014
and let Prometheus do the scraping. The YAML for that is right below.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod-gloo
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
      istio.io/rev: gloo
  endpoints:
    - port: http-monitoring  # port 15014
      interval: 15s
      path: /metrics
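For ad-hoc inspection, the full /metrics dump is noisy. A minimal sketch of a filter that pulls one metric family out of the scrape text is shown here; `metric_lines` is a hypothetical helper name, and it assumes the standard Prometheus text exposition format on stdin.

```shell
# metric_lines NAME: print the samples of one metric family from
# Prometheus text on stdin. Comment lines (# HELP / # TYPE) are skipped
# and NAME must match exactly, with or without a {label} set.
metric_lines() {
  awk -v name="$1" '
    /^#/ { next }
    {
      # The metric name ends at the first "{" or space.
      n = $0
      sub(/[{ ].*/, "", n)
      if (n == name) print
    }'
}

# Usage, assuming the port-forward to localhost:15014 is running:
#   curl -s http://localhost:15014/metrics | metric_lines pilot_xds_pushes
```

The exact-name match matters because many istiod metrics share prefixes (pilot_xds versus pilot_xds_pushes), so a plain grep would return both.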
xDS protocol types — what wds and wads mean
In Ambient mode the old sidecar xDS types (LDS / RDS / CDS / EDS) are replaced for the
data plane by two Ambient-specific APIs — so on a pure Ambient cluster you'll
only see wds and wads in push metrics.
If the classic types are showing up too, there are still sidecars somewhere in the mesh.
wds: Workload Discovery Service
Pushes workload identity and address state to ztunnel: pod IPs, SPIFFE IDs, service VIPs, and endpoint health. ztunnel on every node subscribes to WDS to build its HBONE tunnel routing table. Replaces EDS + CDS for the Ambient data plane.
wads: Workload Authorization Discovery Service
Pushes authorization policy state (the istio.security.Authorization type in upstream Istio) to ztunnel, which enforces it at L4 for the workloads on its node. Waypoint addresses and service-to-waypoint bindings are not carried here; in upstream Istio they are fields on the workload and service objects delivered over WDS.
lds / rds / cds / eds: Classic sidecar xDS types
Listener / Route / Cluster / Endpoint Discovery. Only present if sidecar Envoy proxies are connected (non-Ambient workloads). In a pure Ambient cluster these counters stay at zero.
istio.workload.Address: WDS payload type
The protobuf type pushed over WDS. The pilot_xds_config_size_bytes
histogram carries it in its type label, which lets you track how large each WDS push is as
your workload count grows (expect ~500B–1KB per workload).
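To check which xDS types a given istiod is actually pushing, the per-type counters can be listed from the scrape text. This is a sketch only: `push_types` is a hypothetical helper, and it assumes each pilot_xds_pushes series carries a type label and no trailing timestamp.

```shell
# push_types: read Prometheus text on stdin and print "type count" for
# every pilot_xds_pushes series whose counter is non-zero.
push_types() {
  awk '
    /^pilot_xds_pushes[{]/ {
      if (match($0, /type="[^"]*"/)) {
        # Strip the leading type=" (6 chars) and the closing quote.
        t = substr($0, RSTART + 6, RLENGTH - 7)
        if ($NF + 0 > 0) print t, $NF
      }
    }'
}

# Usage, assuming the port-forward to localhost:15014 is running:
#   curl -s http://localhost:15014/metrics | push_types
```

Any lds, rds, cds, or eds row in the output means some Envoy-based proxy is connected to this istiod.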
Certificate & CA health
istiod is the mesh CA, so it knows exactly when its own certs expire — and it tells you. Root cert expiry is the metric to care about. Rotating a root CA mid-flight is the kind of operation you want to plan and rehearse, not improvise: every intermediate has to be re-issued, and any workload that doesn't pick up the rotation cleanly becomes a P1. Wire the 30-day alert in. Earlier if you can.
| Metric | Type | What it measures | Alert when |
|---|---|---|---|
| citadel_server_root_cert_expiry_seconds | gauge | Seconds until the root CA cert expires. Negative = already expired. | < 2592000 (30 days) |
| citadel_server_root_cert_expiry_timestamp | gauge | Unix timestamp of root cert expiry; useful for dashboards. | < time() + 2592000 |
| citadel_server_cert_chain_expiry_seconds | gauge | Seconds until the istiod-issued workload cert chain expires. | < 86400 (1 day) |
| citadel_server_cert_chain_expiry_timestamp | gauge | Unix timestamp of workload cert chain expiry. | < time() + 86400 |
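The thresholds in the table are raw seconds, which are hard to read at a glance. The sketch below converts an expiry gauge to whole days and checks it against a day-based threshold. Both helper names are hypothetical, and it assumes the citadel gauges are exposed without labels.

```shell
# cert_days_left METRIC: read Prometheus text on stdin and print the
# whole days remaining for METRIC (an *_expiry_seconds gauge).
cert_days_left() {
  awk -v name="$1" '$1 == name { printf "%d\n", $2 / 86400 }'
}

# expiring_within METRIC DAYS: exit 0 if METRIC is present and expires
# in fewer than DAYS days, exit 1 otherwise.
expiring_within() {
  awk -v name="$1" -v days="$2" '
    $1 == name { found = 1; soon = ($2 < days * 86400) }
    END { exit ((found && soon) ? 0 : 1) }'
}

# Usage, assuming the port-forward to localhost:15014 is running:
#   curl -s http://localhost:15014/metrics \
#     | cert_days_left citadel_server_root_cert_expiry_seconds
```

With DAYS set to 30, expiring_within applies the same 2592000-second cutoff as the table.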
Multi-cluster connectivity
If you're picking one metric to know whether multicluster is currently working, pick
istiod_managed_clusters. A value of 0 for cluster_type="remote" on a cluster that should have
peers means istiod has lost the xDS link to the other side and cross-cluster endpoint
rewriting has stopped; cross-cluster calls will start failing if they haven't already.
Page on this one.
| Metric | Type | What it measures | Alert when |
|---|---|---|---|
| istiod_managed_clusters{cluster_type="local"} | gauge | Always 1: confirms this istiod is managing its own cluster. | != 1 |
| istiod_managed_clusters{cluster_type="remote"} | gauge | Number of remote clusters this istiod has live xDS connections to. Drops to 0 when the remote secret is missing or the peer istiod is unreachable. | < expected peer count |
| istiod_uptime_seconds | gauge | Seconds since istiod started. Frequent low values = crash-looping. | resets unexpectedly |
| istio_build{tag="..."} | gauge | Always 1; the tag label carries the Solo Istio version string. | version changes unexpectedly |
If the remote count drops to 0, run
kubectl -n istio-system get secret | grep istio-remote-secret
to confirm the secret is present, then
kubectl -n istio-system logs deploy/istiod-gloo | grep -i "peer\|remote\|delta"
for the underlying error.
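The "< expected peer count" condition can also be checked by hand against the scrape text. A rough sketch follows; `remote_peer_gap` is a hypothetical helper, and a missing series is treated as zero connected peers.

```shell
# remote_peer_gap EXPECTED: read Prometheus text on stdin and print how
# many of the EXPECTED remote clusters are currently not connected.
remote_peer_gap() {
  awk -v want="$1" '
    /^istiod_managed_clusters[{]/ && /cluster_type="remote"/ { have = $NF }
    END { gap = want - have; print (gap > 0 ? gap : 0) }'
}

# Usage, assuming the port-forward to localhost:15014 is running and
# this cluster should have 2 peers:
#   curl -s http://localhost:15014/metrics | remote_peer_gap 2
```

A non-zero result is the cue to run the secret and log checks.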
Proxy sync & xDS push health
These metrics catch one of the worst failure modes in Ambient: the waiting for sync deadlock. One misbehaving client fails auth, istiod's discovery filter clams up, and every xDS push stops. The counters flatline, the push time histograms quit moving, and workloads keep running on stale config — silently, often for hours. This is the section where the alerts earn their keep.
| Metric | Type | What it measures | Alert when |
|---|---|---|---|
| pilot_xds{version="..."} | gauge | Live xDS connections right now. In a pure Ambient cluster expect 2× ztunnel pods + N waypoints. | drops unexpectedly |
| pilot_proxy_convergence_time_count | histogram counter | Total proxy config pushes that successfully completed (proxy ACK'd). Stops incrementing when the discovery filter is blocked. | rate(…[5m]) == 0 while pilot_xds > 0 |
| pilot_proxy_convergence_time_sum / _count | histogram | Average time from config change to proxy ACK. In a healthy kind cluster <10ms. Above 1s signals push queue pressure. | avg > 1s |
| pilot_proxy_queue_time_count | histogram counter | Proxies dequeued from the push queue. Should match convergence count in a healthy cluster. | diverges from convergence count |
| pilot_xds_pushes{type="wds"} | counter | Total WDS pushes (workload address updates to ztunnel). Stalls when the discovery filter is blocked. | rate(…[5m]) == 0 |
| pilot_xds_pushes{type="wads"} | counter | Total WADS pushes to ztunnel. Stalls when the discovery filter is blocked. | rate(…[5m]) == 0 |
| pilot_xds_push_time_sum / _count{type="wds"} | histogram | Average time to generate and send a WDS push. Should be <5ms on kind. | avg > 500ms |
| pilot_xds_push_time_sum / _count{type="wads"} | histogram | Average time to generate and send a WADS push. | avg > 500ms |
| pilot_xds_recv_max | gauge | Largest xDS request (ACK/NACK) received from any client, in bytes. Useful for detecting unexpectedly large proxy state. | none |
| pilot_xds_config_size_bytes{type="istio.workload.Address"} | histogram | Distribution of WDS push payload sizes. Grows linearly with workload count (~500B–1KB per workload). Watch for sudden spikes. | sudden spike > 2×baseline |
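The "_sum / _count" rows are averages derived from a histogram's two aggregate series. As a sketch of that arithmetic on a single scrape (`hist_avg` is a hypothetical helper), note that this gives the average since istiod started, not the 5-minute windowed average the alert rules compute with rate().

```shell
# hist_avg NAME [LABEL_FILTER]: read Prometheus text on stdin and print
# NAME_sum / NAME_count, summed over the series whose line contains
# LABEL_FILTER (all series if the filter is omitted). Result is in seconds.
hist_avg() {
  awk -v name="$1" -v filter="$2" '
    index($0, name "_sum") == 1 && (filter == "" || index($0, filter) > 0)   { sum += $NF }
    index($0, name "_count") == 1 && (filter == "" || index($0, filter) > 0) { cnt += $NF }
    END { if (cnt > 0) printf "%.4f\n", sum / cnt; else print "no samples" }'
}

# Usage, assuming the port-forward to localhost:15014 is running:
#   curl -s http://localhost:15014/metrics \
#     | hist_avg pilot_xds_push_time 'type="wds"'
```

A lifetime average moves slowly, so it is better suited to a sanity check than to catching a fresh regression.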
If pilot_xds_pushes is flat while pilot_xds > 0, proxies are connected but nothing is
being pushed. The most common cause is a proxy presenting the wrong CLUSTER_ID.
agentgateway waypoints in particular default to "Kubernetes" instead of the actual
cluster name, which makes istiod refuse to talk to them. Fix: patch the waypoint
Deployment with the correct CLUSTER_ID env var.
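One way to apply that fix is kubectl set env. The wrapper below is a sketch under assumptions: `set_cluster_id` is a hypothetical helper, the namespace, Deployment, and cluster names in the usage line are placeholders, and a controller that owns the waypoint Deployment may overwrite a manual change.

```shell
# set_cluster_id NAMESPACE DEPLOYMENT CLUSTER_NAME
# Sets CLUSTER_ID on every container of the Deployment.
# Override KUBECTL (for example KUBECTL=echo) to preview the command
# instead of running it.
set_cluster_id() {
  ${KUBECTL:-kubectl} --namespace "$1" set env "deployment/$2" "CLUSTER_ID=$3"
}

# Usage with placeholder names:
#   KUBECTL=echo set_cluster_id my-namespace my-waypoint east-ag   # preview
#   set_cluster_id my-namespace my-waypoint east-ag                # apply
```

Changing the pod template triggers a rollout, after which the waypoint should reconnect with the new cluster name.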
Push pipeline & config churn
| Metric | Type | What it measures | Alert when |
|---|---|---|---|
| pilot_push_triggers{type="ambient"} | counter | Push batches triggered by Ambient-specific config changes (ztunnel labels, waypoint changes). | unexpected spike |
| pilot_push_triggers{type="endpoint"} | counter | Push batches from endpoint changes (pod start/stop, rolling deploys). | sustained high rate outside deploy windows |
| pilot_push_triggers{type="global"} | counter | Full mesh-wide push triggers, fired on config changes that affect all proxies (new Policy, new Service). High rate = config storm. | rate > 5/min |
| pilot_debounce_time_sum / _count | histogram | Average time config changes are held in the debounce window before being merged into a single push. High values mean a config storm is generating rapid successive changes. | avg > 1s |
| pilot_pushcontext_init_seconds | histogram | Time to fully rebuild the push context (mesh-wide config snapshot). High values indicate a slow Kubernetes API or very large config surface. | avg > 1s |
| pilot_inbound_updates{type="config"} | counter | Config object change events received from Kubernetes (Gateway, HTTPRoute, Policy, etc.). | rate spike outside deploy windows |
| pilot_inbound_updates{type="eds"} | counter | Endpoint slice change events; spikes on rolling deploys, stays low in steady state. | sustained >10/s |
| pilot_services | gauge | Total services known to istiod (K8s Services + ServiceEntries). Unexpected drop = services removed or istiod lost its K8s watch. | drops unexpectedly |
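Thresholds such as "rate > 5/min" are stated per minute, while PromQL's rate() returns a per-second value. The conversion is simple arithmetic, sketched here for two manual samples of a counter. `per_minute` is a hypothetical helper and it does not handle counter resets.

```shell
# per_minute PREV CUR SECONDS: increase per minute of a counter that went
# from PREV to CUR over SECONDS seconds.
per_minute() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.2f\n", (b - a) / s * 60 }'
}

# Usage: two readings of pilot_push_triggers{type="global"} taken 120s apart
#   per_minute 100 110 120
```

Going the other way, 5 pushes per minute is roughly 0.083 per second, so a PromQL threshold of 0.08 on rate() corresponds to about 4.8 per minute.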
Errors & config conflicts
| Metric | Type | What it measures | Alert when |
|---|---|---|---|
| endpoint_no_pod | gauge | Endpoint addresses with no matching pod (stale endpoints from crashed pods). Should be 0 in steady state. | > 0 for > 60s |
| pilot_eds_no_instances | gauge | EDS clusters (services) with zero endpoints; traffic to these services will 503. | > 0 |
| pilot_endpoint_not_ready | gauge | Endpoints in unready state (readiness probe failing). Istiod excludes these from traffic; confirm readiness probe config if persistently >0. | > 0 for > 120s |
| pilot_no_ip | gauge | Pods not in the endpoint table (pod/endpoint sync lag or evicted pods). Should be 0. | > 0 |
| pilot_conflict_inbound_listener | gauge | Conflicting inbound listener configurations: two services fighting for the same port. Traffic will be misrouted. | > 0 |
| pilot_conflict_outbound_listener_tcp_over_current_tcp | gauge | Conflicting outbound TCP listeners. Indicates ServiceEntry or Service port collisions. | > 0 |
| pilot_duplicate_envoy_clusters | gauge | Duplicate Envoy cluster names caused by ServiceEntries sharing a hostname. Can cause silent traffic mis-routing. | > 0 |
| pilot_destrule_subsets | gauge | Duplicate DestinationRule subsets across rules targeting the same host. | > 0 |
| galley_validation_config_update_error{reason="Conflict"} | counter | Webhook configuration update conflicts, commonly the Gloo Operator trying to update a webhook config already owned by another controller. | > 0 and increasing |
| pilot_k8s_proxies_with_no_service_targets | counter | Proxies (typically waypoint Gateways) with no matching K8s Service targets. Expected for waypoint-style Gateways that don't back a Service directly. | none |
Process & runtime health
| Metric | Type | What it measures | Alert when |
|---|---|---|---|
| go_goroutines | gauge | Active goroutines in istiod. Normal range at startup: 800–1500. Sustained growth above 3000 indicates a goroutine leak. | > 3000 and growing |
| process_resident_memory_bytes | gauge | RSS memory. istiod typically uses 100–500 MB depending on mesh size. On kind (constrained nodes) watch for OOMKill. | > 800MB on kind |
| go_memstats_heap_alloc_bytes | gauge | Live heap allocation. Sustained growth between GC cycles indicates a memory leak. | sustained growth trend |
| process_cpu_seconds_total | counter | Cumulative CPU time. Use rate(…[5m]) for per-second load. High sustained rate during steady state = config churn or push loop. | rate > 2 cores steady-state |
| process_open_fds | gauge | Open file descriptors. Each xDS stream + K8s watch consumes an fd. Growth over time = fd leak. | > 80% of process_max_fds |
| go_sched_gomaxprocs_threads | gauge | GOMAXPROCS: number of OS threads istiod can run Go code on. Reflects CPU limit of the pod. | none |
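The "> 80% of process_max_fds" condition is a ratio of two gauges from the same scrape. A minimal sketch of that calculation is shown below; `fd_usage_pct` is a hypothetical helper and it assumes both gauges are exposed without labels.

```shell
# fd_usage_pct: read Prometheus text on stdin and print open file
# descriptors as a percentage of the process limit.
fd_usage_pct() {
  awk '
    $1 == "process_open_fds" { used = $2 }
    $1 == "process_max_fds"  { limit = $2 }
    END {
      if (limit > 0) printf "%.1f\n", used / limit * 100
      else print "unknown"
    }'
}

# Usage, assuming the port-forward to localhost:15014 is running:
#   curl -s http://localhost:15014/metrics | fd_usage_pct
```

Anything above 80.0 matches the alert condition in the table.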
Prometheus alert rules
Paste this into a PrometheusRule CR if you're on kube-prometheus-stack,
or a standalone rules.yaml if not. Every rule lines up with a metric from
the tables above. The dashboard ships with most of these signals already; the alert YAML
doesn't, so it's here.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-ambient-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: istio.cert
      interval: 60s
      rules:
        - alert: IstiodRootCertExpiryWarning
          expr: citadel_server_root_cert_expiry_seconds < 2592000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "istiod root CA cert expiring in < 30 days"
            description: "Root cert expires in {{ humanizeDuration $value }}. Root CA rotation is disruptive; plan ahead."
        - alert: IstiodRootCertExpiryCritical
          expr: citadel_server_root_cert_expiry_seconds < 604800
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "istiod root CA cert expiring in < 7 days"
            description: "Root cert expires in {{ humanizeDuration $value }}. Immediate action required."
        - alert: IstiodWorkloadCertExpiryWarning
          expr: citadel_server_cert_chain_expiry_seconds < 86400
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "istiod workload cert chain expiring in < 1 day"
            description: "Cert chain expires in {{ humanizeDuration $value }}."
    - name: istio.multicluster
      interval: 30s
      rules:
        - alert: IstiodRemoteClusterDisconnected
          expr: istiod_managed_clusters{cluster_type="remote"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "istiod has no remote cluster connections"
            description: "Cross-cluster endpoint rewriting is broken. Check istio-remote-secret-* in istio-system and peer istiod logs."
        - alert: IstiodDown
          # absent() covers the case where istiod is down or unscrapeable,
          # in which the gauge disappears instead of changing value.
          expr: |
            istiod_managed_clusters{cluster_type="local"} != 1
            or
            absent(istiod_managed_clusters{cluster_type="local"})
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "istiod is not managing its local cluster"
            description: "istiod local cluster gauge is missing or != 1. Pod may be crash-looping or metrics endpoint is unreachable."
    - name: istio.xds.sync
      interval: 30s
      rules:
        - alert: IstiodXdsPushStall
          # pilot_xds_pushes and pilot_xds carry different labels, so the two
          # sides are aggregated and joined with on() to make "and" match.
          expr: |
            sum by (type) (rate(pilot_xds_pushes{type=~"wds|wads"}[5m])) == 0
            and on()
            sum(pilot_xds) > 0
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "istiod xDS pushes have stalled"
            description: "{{ $labels.type }} push rate is 0 while proxies are still connected. Discovery filter may be blocked; check istiod logs for 'waiting for sync' and auth errors."
        - alert: IstiodProxyConvergenceSlow
          expr: |
            rate(pilot_proxy_convergence_time_sum[5m])
            /
            rate(pilot_proxy_convergence_time_count[5m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Proxy config convergence avg > 1s"
            description: "Average time for a proxy to receive and ACK config is {{ $value | humanizeDuration }}. Push queue may be overloaded."
        - alert: IstiodWdsPushSlow
          expr: |
            rate(pilot_xds_push_time_sum{type="wds"}[5m])
            /
            rate(pilot_xds_push_time_count{type="wds"}[5m]) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "WDS push time avg > 500ms"
            description: "Workload Discovery Service pushes to ztunnel are slow ({{ $value | humanizeDuration }}). May indicate large workload counts or push queue contention."
    - name: istio.config.errors
      interval: 60s
      rules:
        - alert: IstiodStaleEndpoints
          expr: endpoint_no_pod > 0
          for: 60s
          labels:
            severity: warning
          annotations:
            summary: "{{ $value }} endpoints with no backing pod"
            description: "Stale endpoint entries: crashed pod endpoints not cleaned up. Traffic to these endpoints will fail."
        - alert: IstiodEmptyService
          expr: pilot_eds_no_instances > 0
          for: 30s
          labels:
            severity: warning
          annotations:
            summary: "{{ $value }} services have zero endpoints"
            description: "Services with no endpoints will return 503. May indicate a deployment failure or misconfigured selector."
        - alert: IstiodListenerConflict
          expr: pilot_conflict_inbound_listener > 0 or pilot_conflict_outbound_listener_tcp_over_current_tcp > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "istiod listener conflicts detected"
            description: "{{ $value }} conflicting listeners. Traffic may be misrouted. Check for Service port collisions or duplicate ServiceEntries."
        - alert: IstiodConfigPushStorm
          expr: rate(pilot_push_triggers{type="global"}[5m]) > 0.08
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "istiod global push rate > 5/min"
            description: "Frequent full mesh-wide pushes indicate a config storm. Check for a controller repeatedly updating CRDs."
    - name: istio.process
      interval: 60s
      rules:
        - alert: IstiodGoroutineLeak
          expr: go_goroutines{job="istiod"} > 3000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "istiod goroutine count > 3000"
            description: "{{ $value }} goroutines. Sustained growth suggests a goroutine leak. Restart istiod if count keeps growing."
        - alert: IstiodHighMemory
          expr: process_resident_memory_bytes{job="istiod"} > 838860800
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "istiod RSS > 800MB"
            description: "{{ $value | humanize1024 }}B resident memory. On resource-constrained nodes this risks OOMKill."
        - alert: IstiodFdExhaustion
          expr: process_open_fds{job="istiod"} / process_max_fds{job="istiod"} > 0.8
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "istiod file descriptor usage > 80%"
            description: "{{ $value | humanizePercentage }} of max FDs in use. Approaching exhaustion will cause new xDS stream failures."
Diagnostic commands
Is the discovery filter stuck?
# Find the sync wait (grep past the noise to the root cause)
kubectl -n istio-system logs deploy/istiod-gloo --tail=300 \
  | grep -v "waiting for sync" \
  | grep -iE "error|auth|cluster"
# Common root cause line:
# "client claims to be in cluster \"Kubernetes\", but we only know about
# local cluster \"east-ag\" and remote clusters [west-ag]"
# Fix: patch the waypoint Deployment with CLUSTER_ID=<cluster-name>
Which proxies are connected and synced?
istioctl --context $CLUSTER1 proxy-status
# SYNCED = healthy
# STALE = config pushed but not yet ACK'd
# NOT SENT = istiod hasn't pushed config at all (stuck filter)
Is the remote cluster peering alive?
istioctl multicluster check --verbose \
  --contexts="${CLUSTER1},${CLUSTER2}"
# Checks: license, pod health, east-west gateway programmed,
# PeeringSucceeded, PeerConnected, PeerDataPlaneProgrammed
Required peering env vars — verify they're set
for CTX in "$CLUSTER1" "$CLUSTER2"; do
  echo "=== $CTX ==="
  echo "--- istiod (need PILOT_ENABLE_K8S_SELECT_WORKLOAD_ENTRIES=false) ---"
  kubectl --context "$CTX" get deploy istiod-gloo -n istio-system \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
    | grep -E "K8S_SELECT|PEERING|CLUSTER_ID|LICENSE"
  echo "--- ztunnel (need L7_ENABLED=true) ---"
  kubectl --context "$CTX" get ds ztunnel -n istio-system \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
    | grep -E "L7_ENABLED|NETWORK|CLUSTER_ID"
done
ztunnel workload and waypoint view
ZTUNNEL=$(kubectl -n istio-system get pod -l app=ztunnel -o name | head -1 | sed 's|pod/||')
# All workloads ztunnel knows about (including cross-cluster)
istioctl ztunnel-config workloads $ZTUNNEL -n istio-system
# Services and their waypoints
istioctl ztunnel-config services $ZTUNNEL -n istio-system