MastertheMesh
Solo Enterprise for Istio · Observability
How-To

Istio Ambient Metrics & Alerting

Tom O'Rourke
EMEA Field CTO · Solo.io

Field reference for the Prometheus metrics exposed by istiod-gloo that actually matter at runtime — what each one means, what shape "healthy" looks like, and the PromQL to alert on when it isn't.

istiod · :15014/metrics · WDS · WADS · Prometheus · Alertmanager · Solo Enterprise for Istio 1.29

Demo's still cooking. The full Prometheus + Alertmanager wiring is on the way. The metrics below were pulled from a live two-cluster kind setup running Solo Enterprise for Istio Ambient, and the alert YAML at the bottom of the page is ready to drop into a real cluster.

How to collect these metrics

Port-forward to the istiod metrics endpoint

kubectl -n istio-system port-forward svc/istiod-gloo 15014:15014 &
sleep 2   # give the port-forward a moment to start listening
curl -s http://localhost:15014/metrics
kill %1
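
The raw dump is long, so it helps to narrow it to one metric family at a time. A minimal sketch, assuming a POSIX shell with awk; the helper name metric_lines is hypothetical and not part of any Istio tooling.

```shell
# metric_lines NAME: read Prometheus text-format metrics on stdin and print
# only the sample lines for the metric family NAME. "# HELP" / "# TYPE"
# comments and longer metric names that share the prefix are skipped.
metric_lines() {
  awk -v name="$1" '
    $0 !~ /^#/ && index($0, name) == 1 {
      next_char = substr($0, length(name) + 1, 1)
      if (next_char == "{" || next_char == " ") print
    }'
}

# Example, while the port-forward is still running:
#   curl -s http://localhost:15014/metrics | metric_lines pilot_xds_pushes
```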

Everything below comes from http://istiod-gloo.istio-system.svc:15014/metrics. For anything beyond ad-hoc inspection, drop a ServiceMonitor on port 15014 and let Prometheus do the scraping. The YAML for that is right below.


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod-gloo
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
      istio.io/rev: gloo
  endpoints:
  - port: http-monitoring   # port 15014
    interval: 15s
    path: /metrics

xDS protocol types — what wds and wads mean

In Ambient mode the old sidecar xDS types (LDS / RDS / CDS / EDS) are replaced for the data plane by two Ambient-specific APIs — so on a pure Ambient cluster you'll only see wds and wads in push metrics. If the classic types are showing up too, there are still sidecars somewhere in the mesh.

wds

Workload Discovery Service

Pushes workload identity and address state to ztunnel — pod IPs, SPIFFE IDs, service VIPs, and endpoint health. ztunnel on every node subscribes to WDS to build its HBONE tunnel routing table. Replaces EDS + CDS for the Ambient data plane.

wads

Workload Authorization Discovery Service

Pushes authorization policy state to ztunnel: the L4 allow/deny rules derived from AuthorizationPolicy resources, which ztunnel enforces on each node. Waypoint addresses and service-to-waypoint bindings travel with the workload and service entries pushed over WDS.

lds / rds / cds / eds

Classic sidecar xDS types

Listener / Route / Cluster / Endpoint Discovery. Only present if sidecar Envoy proxies are connected (non-Ambient workloads). In a pure Ambient cluster these counters stay at zero.

type.googleapis.com/istio.workload.Address

WDS payload type

The protobuf type pushed over WDS. The pilot_xds_config_size_bytes histogram uses this as a label — lets you track how large each WDS push is as your workload count grows (expect ~500B–1KB per workload).

Certificate & CA health

istiod is the mesh CA, so it knows exactly when its own certs expire — and it tells you. Root cert expiry is the metric to care about. Rotating a root CA mid-flight is the kind of operation you want to plan and rehearse, not improvise: every intermediate has to be re-issued, and any workload that doesn't pick up the rotation cleanly becomes a P1. Wire the 30-day alert in. Earlier if you can.

| Metric | Type | What it measures | Alert when |
| --- | --- | --- | --- |
| citadel_server_root_cert_expiry_seconds | gauge | Seconds until the root CA cert expires. Negative = already expired. | < 2592000 (30 days) |
| citadel_server_root_cert_expiry_timestamp | gauge | Unix timestamp of root cert expiry; useful for dashboards. | < time() + 2592000 |
| citadel_server_cert_chain_expiry_seconds | gauge | Seconds until the istiod-issued workload cert chain expires. | < 86400 (1 day) |
| citadel_server_cert_chain_expiry_timestamp | gauge | Unix timestamp of workload cert chain expiry. | < time() + 86400 |

Demo vs prod, on the numbers: the self-signed root CA that istiod generates by default (which is what this repo uses on kind) is good for 10 years, so the gauges barely move. In prod with BYO-CA intermediates the cert chain is typically 1–5 years, and the 30-day warning is the one that gives you a usable window to rotate.
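
To make the thresholds concrete, here is a small sketch that turns a raw citadel_server_root_cert_expiry_seconds value into days remaining and a severity. It uses the 30-day warning from the table and the 7-day critical threshold from the alert rules on this page; the function name cert_status is hypothetical.

```shell
# cert_status SECONDS: classify a root cert expiry gauge value against the
# 30-day warning and 7-day critical thresholds used on this page.
cert_status() {
  awk -v s="$1" 'BEGIN {
    s += 0                                  # force numeric; accepts 3.1536e+08
    level = "ok"
    if (s < 2592000) level = "warning"      # 30 days
    if (s < 604800)  level = "critical"     # 7 days
    printf "%d days, %s\n", s / 86400, level
  }'
}

cert_status 2500000      # prints: 28 days, warning
cert_status 3.1536e+08   # prints: 3650 days, ok
```

Prometheus often prints large gauges in exponent form, which is why the value is coerced with awk rather than shell integer arithmetic.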

Multi-cluster connectivity

If you're picking one metric to know whether multicluster is currently working, pick istiod_managed_clusters. A value of 0 for cluster_type="remote" on a cluster that should have peers means istiod has lost the xDS link to the other side and cross-cluster endpoint rewriting has stopped, so cross-cluster (east-west) calls will start failing if they haven't already. Page on this one.

| Metric | Type | What it measures | Alert when |
| --- | --- | --- | --- |
| istiod_managed_clusters{cluster_type="local"} | gauge | Always 1; confirms this istiod is managing its own cluster. | != 1 |
| istiod_managed_clusters{cluster_type="remote"} | gauge | Number of remote clusters this istiod has live xDS connections to. Drops to 0 when the remote secret is missing or the peer istiod is unreachable. | < expected peer count |
| istiod_uptime_seconds | gauge | Seconds since istiod started. Repeatedly low values mean istiod is crash-looping. | value falls back toward 0 unexpectedly |
| istio_build{tag="..."} | gauge | Always 1; the tag label carries the Solo Istio version string. | version changes unexpectedly |

When this alert fires: after an istiod pod restart, give the remote secret reconnect 2–3 minutes before treating it as an incident; it isn't instant. Still 0 after that? Two places to look. First, kubectl -n istio-system get secret | grep istio-remote-secret to confirm the secret is present. Then kubectl -n istio-system logs deploy/istiod-gloo | grep -i "peer\|remote\|delta" for the underlying error.
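
The table's "< expected peer count" condition can also be checked by hand against a metrics dump. A minimal sketch, assuming you know how many peers each cluster should have; the helper name remote_peers_ok is hypothetical.

```shell
# remote_peers_ok EXPECTED: read istiod /metrics text on stdin and compare the
# cluster_type="remote" gauge with the number of peers you expect.
# Exit status is 0 when enough peers are connected, 1 otherwise.
remote_peers_ok() {
  awk -v want="$1" '
    index($0, "istiod_managed_clusters{") == 1 && index($0, "cluster_type=\"remote\"") > 0 {
      have = $NF
    }
    END {
      printf "remote=%d expected=%d\n", have, want
      exit (have + 0 < want + 0) ? 1 : 0
    }'
}

# Example for a two-cluster setup (one expected peer):
#   curl -s http://localhost:15014/metrics | remote_peers_ok 1 || echo "peer missing"
```

A missing series is treated the same as 0, so the check also fails when the gauge is not exported at all.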

Proxy sync & xDS push health

These metrics catch one of the worst failure modes in Ambient: the waiting for sync deadlock. One misbehaving client fails auth, istiod's discovery filter clams up, and every xDS push stops. The counters flatline, the push time histograms quit moving, and workloads keep running on stale config — silently, often for hours. This is the section where the alerts earn their keep.
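
Before Prometheus is wired up, the flatline can be spotted by comparing two metrics snapshots taken a few minutes apart. This is a rough sketch under stated assumptions: the helper names sum_metric and push_stalled and the snapshot file names are hypothetical, and the snapshots are plain dumps of the /metrics endpoint.

```shell
# sum_metric NAME FILE: add up every sample of one metric family in FILE.
sum_metric() {
  awk -v name="$1" '
    index($0, name "{") == 1 || index($0, name " ") == 1 { total += $NF }
    END { printf "%d\n", total }' "$2"
}

# push_stalled BEFORE AFTER: succeed (exit 0) when proxies are connected in
# AFTER but pilot_xds_pushes did not move between the two snapshots.
push_stalled() {
  before=$(sum_metric pilot_xds_pushes "$1")
  after=$(sum_metric pilot_xds_pushes "$2")
  conns=$(sum_metric pilot_xds "$2")
  [ "$conns" -gt 0 ] && [ "$after" -eq "$before" ]
}

# Example:
#   curl -s http://localhost:15014/metrics > before.txt; sleep 300
#   curl -s http://localhost:15014/metrics > after.txt
#   push_stalled before.txt after.txt && echo "connected proxies, no pushes"
```

A quiet mesh with no config or endpoint changes can also show a flat counter, so treat a hit as a prompt to read the istiod logs rather than as proof of a deadlock.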

| Metric | Type | What it measures | Alert when |
| --- | --- | --- | --- |
| pilot_xds{version="..."} | gauge | Live xDS connections right now. In a pure Ambient cluster expect 2× ztunnel pods + N waypoints. | drops unexpectedly |
| pilot_proxy_convergence_time_count | histogram (_count) | Total proxy config pushes that completed successfully (proxy ACK'd). Stops incrementing when the discovery filter is blocked. | rate(…[5m]) == 0 while pilot_xds > 0 |
| pilot_proxy_convergence_time_sum / _count | histogram | Average time from config change to proxy ACK. In a healthy kind cluster <10ms. Above 1s signals push queue pressure. | avg > 1s |
| pilot_proxy_queue_time_count | histogram (_count) | Proxies dequeued from the push queue. Should match convergence count in a healthy cluster. | diverges from convergence count |
| pilot_xds_pushes{type="wds"} | counter | Total WDS pushes (workload address updates to ztunnel). Stalls when the discovery filter is blocked. | rate(…[5m]) == 0 |
| pilot_xds_pushes{type="wads"} | counter | Total WADS pushes (workload authorization updates to ztunnel). Stalls when the discovery filter is blocked. | rate(…[5m]) == 0 |
| pilot_xds_push_time_sum / _count{type="wds"} | histogram | Average time to generate and send a WDS push. Should be <5ms on kind. | avg > 500ms |
| pilot_xds_push_time_sum / _count{type="wads"} | histogram | Average time to generate and send a WADS push. | avg > 500ms |
| pilot_xds_recv_max | gauge | Largest xDS request (ACK/NACK) received from any client, in bytes. Useful for detecting unexpectedly large proxy state. | none |
| pilot_xds_config_size_bytes{type="istio.workload.Address"} | histogram | Distribution of WDS push payload sizes. Grows linearly with workload count (~500B–1KB per workload). Watch for sudden spikes. | sudden spike > 2× baseline |

"Waiting for sync" isn't a metric. It only shows up in istiod logs. The signal you can alert on is pilot_xds_pushes flat while pilot_xds{} > 0 — proxies connected, nothing being pushed. The most common cause is a proxy presenting the wrong CLUSTER_ID. agentgateway waypoints in particular default to "Kubernetes" instead of the actual cluster name, which makes istiod refuse to talk to them. Fix: patch the waypoint Deployment with the correct CLUSTER_ID env var.

Push pipeline & config churn

| Metric | Type | What it measures | Alert when |
| --- | --- | --- | --- |
| pilot_push_triggers{type="ambient"} | counter | Push batches triggered by Ambient-specific config changes (ztunnel labels, waypoint changes). | unexpected spike |
| pilot_push_triggers{type="endpoint"} | counter | Push batches from endpoint changes (pod start/stop, rolling deploys). | sustained high rate outside deploy windows |
| pilot_push_triggers{type="global"} | counter | Full mesh-wide push triggers, fired on config changes that affect all proxies (new Policy, new Service). High rate = config storm. | rate > 5/min |
| pilot_debounce_time_sum / _count | histogram | Average time config changes are held in the debounce window before being merged into a single push. High values mean a config storm is generating rapid successive changes. | avg > 1s |
| pilot_pushcontext_init_seconds | histogram | Time to fully rebuild the push context (mesh-wide config snapshot). High values indicate a slow Kubernetes API or very large config surface. | avg > 1s |
| pilot_inbound_updates{type="config"} | counter | Config object change events received from Kubernetes (Gateway, HTTPRoute, Policy, etc.). | rate spike outside deploy windows |
| pilot_inbound_updates{type="eds"} | counter | Endpoint slice change events. Spikes on rolling deploys, stays low in steady state. | sustained >10/s |
| pilot_services | gauge | Total services known to istiod (K8s Services + ServiceEntries). Unexpected drop = services removed or istiod lost its K8s watch. | drops unexpectedly |
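
The "_sum / _count" rows are averages: total observed time divided by the number of observations. A minimal sketch of that arithmetic against a raw metrics dump, with the hypothetical helper name hist_avg. Note that it computes the average since istiod started, whereas a rate()-based query averages over a recent window.

```shell
# hist_avg NAME: read /metrics text on stdin and print NAME_sum / NAME_count,
# the lifetime average of a histogram, or "n/a" if nothing was observed.
hist_avg() {
  awk -v name="$1" '
    index($0, name "_sum") == 1   { sum += $NF }
    index($0, name "_count") == 1 { count += $NF }
    END {
      if (count > 0) printf "%.3f\n", sum / count
      else print "n/a"
    }'
}

# Example:
#   curl -s http://localhost:15014/metrics | hist_avg pilot_debounce_time
```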

Errors & config conflicts

| Metric | Type | What it measures | Alert when |
| --- | --- | --- | --- |
| endpoint_no_pod | gauge | Endpoint addresses with no matching pod: stale endpoints from crashed pods. Should be 0 in steady state. | > 0 for > 60s |
| pilot_eds_no_instances | gauge | EDS clusters (services) with zero endpoints. Traffic to these services will 503. | > 0 |
| pilot_endpoint_not_ready | gauge | Endpoints in unready state (readiness probe failing). Istiod excludes these from traffic; confirm readiness probe config if persistently >0. | > 0 for > 120s |
| pilot_no_ip | gauge | Pods not in the endpoint table: pod/endpoint sync lag or evicted pods. Should be 0. | > 0 |
| pilot_conflict_inbound_listener | gauge | Conflicting inbound listener configurations: two services fighting for the same port. Traffic will be misrouted. | > 0 |
| pilot_conflict_outbound_listener_tcp_over_current_tcp | gauge | Conflicting outbound TCP listeners. Indicates ServiceEntry or Service port collisions. | > 0 |
| pilot_duplicate_envoy_clusters | gauge | Duplicate Envoy cluster names caused by ServiceEntries sharing a hostname. Can cause silent traffic mis-routing. | > 0 |
| pilot_destrule_subsets | gauge | Duplicate DestinationRule subsets across rules targeting the same host. | > 0 |
| galley_validation_config_update_error{reason="Conflict"} | counter | Webhook configuration update conflicts, commonly the Gloo Operator trying to update a webhook config already owned by another controller. | > 0 and increasing |
| pilot_k8s_proxies_with_no_service_targets | counter | Proxies (typically waypoint Gateways) with no matching K8s Service targets. Expected for waypoint-style Gateways that don't back a Service directly. | none |
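
Most gauges in this table share the same rule of thumb: anything above 0 deserves a look. As a quick sketch, the hypothetical helper below prints only the error gauges from the table that are currently non-zero. The list in ERROR_GAUGES is copied from the rows above and is easy to trim.

```shell
# Gauges from the table above whose healthy value is 0.
ERROR_GAUGES="endpoint_no_pod pilot_eds_no_instances pilot_endpoint_not_ready \
pilot_no_ip pilot_conflict_inbound_listener \
pilot_conflict_outbound_listener_tcp_over_current_tcp \
pilot_duplicate_envoy_clusters pilot_destrule_subsets"

# nonzero_error_gauges: read /metrics text on stdin and print name=value for
# every listed gauge whose current value is above 0.
nonzero_error_gauges() {
  awk -v list="$ERROR_GAUGES" '
    BEGIN { n = split(list, names, " ") }
    /^#/ { next }
    {
      for (i = 1; i <= n; i++) {
        is_family = index($0, names[i] " ") == 1 || index($0, names[i] "{") == 1
        if (is_family && $NF + 0 > 0) print names[i] "=" $NF
      }
    }'
}

# Example:
#   curl -s http://localhost:15014/metrics | nonzero_error_gauges
```

No output means every listed gauge is at 0 or not exported.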

Process & runtime health

| Metric | Type | What it measures | Alert when |
| --- | --- | --- | --- |
| go_goroutines | gauge | Active goroutines in istiod. Normal range at startup: 800–1500. Sustained growth above 3000 indicates a goroutine leak. | > 3000 and growing |
| process_resident_memory_bytes | gauge | RSS memory. istiod typically uses 100–500 MB depending on mesh size. On kind (constrained nodes) watch for OOMKill. | > 800MB on kind |
| go_memstats_heap_alloc_bytes | gauge | Live heap allocation. Sustained growth between GC cycles indicates a memory leak. | sustained growth trend |
| process_cpu_seconds_total | counter | Cumulative CPU time. Use rate(…[5m]) for per-second load. High sustained rate during steady state = config churn or push loop. | rate > 2 cores steady-state |
| process_open_fds | gauge | Open file descriptors. Each xDS stream + K8s watch consumes an fd. Growth over time = fd leak. | > 80% of process_max_fds |
| go_sched_gomaxprocs_threads | gauge | GOMAXPROCS: number of OS threads istiod can run Go code on. Reflects CPU limit of the pod. | none |
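
The file descriptor threshold is a ratio of two gauges. A small sketch of the same arithmetic against a raw metrics dump; the helper name fd_usage_pct is hypothetical.

```shell
# fd_usage_pct: read /metrics text on stdin and print open file descriptors
# as a whole-number percentage of the process limit, or "n/a" if the limit
# is not exported.
fd_usage_pct() {
  awk '
    $1 == "process_open_fds" { used = $2 }
    $1 == "process_max_fds"  { limit = $2 }
    END {
      if (limit > 0) printf "%d\n", used * 100 / limit
      else print "n/a"
    }'
}

# Example:
#   curl -s http://localhost:15014/metrics | fd_usage_pct
```

A result above 80 corresponds to the alert threshold in the table.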

Prometheus alert rules

Apply this as a PrometheusRule CR if you're on kube-prometheus-stack; otherwise lift the spec.groups block into a standalone rules.yaml. Every rule lines up with a metric from the tables above. The dashboard ships with most of these signals already; the alert YAML doesn't, so it's here.


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-ambient-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:

  - name: istio.cert
    interval: 60s
    rules:

    - alert: IstiodRootCertExpiryWarning
      expr: citadel_server_root_cert_expiry_seconds < 2592000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "istiod root CA cert expiring in < 30 days"
        description: "Root cert expires in {{ humanizeDuration $value }}. Root CA rotation is disruptive — plan ahead."

    - alert: IstiodRootCertExpiryCritical
      expr: citadel_server_root_cert_expiry_seconds < 604800
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "istiod root CA cert expiring in < 7 days"
        description: "Root cert expires in {{ humanizeDuration $value }}. Immediate action required."

    - alert: IstiodWorkloadCertExpiryWarning
      expr: citadel_server_cert_chain_expiry_seconds < 86400
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "istiod workload cert chain expiring in < 1 day"
        description: "Cert chain expires in {{ humanizeDuration $value }}."

  - name: istio.multicluster
    interval: 30s
    rules:

    - alert: IstiodRemoteClusterDisconnected
      expr: istiod_managed_clusters{cluster_type="remote"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "istiod has no remote cluster connections"
        description: "Cross-cluster endpoint rewriting is broken. Check istio-remote-secret-* in istio-system and peer istiod logs."

    - alert: IstiodDown
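      # absent() covers the case where the metrics endpoint is unreachable and the series disappears
      expr: istiod_managed_clusters{cluster_type="local"} != 1 or absent(istiod_managed_clusters{cluster_type="local"})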
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "istiod is not managing its local cluster"
        description: "istiod local cluster gauge != 1. Pod may be crash-looping or metrics endpoint is unreachable."

  - name: istio.xds.sync
    interval: 30s
    rules:

    - alert: IstiodXdsPushStall
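      # pilot_xds_pushes carries a type label and pilot_xds a version label,
      # so the two sides only match when label matching is disabled with on ()
      expr: |
        sum by (type) (rate(pilot_xds_pushes[5m])) == 0
        and on ()
        sum(pilot_xds) > 0
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "istiod xDS pushes have stalled"
        description: "{{ $labels.type }} push rate is 0 while proxies are still connected. Discovery filter may be blocked; check istiod logs for 'waiting for sync' and auth errors."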

    - alert: IstiodProxyConvergenceSlow
      expr: |
        rate(pilot_proxy_convergence_time_sum[5m])
        /
        rate(pilot_proxy_convergence_time_count[5m]) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Proxy config convergence avg > 1s"
        description: "Average time for a proxy to receive and ACK config is {{ $value | humanizeDuration }}. Push queue may be overloaded."

    - alert: IstiodWdsPushSlow
      expr: |
        rate(pilot_xds_push_time_sum{type="wds"}[5m])
        /
        rate(pilot_xds_push_time_count{type="wds"}[5m]) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "WDS push time avg > 500ms"
        description: "Workload Discovery Service pushes to ztunnel are slow ({{ $value | humanizeDuration }}). May indicate large workload counts or push queue contention."

  - name: istio.config.errors
    interval: 60s
    rules:

    - alert: IstiodStaleEndpoints
      expr: endpoint_no_pod > 0
      for: 60s
      labels:
        severity: warning
      annotations:
        summary: "{{ $value }} endpoints with no backing pod"
        description: "Stale endpoint entries — crashed pod endpoints not cleaned up. Traffic to these endpoints will fail."

    - alert: IstiodEmptyService
      expr: pilot_eds_no_instances > 0
      for: 30s
      labels:
        severity: warning
      annotations:
        summary: "{{ $value }} services have zero endpoints"
        description: "Services with no endpoints will return 503. May indicate a deployment failure or misconfigured selector."

    - alert: IstiodListenerConflict
      expr: pilot_conflict_inbound_listener > 0 or pilot_conflict_outbound_listener_tcp_over_current_tcp > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "istiod listener conflicts detected"
        description: "{{ $value }} conflicting listeners. Traffic may be misrouted. Check for Service port collisions or duplicate ServiceEntries."

    - alert: IstiodConfigPushStorm
      expr: rate(pilot_push_triggers{type="global"}[5m]) * 60 > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "istiod global push rate > 5/min"
        description: "Frequent full mesh-wide pushes indicate a config storm. Check for a controller repeatedly updating CRDs."

  - name: istio.process
    interval: 60s
    rules:

    - alert: IstiodGoroutineLeak
      expr: go_goroutines{job="istiod-gloo"} > 3000
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "istiod goroutine count > 3000"
        description: "{{ $value }} goroutines. Sustained growth suggests a goroutine leak. Restart istiod if count keeps growing."

    - alert: IstiodHighMemory
      expr: process_resident_memory_bytes{job="istiod-gloo"} > 838860800
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "istiod RSS > 800MB"
        description: "{{ $value | humanize1024 }}B resident memory. On resource-constrained nodes this risks OOMKill."

    - alert: IstiodFdExhaustion
      expr: process_open_fds{job="istiod-gloo"} / process_max_fds{job="istiod-gloo"} > 0.8
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "istiod file descriptor usage > 80%"
        description: "{{ $value | humanizePercentage }} of max FDs in use. Approaching exhaustion will cause new XDS stream failures."

Diagnostic commands

Is the discovery filter stuck?

# Drop the repeated "waiting for sync" lines so the root cause stands out
kubectl -n istio-system logs deploy/istiod-gloo --tail=300 \
  | grep -v "waiting for sync" \
  | grep -iE "error|auth|cluster"

# Common root cause line:
# "client claims to be in cluster \"Kubernetes\", but we only know about
#  local cluster \"east-ag\" and remote clusters [west-ag]"
# Fix: patch the waypoint Deployment with CLUSTER_ID=<cluster-name>

Which proxies are connected and synced?

istioctl --context $CLUSTER1 proxy-status
# SYNCED = healthy
# STALE  = config pushed but not yet ACK'd
# NOT SENT = istiod hasn't pushed config at all (stuck filter)

Is the remote cluster peering alive?

istioctl multicluster check --verbose \
  --contexts="${CLUSTER1},${CLUSTER2}"

# Checks: license, pod health, east-west gateway programmed,
# PeeringSucceeded, PeerConnected, PeerDataPlaneProgrammed

Required peering env vars — verify they're set

for CTX in $CLUSTER1 $CLUSTER2; do
  echo "=== $CTX ==="
  echo "--- istiod (need PILOT_ENABLE_K8S_SELECT_WORKLOAD_ENTRIES=false) ---"
  kubectl --context $CTX get deploy istiod-gloo -n istio-system \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
    | grep -E "K8S_SELECT|PEERING|CLUSTER_ID|LICENSE"

  echo "--- ztunnel (need L7_ENABLED=true) ---"
  kubectl --context $CTX get ds ztunnel -n istio-system \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
    | grep -E "L7_ENABLED|NETWORK|CLUSTER_ID"
done

ztunnel workload and waypoint view

ZTUNNEL=$(kubectl -n istio-system get pod -l app=ztunnel -o name | head -1 | sed 's|pod/||')

# All workloads ztunnel knows about (including cross-cluster)
istioctl ztunnel-config workloads $ZTUNNEL -n istio-system

# Services and their waypoints
istioctl ztunnel-config services $ZTUNNEL -n istio-system