Istio Ambient Metrics & Alerting, by Tom O'Rourke

What you'll learn

This is a field reference, not a step-by-step lab. Use it to know:

Which istiod metrics flag cert expiry, a dropped multi-cluster XDS link, or a stalled xDS push, and the threshold to alert at.
What wds and wads mean in Ambient, and why the classic lds/rds/cds/eds counters should sit at zero.
How to scrape istiod-gloo with a ServiceMonitor and ship the alerts as a PrometheusRule.
How to diagnose the "waiting for sync" deadlock, most often a waypoint presenting the wrong CLUSTER_ID.

How to collect these metrics

Port-forward to the istiod metrics endpoint

kubectl -n istio-system port-forward svc/istiod-gloo 15014:15014 &
sleep 2   # give the port-forward a moment to bind before curling
curl -s http://localhost:15014/metrics
kill %1
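
The raw dump runs to thousands of lines, so a small filter helps during ad-hoc inspection. The helper below is a sketch, not part of any Istio tooling: `metric_lines` is a hypothetical name, and it assumes the endpoint returns the plain Prometheus text format.

```shell
# metric_lines NAME
# Reads Prometheus text-format metrics on stdin and prints every sample of
# metric NAME, whatever its labels. Hypothetical helper for ad-hoc use.
metric_lines() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      split($1, parts, "{")            # metric name is everything before "{"
      if (parts[1] == name) print
    }'
}
```

With the port-forward running, `curl -s http://localhost:15014/metrics | metric_lines pilot_xds_pushes` prints only the push counters. The match is on the exact metric name, so asking for `pilot_xds` does not also return `pilot_xds_pushes`.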

Everything below comes from http://istiod-gloo.istio-system.svc:15014/metrics. For anything beyond ad-hoc inspection, drop a ServiceMonitor on port 15014 and let Prometheus do the scraping. The YAML for that is right below.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod-gloo
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
      istio.io/rev: gloo
  endpoints:
  - port: http-monitoring   # port 15014
    interval: 15s
    path: /metrics

xDS protocol types — what wds and wads mean

In Ambient mode, ztunnel does not use the old sidecar xDS types (LDS / RDS / CDS / EDS). It is fed by two Ambient-specific APIs instead, so on a pure Ambient cluster with no Envoy proxies attached you'll only see wds and wads in push metrics. If the classic types are showing up too, some Envoy is still connected to istiod: a sidecar somewhere in the mesh, or an Envoy-based gateway or waypoint.

wds

Workload Discovery Service

Pushes workload identity and address state to ztunnel — pod IPs, SPIFFE IDs, service VIPs, and endpoint health. ztunnel on every node subscribes to WDS to build its HBONE tunnel routing table. Replaces EDS + CDS for the Ambient data plane.

wads

Workload Authorization Discovery Service

Pushes authorization policy to ztunnel so it can enforce L4 AuthorizationPolicy on the node. Waypoint addresses and service-to-waypoint bindings are not carried here; they arrive over WDS as part of each workload and service address. If WADS pushes stop, ztunnel keeps enforcing whatever policy it last received.

lds / rds / cds / eds

Classic sidecar xDS types

Listener / Route / Cluster / Endpoint Discovery. Only present if Envoy proxies are connected: sidecars (non-Ambient workloads) or Envoy-based gateways and waypoints. In a cluster where only ztunnel talks to istiod, these counters stay at zero.

type.googleapis.com/istio.workload.Address

WDS payload type

The protobuf type pushed over WDS. The pilot_xds_config_size_bytes histogram uses this as a label — lets you track how large each WDS push is as your workload count grows (expect ~500B–1KB per workload).
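
As a rough sizing aid, the per-workload figure above turns into a back-of-envelope range. This is a sketch only: `wds_payload_range` is a hypothetical name, and it assumes 1KB means 1024 bytes and that the push carries every workload.

```shell
# wds_payload_range WORKLOADS
# Prints a low and high estimate, in bytes, for a WDS push that carries
# every workload, using the ~500B to 1KB per-workload figure quoted above.
# Assumption: 1KB is taken as 1024 bytes.
wds_payload_range() {
  workloads=$1
  echo "$((workloads * 500)) $((workloads * 1024))"
}
```

For example, `wds_payload_range 2000` prints `1000000 2048000`, roughly 1 to 2 MB. Treat the result as a baseline to compare the histogram against, not a limit.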

Certificate & CA health

istiod is the mesh CA, so it knows exactly when its own certs expire — and it tells you. Root cert expiry is the metric to care about. Rotating a root CA mid-flight is the kind of operation you want to plan and rehearse, not improvise: every intermediate has to be re-issued, and any workload that doesn't pick up the rotation cleanly becomes a P1. Wire the 30-day alert in. Earlier if you can.

Metric	Type	What it measures	Alert when
`citadel_server_root_cert_expiry_seconds`	gauge	Seconds until the root CA cert expires. Negative = already expired.	< 2592000 (30 days)
`citadel_server_root_cert_expiry_timestamp`	gauge	Unix timestamp of root cert expiry — useful for dashboards.	< time() + 2592000
`citadel_server_cert_chain_expiry_seconds`	gauge	Seconds until the istiod-issued workload cert chain expires.	< 86400 (1 day)
`citadel_server_cert_chain_expiry_timestamp`	gauge	Unix timestamp of workload cert chain expiry.	< time() + 86400

Demo vs prod, on the numbers: the self-signed root that istiod generates on the kind demo clusters (which is what this repo uses) is valid for 10 years, so the gauges barely move. In prod with BYO-CA intermediates the cert chain is typically valid for 1–5 years, and the 30-day warning is the one that gives you a usable window to rotate.
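
The gauges report seconds, which are awkward to read by eye. The sketch below converts the root cert gauge from a metrics dump into whole days. `root_cert_days_left` is a hypothetical helper name, and the function assumes Prometheus text format on stdin.

```shell
# root_cert_days_left
# Reads Prometheus text-format metrics on stdin and prints the whole days left
# on the root CA cert, taken from citadel_server_root_cert_expiry_seconds.
# Values in scientific notation (e.g. 3.1536e+08) are parsed by awk.
root_cert_days_left() {
  awk '
    /^#/ { next }
    {
      split($1, parts, "{")            # tolerate labels, if any are present
      if (parts[1] == "citadel_server_root_cert_expiry_seconds") {
        print int(($NF + 0) / 86400)   # 86400 seconds per day, truncated toward zero
        found = 1
      }
    }
    END { if (!found) exit 1 }'         # non-zero status if the gauge is missing
}
```

With the port-forward up, `curl -s http://localhost:15014/metrics | root_cert_days_left` gives the number to compare against the 30-day threshold: 2592000 seconds comes out as 30.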

Multi-cluster connectivity

If you're picking one metric to know whether multicluster is currently working, pick istiod_managed_clusters. A value of 0 for cluster_type="remote" on a cluster that should have peers means istiod has lost the xDS link to the other side and cross-cluster endpoint rewriting has stopped. East-west calls will start failing if they haven't already. Page on this one.

Metric	Type	What it measures	Alert when
`istiod_managed_clusters{cluster_type="local"}`	gauge	Always 1 — confirms this istiod is managing its own cluster.	!= 1
`istiod_managed_clusters{cluster_type="remote"}`	gauge	Number of remote clusters this istiod has live XDS connections to. Drops to 0 when the remote secret is missing or the peer istiod is unreachable.	< expected peer count
`istiod_uptime_seconds`	gauge	How long since istiod started. Frequent low values = crash-looping.	resets to near 0 outside a planned restart
`istio_build{tag="..."}`	gauge	Always 1 — the `tag` label carries the Solo Istio version string.	version changes unexpectedly

When this alert fires: after an istiod pod restart, give the remote secret reconnect 2–3 minutes before treating it as an incident; it isn't instant. Still 0 after that? Two places to look. First run `kubectl -n istio-system get secret | grep istio-remote-secret` to confirm the secret is present, then `kubectl -n istio-system logs deploy/istiod-gloo | grep -i "peer\|remote\|delta"` for the underlying error.
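
The table's threshold is "fewer than the expected peer count", which also catches a partial loss rather than only a drop to 0. A minimal sketch of that comparison against a metrics dump follows. `check_remote_peers` is a hypothetical helper, and the expected count is something you supply.

```shell
# check_remote_peers EXPECTED
# Reads Prometheus text-format metrics on stdin and compares the
# istiod_managed_clusters{cluster_type="remote"} gauge with EXPECTED.
# Exit status: 0 = all peers connected, 1 = fewer than expected, 2 = gauge missing.
check_remote_peers() {
  awk -v expected="$1" '
    index($1, "istiod_managed_clusters{") == 1 && index($0, "cluster_type=\"remote\"") > 0 {
      have = $NF + 0
      seen = 1
    }
    END {
      if (!seen) { print "remote cluster gauge not found"; exit 2 }
      if (have < expected) {
        printf "DEGRADED: %d of %d remote clusters connected\n", have, expected
        exit 1
      }
      printf "OK: %d of %d remote clusters connected\n", have, expected
    }'
}
```

For a cluster that should see two peers: `curl -s http://localhost:15014/metrics | check_remote_peers 2`.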

Proxy sync & xDS push health

These metrics catch one of the worst failure modes in Ambient: the "waiting for sync" deadlock. One misbehaving client fails auth, istiod's discovery filter clams up, and every xDS push stops. The counters flatline, the push time histograms quit moving, and workloads keep running on stale config, silently and often for hours. This is the section where the alerts earn their keep.

Metric	Type	What it measures	Alert when
`pilot_xds{version="..."}`	gauge	Live XDS connections right now. In a pure Ambient cluster expect 2× ztunnel pods + N waypoints.	drops unexpectedly
`pilot_proxy_convergence_time_count`	histogram counter	Total proxy config pushes that successfully completed (proxy ACK'd). Stops incrementing when the discovery filter is blocked.	rate(…[5m]) == 0 while pilot_xds > 0
`pilot_proxy_convergence_time_sum / _count`	histogram	Average time from config change to proxy ACK. In a healthy kind cluster <10ms. Above 1s signals push queue pressure.	avg > 1s
`pilot_proxy_queue_time_count`	histogram counter	Proxies dequeued from the push queue. Should match convergence count in a healthy cluster.	diverges from convergence count
`pilot_xds_pushes{type="wds"}`	counter	Total WDS pushes (workload address updates to ztunnel). Stalls when the discovery filter is blocked.	rate(…[5m]) == 0
`pilot_xds_pushes{type="wads"}`	counter	Total WADS pushes (authorization policy updates to ztunnel). Stalls when the discovery filter is blocked.	rate(…[5m]) == 0
`pilot_xds_push_time_sum / _count{type="wds"}`	histogram	Average time to generate and send a WDS push. Should be <5ms on kind.	avg > 500ms
`pilot_xds_push_time_sum / _count{type="wads"}`	histogram	Average time to generate and send a WADS push.	avg > 500ms
`pilot_xds_recv_max`	gauge	Largest xDS request (ACK/NACK) received from any client in bytes. Useful for detecting unexpectedly large proxy state.	—
`pilot_xds_config_size_bytes{type="type.googleapis.com/istio.workload.Address"}`	histogram	Distribution of WDS push payload sizes. Grows linearly with workload count (~500B–1KB per workload). Watch for sudden spikes.	sudden spike > 2×baseline

"Waiting for sync" isn't a metric. It only shows up in istiod logs. The signal you can alert on is pilot_xds_pushes flat while pilot_xds{} > 0 — proxies connected, nothing being pushed. The most common cause is a proxy presenting the wrong CLUSTER_ID. agentgateway waypoints in particular default to "Kubernetes" instead of the actual cluster name, which makes istiod refuse to talk to them. Fix: patch the waypoint Deployment with the correct CLUSTER_ID env var.
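
The fix described above can be sketched as a small wrapper around `kubectl set env`. The function name is hypothetical and all three arguments are placeholders for your own namespace, Deployment, and cluster name.

```shell
# set_waypoint_cluster_id NAMESPACE DEPLOYMENT CLUSTER_NAME
# Sets the CLUSTER_ID env var on a waypoint Deployment, then waits for the
# rollout to finish. All three arguments are placeholders for your own values.
set_waypoint_cluster_id() {
  ns=$1; deploy=$2; cluster=$3
  kubectl -n "$ns" set env "deployment/$deploy" "CLUSTER_ID=$cluster" &&
    kubectl -n "$ns" rollout status "deployment/$deploy" --timeout=120s
}
```

Example call, with an illustrative namespace and Deployment name: `set_waypoint_cluster_id bookinfo waypoint east-ag`. If a gateway controller owns the Deployment it may overwrite a manual change, in which case the value has to be set wherever that controller takes its configuration.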

Push pipeline & config churn

Metric	Type	What it measures	Alert when
`pilot_push_triggers{type="ambient"}`	counter	Push batches triggered by Ambient-specific config changes (ztunnel labels, waypoint changes).	unexpected spike
`pilot_push_triggers{type="endpoint"}`	counter	Push batches from endpoint changes (pod start/stop, rolling deploys).	sustained high rate outside deploy windows
`pilot_push_triggers{type="global"}`	counter	Full mesh-wide push triggers — fired on config changes that affect all proxies (new Policy, new Service). High rate = config storm.	rate > 5/min
`pilot_debounce_time_sum / _count`	histogram	Average time config changes are held in the debounce window before being merged into a single push. High values mean a config storm is generating rapid successive changes.	avg > 1s
`pilot_pushcontext_init_seconds`	histogram	Time to fully rebuild the push context (mesh-wide config snapshot). High values indicate a slow Kubernetes API or very large config surface.	avg > 1s
`pilot_inbound_updates{type="config"}`	counter	Config object change events received from Kubernetes (Gateway, HTTPRoute, Policy, etc.).	rate spike outside deploy windows
`pilot_inbound_updates{type="eds"}`	counter	Endpoint slice change events — spikes on rolling deploys, stays low in steady state.	sustained >10/s
`pilot_services`	gauge	Total services known to istiod (K8s Services + ServiceEntries). Unexpected drop = services removed or istiod lost its K8s watch.	drops unexpectedly

Errors & config conflicts

Metric	Type	What it measures	Alert when
`endpoint_no_pod`	gauge	Endpoint addresses with no matching pod — stale endpoints from crashed pods. Should be 0 in steady state.	> 0 for > 60s
`pilot_eds_no_instances`	gauge	EDS clusters (services) with zero endpoints — traffic to these services will 503.	> 0
`pilot_endpoint_not_ready`	gauge	Endpoints in unready state (readiness probe failing). Istiod excludes these from traffic — confirm readiness probe config if persistently >0.	> 0 for > 120s
`pilot_no_ip`	gauge	Pods not in the endpoint table — pod/endpoint sync lag or evicted pods. Should be 0.	> 0
`pilot_conflict_inbound_listener`	gauge	Conflicting inbound listener configurations — two services fighting for the same port. Traffic will be misrouted.	> 0
`pilot_conflict_outbound_listener_tcp_over_current_tcp`	gauge	Conflicting outbound TCP listeners. Indicates ServiceEntry or Service port collisions.	> 0
`pilot_duplicate_envoy_clusters`	gauge	Duplicate Envoy cluster names caused by ServiceEntries sharing a hostname. Can cause silent traffic mis-routing.	> 0
`pilot_destrule_subsets`	gauge	Duplicate DestinationRule subsets across rules targeting the same host.	> 0
`galley_validation_config_update_error{reason="Conflict"}`	counter	Webhook configuration update conflicts — commonly the Gloo Operator trying to update a webhook config already owned by another controller.	> 0 and increasing
`pilot_k8s_proxies_with_no_service_targets`	counter	Proxies (typically waypoint Gateways) with no matching K8s Service targets. Expected for waypoint-style Gateways that don't back a Service directly.	—

Process & runtime health

Metric	Type	What it measures	Alert when
`go_goroutines`	gauge	Active goroutines in istiod. Normal range at startup: 800–1500. Sustained growth above 3000 indicates a goroutine leak.	> 3000 and growing
`process_resident_memory_bytes`	gauge	RSS memory. istiod typically uses 100–500 MB depending on mesh size. On kind (constrained nodes) watch for OOMKill.	> 800MB on kind
`go_memstats_heap_alloc_bytes`	gauge	Live heap allocation. Sustained growth between GC cycles indicates a memory leak.	sustained growth trend
`process_cpu_seconds_total`	counter	Cumulative CPU time. Use `rate(…[5m])` for per-second load. High sustained rate during steady state = config churn or push loop.	rate > 2 cores steady-state
`process_open_fds`	gauge	Open file descriptors. Each XDS stream + K8s watch consumes an fd. Growth over time = fd leak.	> 80% of process_max_fds
`go_sched_gomaxprocs_threads`	gauge	GOMAXPROCS — number of OS threads istiod can run Go code on. Reflects CPU limit of the pod.	—

Prometheus alert rules

Paste this into a PrometheusRule CR if you're on kube-prometheus-stack, or a standalone rules.yaml if not (for the standalone file, keep only the groups: section). Every rule lines up with a metric from the tables above. The dashboard already covers most of these signals, but it ships without alert rules, so they're here.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-ambient-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:

  - name: istio.cert
    interval: 60s
    rules:

    - alert: IstiodRootCertExpiryWarning
      expr: citadel_server_root_cert_expiry_seconds < 2592000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "istiod root CA cert expiring in < 30 days"
        description: "Root cert expires in {{ humanizeDuration $value }}. Root CA rotation is disruptive — plan ahead."

    - alert: IstiodRootCertExpiryCritical
      expr: citadel_server_root_cert_expiry_seconds < 604800
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "istiod root CA cert expiring in < 7 days"
        description: "Root cert expires in {{ humanizeDuration $value }}. Immediate action required."

    - alert: IstiodWorkloadCertExpiryWarning
      expr: citadel_server_cert_chain_expiry_seconds < 86400
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "istiod workload cert chain expiring in < 1 day"
        description: "Cert chain expires in {{ humanizeDuration $value }}."

  - name: istio.multicluster
    interval: 30s
    rules:

    - alert: IstiodRemoteClusterDisconnected
      expr: istiod_managed_clusters{cluster_type="remote"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "istiod has no remote cluster connections"
        description: "Cross-cluster endpoint rewriting is broken. Check istio-remote-secret-* in istio-system and peer istiod logs."

    - alert: IstiodDown
      expr: istiod_managed_clusters{cluster_type="local"} != 1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "istiod is not managing its local cluster"
        description: "istiod local cluster gauge != 1. Pod may be crash-looping or metrics endpoint is unreachable."

  - name: istio.xds.sync
    interval: 30s
    rules:

    - alert: IstiodXdsPushStall
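      # pilot_xds_pushes carries a type label and pilot_xds a version label,
      # so a bare `and` would never match; join the two on instance instead.
      expr: |
        sum by (instance, type) (rate(pilot_xds_pushes{type=~"wds|wads"}[5m])) == 0
        and on (instance)
        sum by (instance) (pilot_xds) > 0
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "istiod xDS pushes have stalled"
        description: "{{ $labels.type }} push rate on {{ $labels.instance }} is 0 while proxies are still connected. Discovery filter may be blocked. Check istiod logs for 'waiting for sync' and auth errors."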

    - alert: IstiodProxyConvergenceSlow
      expr: |
        rate(pilot_proxy_convergence_time_sum[5m])
        /
        rate(pilot_proxy_convergence_time_count[5m]) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Proxy config convergence avg > 1s"
        description: "Average time for a proxy to receive and ACK config is {{ $value | humanizeDuration }}. Push queue may be overloaded."

    - alert: IstiodWdsPushSlow
      expr: |
        rate(pilot_xds_push_time_sum{type="wds"}[5m])
        /
        rate(pilot_xds_push_time_count{type="wds"}[5m]) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "WDS push time avg > 500ms"
        description: "Workload Discovery Service pushes to ztunnel are slow ({{ $value | humanizeDuration }}). May indicate large workload counts or push queue contention."

  - name: istio.config.errors
    interval: 60s
    rules:

    - alert: IstiodStaleEndpoints
      expr: endpoint_no_pod > 0
      for: 60s
      labels:
        severity: warning
      annotations:
        summary: "{{ $value }} endpoints with no backing pod"
        description: "Stale endpoint entries — crashed pod endpoints not cleaned up. Traffic to these endpoints will fail."

    - alert: IstiodEmptyService
      expr: pilot_eds_no_instances > 0
      for: 30s
      labels:
        severity: warning
      annotations:
        summary: "{{ $value }} services have zero endpoints"
        description: "Services with no endpoints will return 503. May indicate a deployment failure or misconfigured selector."

    - alert: IstiodListenerConflict
      expr: pilot_conflict_inbound_listener > 0 or pilot_conflict_outbound_listener_tcp_over_current_tcp > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "istiod listener conflicts detected"
        description: "{{ $value }} conflicting listeners. Traffic may be misrouted. Check for Service port collisions or duplicate ServiceEntries."

    - alert: IstiodConfigPushStorm
      expr: rate(pilot_push_triggers{type="global"}[5m]) > 0.08
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "istiod global push rate > 5/min"
        description: "Frequent full mesh-wide pushes indicate a config storm. Check for a controller repeatedly updating CRDs."

  - name: istio.process
    interval: 60s
    rules:

    - alert: IstiodGoroutineLeak
      # With the ServiceMonitor above, the job label defaults to the Service name.
      expr: go_goroutines{job="istiod-gloo"} > 3000
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "istiod goroutine count > 3000"
        description: "{{ $value }} goroutines. Sustained growth suggests a goroutine leak. Restart istiod if count keeps growing."

    - alert: IstiodHighMemory
      expr: process_resident_memory_bytes{job="istiod-gloo"} > 838860800
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "istiod RSS > 800MB"
        description: "{{ $value | humanize1024 }}B resident memory. On resource-constrained nodes this risks OOMKill."

    - alert: IstiodFdExhaustion
      expr: process_open_fds{job="istiod-gloo"} / process_max_fds{job="istiod-gloo"} > 0.8
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "istiod file descriptor usage > 80%"
        description: "{{ $value | humanizePercentage }} of max FDs in use. Approaching exhaustion will cause new XDS stream failures."

Diagnostic commands

Is the discovery filter stuck?

# Filter out the "waiting for sync" noise to surface the root cause
kubectl -n istio-system logs deploy/istiod-gloo --tail=300 \
  | grep -v "waiting for sync" \
  | grep -iE "error|auth|cluster"

# Common root cause line:
# "client claims to be in cluster \"Kubernetes\", but we only know about
#  local cluster \"east-ag\" and remote clusters [west-ag]"
# Fix: patch the waypoint Deployment with CLUSTER_ID=<cluster-name>
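
The logs give the cause; the metrics give the symptom. As a cross-check without Prometheus, the "pushes flat while proxies are connected" condition can be tested against two metrics dumps taken a few minutes apart. This is a sketch: both function names are hypothetical, and it assumes samples carry no trailing timestamp.

```shell
# sum_metric FILE NAME [LABEL_MATCH]
# Sums the values of every sample of NAME in a Prometheus text-format file,
# optionally keeping only lines that contain LABEL_MATCH (e.g. type="wds").
# Assumes no trailing timestamps, so the value is the last field.
sum_metric() {
  awk -v name="$2" -v want="$3" '
    /^#/ { next }
    {
      split($1, parts, "{")
      if (parts[1] == name && (want == "" || index($0, want) > 0)) total += $NF
    }
    END { printf "%d\n", total }' "$1"
}

# xds_stalled BEFORE_FILE AFTER_FILE TYPE
# Exit status 0 means stalled: the push counter for TYPE did not move between
# the two snapshots while the later snapshot still shows connected proxies.
xds_stalled() {
  before=$(sum_metric "$1" pilot_xds_pushes "type=\"$3\"")
  after=$(sum_metric "$2" pilot_xds_pushes "type=\"$3\"")
  conns=$(sum_metric "$2" pilot_xds)
  [ "$after" -eq "$before" ] && [ "$conns" -gt 0 ]
}
```

Save two dumps with `curl -s http://localhost:15014/metrics > before.txt` and, a few minutes later, `> after.txt`, then run `xds_stalled before.txt after.txt wds && echo "WDS pushes stalled"`. A completely quiet cluster also shows a flat counter, so restart a pod between the two snapshots to make sure there was something to push.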

Which proxies are connected and synced?

istioctl --context $CLUSTER1 proxy-status
# SYNCED = healthy
# STALE  = config pushed but not yet ACK'd
# NOT SENT = istiod hasn't pushed config at all (stuck filter)

Is the remote cluster peering alive?

istioctl multicluster check --verbose \
  --contexts="${CLUSTER1},${CLUSTER2}"

# Checks: license, pod health, east-west gateway programmed,
# PeeringSucceeded, PeerConnected, PeerDataPlaneProgrammed

Required peering env vars — verify they're set

for CTX in $CLUSTER1 $CLUSTER2; do
  echo "=== $CTX ==="
  echo "--- istiod (need PILOT_ENABLE_K8S_SELECT_WORKLOAD_ENTRIES=false) ---"
  kubectl --context $CTX get deploy istiod-gloo -n istio-system \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
    | grep -E "K8S_SELECT|PEERING|CLUSTER_ID|LICENSE"

  echo "--- ztunnel (need L7_ENABLED=true) ---"
  kubectl --context $CTX get ds ztunnel -n istio-system \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
    | grep -E "L7_ENABLED|NETWORK|CLUSTER_ID"
done

ztunnel workload and waypoint view

ZTUNNEL=$(kubectl -n istio-system get pod -l app=ztunnel -o name | head -1 | sed 's|pod/||')

# All workloads ztunnel knows about (including cross-cluster)
istioctl ztunnel-config workloads $ZTUNNEL -n istio-system

# Services and their waypoints
istioctl ztunnel-config services $ZTUNNEL -n istio-system