The observability guide defines the metrics; this recipe gives you a **Grafana dashboard layout** that ops teams actually use: one screen for profile health, queue pressure, and ban signals. It assumes Prometheus scrapes the workers on :9100/metrics and that a Redis exporter supplies queue depth.
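The scrape setup assumed above could look roughly like the following `prometheus.yml` fragment. This is a minimal sketch: the job names and hostnames are placeholders, and the exporter target assumes the widely used `oliver006/redis_exporter` on its default port 9121.

```yaml
# prometheus.yml (fragment) - illustrative only; hostnames and job names are placeholders
scrape_configs:
  - job_name: mlx_workers                 # hypothetical job name
    metrics_path: /metrics
    static_configs:
      - targets:
          - worker-1.internal:9100        # placeholder worker hosts
          - worker-2.internal:9100

  - job_name: redis_exporter              # hypothetical job name
    static_configs:
      - targets:
          - redis-exporter.internal:9121  # 9121 is redis_exporter's default listen port
```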

Dashboard rows (top → bottom)

| Row | Panels | Purpose |
|---|---|---|
| Overview | Start success %, p95 latency, active jobs | Is the fleet healthy right now? |
| Ban & risk | Ban signals/h by platform, burn tier count | Early warning before finance |
| Queue & DLQ | LLEN mlx:jobs, DLQ depth, webhook fail rate | Backpressure and stuck jobs |
| Profile pool | SCARD prod/warm/burn, health fail rate | Capacity for new jobs |
| Cloud phone | ADB offline, Appium session duration | Mobile layer health |

Key PromQL queries

# Start success rate (5m)
sum(rate(mlx_profile_start_total{status="ok"}[5m]))
  / sum(rate(mlx_profile_start_total[5m]))

# p95 start latency
histogram_quantile(0.95,
  sum(rate(mlx_profile_start_duration_seconds_bucket[5m])) by (le, worker_host))

# Ban signals last 24h by platform
sum(increase(mlx_ban_signal_total{tier="prod"}[24h])) by (platform)

# CDP reconnect rate per job (approx)
sum(rate(mlx_cdp_reconnect_total[1h]))
  / sum(rate(mlx_job_finished_total[1h]))

# Profile pool depth (Redis exporter — custom metric or script)
redis_key_size{key="mlx:pool:prod"}
redis_key_size{key="mlx:pool:burn"}

Panel JSON snippet (stat — start success)

{
  "title": "Profile Start Success (5m)",
  "type": "stat",
  "targets": [{
    "expr": "sum(rate(mlx_profile_start_total{status=\"ok\"}[5m])) / sum(rate(mlx_profile_start_total[5m]))",
    "legendFormat": "success"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "red", "value": null},
          {"color": "yellow", "value": 0.92},
          {"color": "green", "value": 0.97}
        ]
      }
    }
  }
}

To use the snippet, paste it into a panel's JSON editor or build the panel manually; Grafana → Dashboards → Import expects a complete dashboard JSON rather than a single panel. **Full dashboard bundle:** download grafana-mlx-health-dashboard.json (5 panels: start success, pool depth, DLQ, health probe fail rate, ban signals). For the full metric schema, see the observability guide.

Prometheus alerting rules (YAML)

groups:
  - name: mlx_prod
    rules:
      - alert: MLXBanSignalProd
        expr: increase(mlx_ban_signal_total{tier="prod"}[1h]) > 0
        for: 0m
        labels: { severity: critical }
        annotations:
          summary: "Ban signal on prod tier"
          description: "Platform {{ $labels.platform }} — run ban recovery runbook"

      - alert: MLXStartFailureHigh
        expr: |
          sum(rate(mlx_profile_start_total{status="error"}[15m]))
          / sum(rate(mlx_profile_start_total[15m])) > 0.05
        for: 10m
        labels: { severity: warning }

      - alert: MLXProdPoolLow
        expr: redis_key_size{key="mlx:pool:prod"} < 5
        for: 15m
        labels: { severity: warning }

      - alert: MLXDLQDepthHigh
        expr: redis_key_size{key="mlx:dlq"} > 10
        for: 15m
        labels: { severity: warning }

When MLXBanSignalProd fires, send it by webhook to the ops Slack channel; see the Alertmanager → Slack recipe and the ban recovery runbook.
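As a rough illustration of that routing, an `alertmanager.yml` fragment might look like this. It is a sketch under assumptions: the receiver names, the `#ops` channel, and the webhook URL are placeholders, and the linked Slack recipe remains the reference for the real setup.

```yaml
# alertmanager.yml (fragment) - illustrative; receiver names, channel, and URL are placeholders
route:
  receiver: default
  routes:
    - matchers:
        - alertname="MLXBanSignalProd"
      receiver: ops-slack

receivers:
  - name: default
  - name: ops-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder webhook URL
        channel: "#ops"                                        # assumed channel name
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ .CommonAnnotations.description }}"
        send_resolved: true
```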

Health probe alerts (cron → Grafana)

Wire the health check cron's metrics (mlx_health_probe_total, mlx_health_probe_seconds) into the same dashboard. The cron exposes them on :9102/metrics, so add a second scrape job or federate them into Prometheus.
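For the second-scrape-job option, a minimal sketch of the extra `scrape_configs` entry follows. The job name, hostname, and interval are assumptions, not values from the cron recipe.

```yaml
# prometheus.yml (fragment) - add alongside the existing worker job
scrape_configs:
  - job_name: mlx_health_cron           # hypothetical job name
    scrape_interval: 60s                # assumed; align with how often the cron runs
    metrics_path: /metrics
    static_configs:
      - targets:
          - cron-host.internal:9102     # placeholder host; port from the text above
```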

# Probe fail rate by tier (15m)
sum(rate(mlx_health_probe_total{status="fail"}[15m])) by (tier)
  / sum(rate(mlx_health_probe_total[15m])) by (tier)

# p99 probe duration — MLX API degradation
histogram_quantile(0.99,
  sum(rate(mlx_health_probe_seconds_bucket[15m])) by (le, tier))

# Alert rules for the probe metrics (add under the rules: list of a rule group such as mlx_prod)
      - alert: MLXHealthProbeFailSpike
        expr: |
          sum(rate(mlx_health_probe_total{status="fail",tier="prod"}[1h]))
          / sum(rate(mlx_health_probe_total{tier="prod"}[1h])) > 0.10
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "Prod pool probe fail rate > 10%"
          description: "Check proxy health and MLX API — demotion may follow per cron policy"

      - alert: MLXHealthProbeSlow
        expr: |
          histogram_quantile(0.99,
            sum(rate(mlx_health_probe_seconds_bucket[15m])) by (le)) > 60
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Health probe p99 > 60s"

Route MLXHealthProbeFailSpike to Slack (#mlx-capacity). Escalate to PagerDuty only when MLXProdPoolLow fires in the same window; see the PagerDuty recipe. On the dashboard, add the probe panels to the Profile pool row, next to SCARD prod.
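One way to express the "escalate only when both fire" condition is a composite rule built on Prometheus's built-in `ALERTS` series. The sketch below is an assumption about how this could be wired, not the PagerDuty recipe's method; the group name, alert name, and `severity: page` label are hypothetical.

```yaml
groups:
  - name: mlx_escalation                      # hypothetical group name
    rules:
      - alert: MLXProbeFailWithPoolLow        # hypothetical alert name
        # Returns a result only while both underlying alerts are firing at the same time.
        expr: |
          count(ALERTS{alertname="MLXHealthProbeFailSpike", alertstate="firing"})
          and
          count(ALERTS{alertname="MLXProdPoolLow", alertstate="firing"})
        for: 0m
        labels: { severity: page }            # assumed label that a route maps to PagerDuty
        annotations:
          summary: "Probe failures while prod pool is low"
```

An Alertmanager route matching `severity="page"` would then send only this composite alert to the PagerDuty receiver, while the two underlying warnings keep going to Slack.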

Redis exporter for queue / pool

See the Redis exporter sidecar recipe: the sidecar polls LLEN, SCARD, and lease keys so Grafana can chart them.
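If the sidecar is the common `oliver006/redis_exporter` (an assumption; the sidecar recipe may use a custom script instead), a Docker Compose sketch could export the keys referenced on this page. The `redis` hostname, the `db0` database, and the `mlx:pool:warm` key name are guesses inferred from the panels above, and lease keys are not covered here.

```yaml
# docker-compose.yml (fragment) - illustrative; adjust host, db index, and key names
services:
  redis-exporter:
    image: oliver006/redis_exporter:latest
    environment:
      REDIS_ADDR: redis://redis:6379          # placeholder Redis address
      # Each entry is <db>=<key>; sizes are exported as redis_key_size (list length, set cardinality)
      REDIS_EXPORTER_CHECK_SINGLE_KEYS: "db0=mlx:jobs,db0=mlx:dlq,db0=mlx:pool:prod,db0=mlx:pool:warm,db0=mlx:pool:burn"
    ports:
      - "9121:9121"                           # redis_exporter's default port
```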

Variables (Grafana dashboard)

Disclosure: MLX-MMO is affiliated with Multilogin.