The observability guide defines the metrics; this recipe gives you a **Grafana dashboard layout** that ops teams actually use: one screen for profile health, queue pressure, and ban signals. It assumes Prometheus scrapes the workers on :9100/metrics and a Redis exporter supplies queue depth.
Dashboard rows (top → bottom)
| Row | Panels | Purpose |
|---|---|---|
| Overview | Start success %, p95 latency, active jobs | Is the fleet healthy right now? |
| Ban & risk | Ban signals/h by platform, burn tier count | Early warning before finance |
| Queue & DLQ | LLEN mlx:jobs, DLQ depth, webhook fail rate | Backpressure and stuck jobs |
| Profile pool | SCARD prod/warm/burn, health fail rate | Capacity for new jobs |
| Cloud phone | ADB offline, Appium session duration | Mobile layer health |
Key PromQL queries
# Start success rate (5m)
sum(rate(mlx_profile_start_total{status="ok"}[5m]))
/ sum(rate(mlx_profile_start_total[5m]))
# p95 start latency
histogram_quantile(0.95,
sum(rate(mlx_profile_start_duration_seconds_bucket[5m])) by (le, worker_host))
# Ban signals last 24h by platform
sum(increase(mlx_ban_signal_total{tier="prod"}[24h])) by (platform)
# CDP reconnect rate per job (approx)
sum(rate(mlx_cdp_reconnect_total[1h]))
/ sum(rate(mlx_job_finished_total[1h]))
# Profile pool depth (Redis exporter — custom metric or script)
redis_key_size{key="mlx:pool:prod"}
redis_key_size{key="mlx:pool:burn"}
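When several panels and alerts reuse the same expressions, they can be precomputed as Prometheus recording rules so the dashboard and the alerts read one definition. The following is a minimal sketch, assuming the metrics above exist exactly as written. The group name and both record names are hypothetical and only follow the common level:metric:operations naming convention.

```yaml
# Hypothetical recording rules; load them through rule_files in prometheus.yml.
groups:
  - name: mlx_dashboard_recordings   # assumed group name
    interval: 30s
    rules:
      # Fleet-wide start success ratio over 5m (same expression as the stat panel).
      - record: mlx:profile_start_success:ratio_rate5m
        expr: |
          sum(rate(mlx_profile_start_total{status="ok"}[5m]))
          / sum(rate(mlx_profile_start_total[5m]))
      # p95 start latency, kept per worker_host for drill-down.
      - record: worker_host:mlx_profile_start_duration_seconds:p95_rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(mlx_profile_start_duration_seconds_bucket[5m])) by (le, worker_host))
```

A panel can then query the record name directly, which keeps heavy histogram math out of every dashboard refresh.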
Panel JSON snippet (stat — start success)
{
  "title": "Profile Start Success (5m)",
  "type": "stat",
  "targets": [{
    "expr": "sum(rate(mlx_profile_start_total{status=\"ok\"}[5m])) / sum(rate(mlx_profile_start_total[5m]))",
    "legendFormat": "success"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 0.92},
          {"color": "green", "value": 0.97}
        ]
      }
    }
  }
}
To use the snippet, paste it into a panel through Inspect → Panel JSON, or build the panel manually; Grafana → Dashboards → Import expects a complete dashboard JSON rather than a single panel. **Full dashboard bundle:** download grafana-mlx-health-dashboard.json (5 panels: start success, pool depth, DLQ, health probe fail rate, ban signals). Full metric schema: observability guide.
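As an alternative to manual import, Grafana can load dashboard JSON files from disk through a provisioning file. This is a sketch under stated assumptions: the bundle has been copied to /var/lib/grafana/dashboards/mlx (a hypothetical path), and the provider name and folder title are placeholders.

```yaml
# Grafana dashboard provider, e.g. provisioning/dashboards/mlx.yaml
apiVersion: 1
providers:
  - name: mlx-health            # assumed provider name
    orgId: 1
    folder: MLX                 # assumed Grafana folder title
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30   # how often Grafana rescans the path
    allowUiUpdates: false       # keep the file on disk as the source of truth
    options:
      # Hypothetical directory holding grafana-mlx-health-dashboard.json
      path: /var/lib/grafana/dashboards/mlx
```

Keeping the dashboard in a file makes changes reviewable in version control; edits made in the UI are not saved back while allowUiUpdates is false.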
Prometheus alerting rules (YAML)
groups:
  - name: mlx_prod
    rules:
      - alert: MLXBanSignalProd
        expr: increase(mlx_ban_signal_total{tier="prod"}[1h]) > 0
        for: 0m
        labels: { severity: critical }
        annotations:
          summary: "Ban signal on prod tier"
          description: "Platform {{ $labels.platform }}: run ban recovery runbook"
      - alert: MLXStartFailureHigh
        expr: |
          sum(rate(mlx_profile_start_total{status="error"}[15m]))
          / sum(rate(mlx_profile_start_total[15m])) > 0.05
        for: 10m
        labels: { severity: warning }
      - alert: MLXProdPoolLow
        expr: redis_key_size{key="mlx:pool:prod"} < 5
        for: 15m
        labels: { severity: warning }
      - alert: MLXDLQDepthHigh
        expr: redis_key_size{key="mlx:dlq"} > 10
        for: 15m
        labels: { severity: warning }
When MLXBanSignalProd fires, send it to the ops Slack channel through a webhook receiver; see the Alertmanager → Slack recipe and the ban recovery runbook.
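The routing for MLXBanSignalProd could look roughly like the Alertmanager fragment below. It is only a sketch: the receiver names, the #ops channel, and the webhook file path are assumptions, and the Alertmanager → Slack recipe remains the reference for the real setup.

```yaml
# alertmanager.yml (fragment); names, channel, and path are placeholders.
route:
  receiver: default
  group_by: ['alertname', 'platform']
  routes:
    - receiver: ops-slack
      matchers:
        - alertname = "MLXBanSignalProd"
      group_wait: 10s        # notify quickly; the rule itself uses for: 0m
      repeat_interval: 1h

receivers:
  - name: default            # catch-all receiver with no notifier configured
  - name: ops-slack
    slack_configs:
      # File containing the Slack incoming-webhook URL (assumed location)
      - api_url_file: /etc/alertmanager/slack_webhook_url
        channel: '#ops'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```

Grouping by platform keeps the description annotation identical within each notification, so the platform name from the rule shows up in the Slack message.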
Health probe alerts (cron → Grafana)
Wire the metrics from the health check cron (mlx_health_probe_total, mlx_health_probe_seconds) into the same dashboard. The cron exposes them on :9102/metrics, so add a second scrape job or federate them into Prometheus.
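A second scrape job for the cron endpoint might look like the sketch below. The host names and job names are hypothetical; only the ports :9100 and :9102 come from this recipe, and /metrics is already the default metrics_path.

```yaml
# prometheus.yml (fragment); targets and job names are placeholders.
scrape_configs:
  - job_name: mlx-workers
    static_configs:
      - targets: ['worker-1.internal:9100']   # hypothetical worker host
  - job_name: mlx-health-cron
    scrape_interval: 30s
    static_configs:
      - targets: ['cron-1.internal:9102']     # hypothetical cron host
        labels:
          component: health-probe             # optional label for filtering
```

If the cron runs as a short-lived process rather than a long-running one, the endpoint may not be up when Prometheus scrapes it; in that case a different delivery path would be needed, which this sketch does not cover.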
# Probe fail rate by tier (15m)
sum(rate(mlx_health_probe_total{status="fail"}[15m])) by (tier)
/ sum(rate(mlx_health_probe_total[15m])) by (tier)
# p99 probe duration — MLX API degradation
histogram_quantile(0.99,
sum(rate(mlx_health_probe_seconds_bucket[15m])) by (le, tier))
      # Add under rules: in the mlx_prod group
      - alert: MLXHealthProbeFailSpike
        expr: |
          sum(rate(mlx_health_probe_total{status="fail",tier="prod"}[1h]))
          / sum(rate(mlx_health_probe_total{tier="prod"}[1h])) > 0.10
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "Prod pool probe fail rate > 10%"
          description: "Check proxy health and MLX API; demotion may follow per cron policy"
      - alert: MLXHealthProbeSlow
        expr: |
          histogram_quantile(0.99,
            sum(rate(mlx_health_probe_seconds_bucket[15m])) by (le)) > 60
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Health probe p99 > 60s"
Route MLXHealthProbeFailSpike to Slack (#mlx-capacity). Escalate to PagerDuty only when MLXProdPoolLow fires in the same window; see the PagerDuty recipe. Panel placement: add the probe panels to the Profile pool row, next to the SCARD prod panel.
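One way to express "escalate only when both fire" is a composite rule built on Prometheus's built-in ALERTS series, which Alertmanager can then route to PagerDuty by alert name or severity. This is a sketch of one interpretation, where "same window" means both alerts are firing at the same time; the group name and alert name are hypothetical.

```yaml
groups:
  - name: mlx_escalation                    # assumed group name
    rules:
      - alert: MLXProbeFailWithLowPool      # hypothetical composite alert
        # count() drops all labels, so the two sides of `and` match
        # whenever each alert has at least one firing series.
        expr: |
          count(ALERTS{alertname="MLXHealthProbeFailSpike", alertstate="firing"}) > 0
          and
          count(ALERTS{alertname="MLXProdPoolLow", alertstate="firing"}) > 0
        for: 0m
        labels: { severity: critical }
        annotations:
          summary: "Prod probe failures while prod pool is low"
```

Because the underlying alerts already carry their own for: durations, the composite rule does not add another wait.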
Redis exporter for queue / pool
See the Redis exporter sidecar recipe: the sidecar polls LLEN, SCARD, and lease keys and exposes them for Prometheus to scrape, so Grafana can chart them.
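One way to obtain redis_key_size series is the open-source oliver006/redis_exporter, which reports list length or set cardinality for keys named with --check-single-keys. The Compose fragment below is a sketch and not necessarily what the sidecar recipe uses: the redis host name and the mlx:pool:warm key are assumptions (the other key names appear in the queries above).

```yaml
# docker-compose.yml (fragment); host name and warm-pool key are assumed.
services:
  redis-exporter:
    image: oliver006/redis_exporter:latest
    command:
      - --redis.addr=redis://redis:6379
      # dbN=key pairs; each key is looked up directly, without scanning
      - --check-single-keys=db0=mlx:jobs,db0=mlx:dlq,db0=mlx:pool:prod,db0=mlx:pool:warm,db0=mlx:pool:burn
    ports:
      - "9121:9121"   # exporter's default listen port
```

Prometheus then needs a scrape job pointing at port 9121 for the series to reach the dashboard.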
Variables (Grafana dashboard)
- $client_id: filter by agency client
- $platform: amazon, tiktok_shop, shopee, etc.
- $worker_host: per-worker drill-down
Disclosure: MLX-MMO is affiliated with Multilogin.