The Grafana dashboard recipe defines PromQL alerts for ban signals, start failures, prod pool low, and DLQ depth. Alertmanager must route those to **actionable Slack channels** — not a generic #alerts firehose. This recipe wires Alertmanager receivers with severity-based routing and runbook deep links.
Alert → channel map
| Alert | Severity | Slack channel | First action |
|---|---|---|---|
| MLXBanSignalProd | critical | #mlx-ban-ops | Ban recovery runbook |
| MLXDLQDepthHigh | warning | #mlx-automation | DLQ handler |
| MLXProdPoolLow | warning | #mlx-capacity | Profile pool |
| MLXStartFailureHigh | warning | #mlx-automation | Profile debug |
alertmanager.yml receivers

    global:
      resolve_timeout: 5m

    route:
      receiver: slack-default
      group_by: ['alertname', 'platform', 'client_id']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - match:
            severity: critical
            alertname: MLXBanSignalProd
          receiver: slack-ban-ops
          continue: false
        - match:
            severity: warning
            alertname: MLXDLQDepthHigh
          receiver: slack-automation
        # Start failures share the automation channel, as in the channel map
        - match:
            severity: warning
            alertname: MLXStartFailureHigh
          receiver: slack-automation
        - match:
            severity: warning
            alertname: MLXProdPoolLow
          receiver: slack-capacity

    # Alertmanager does not expand ${VAR} placeholders itself; render this file
    # at deploy time (for example with envsubst) or use api_url_file instead.
    receivers:
      - name: slack-default
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_DEFAULT}'
            channel: '#mlx-alerts'
            send_resolved: true
            title: '{{ .Status | toUpper }} - {{ .CommonLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: slack-ban-ops
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_BAN_OPS}'
            channel: '#mlx-ban-ops'
            send_resolved: true
            color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
            title: '🚨 BAN SIGNAL - {{ .CommonLabels.platform }}'
            text: |
              *Client:* {{ .CommonLabels.client_id }}
              *Profile tier:* prod
              *Runbook:* https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html
              *Forensics:* clone before any proxy change
              {{ range .Alerts }}• {{ .Annotations.summary }}
              {{ end }}
      - name: slack-automation
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_AUTOMATION}'
            channel: '#mlx-automation'
            send_resolved: true
            title: 'Automation - {{ .CommonLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: slack-capacity
        slack_configs:
          - api_url: '${SLACK_WEBHOOK_CAPACITY}'
            channel: '#mlx-capacity'
            send_resolved: true
            title: 'Pool capacity - {{ .CommonLabels.alertname }}'
            text: 'Prod pool below threshold. Check warm tier promotion and burn tier drain.'
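To reason about where an alert lands, it can help to model the routing rules outside Alertmanager. The sketch below is a simplified stand-in, not Alertmanager's own code: it covers only a flat list of child routes with exact-match labels, using a subset of the routes above. The `ROUTES` structure and the `resolve_receivers` name are hypothetical. It mirrors the documented behavior that child routes are tried in order, the first match wins unless `continue` is set, and unmatched alerts fall back to the root receiver.

```python
# Subset of the route tree above as plain data (hypothetical representation).
ROUTES = [
    {"match": {"severity": "critical", "alertname": "MLXBanSignalProd"},
     "receiver": "slack-ban-ops", "continue": False},
    {"match": {"severity": "warning", "alertname": "MLXDLQDepthHigh"},
     "receiver": "slack-automation"},
    {"match": {"severity": "warning", "alertname": "MLXProdPoolLow"},
     "receiver": "slack-capacity"},
]
DEFAULT_RECEIVER = "slack-default"


def resolve_receivers(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receivers an alert with these labels would reach.

    Routes are checked top to bottom. A route matches only if every label in
    its `match` block is equal. Matching stops at the first hit unless that
    route sets continue=True. With no hit, the root receiver is used.
    """
    matched = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            matched.append(route["receiver"])
            if not route.get("continue", False):
                break
    return matched or [default]


if __name__ == "__main__":
    print(resolve_receivers({"alertname": "MLXBanSignalProd", "severity": "critical"}))
    print(resolve_receivers({"alertname": "SomethingElse", "severity": "warning"}))
```

Note that a ban alert arriving with `severity: warning` would not match the first route, because both labels in the `match` block must agree; it would fall through to `slack-default`.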
Slack incoming webhook setup
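- Create a Slack app, open **Incoming Webhooks**, and enable it for the workspace.
- Create one webhook URL per channel (`#mlx-ban-ops`, etc.). Do not reuse a single URL with a `channel` override unless you are using legacy tokens.
- Store the URLs in a secrets manager and inject them as env vars at Alertmanager deploy. Alertmanager does not expand `${VAR}` placeholders in its config file, so render the file during deploy or point `api_url_file` at a mounted secret.
- Test with `amtool alert add` before enabling prod routes.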
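The env var injection step above can be sketched as a small render script run before Alertmanager starts. This is one possible approach, assuming the config is kept as a template containing `${SLACK_WEBHOOK_...}` placeholders; the `render_config` helper is hypothetical. It uses Python's `string.Template`, which leaves Go template expressions such as `{{ .Status }}` untouched as long as they contain no `$` character.

```python
import os
from string import Template


def render_config(template_text, env=None):
    """Replace ${VAR} placeholders with values from env (default: os.environ).

    Raises KeyError when a placeholder has no value, so a missing secret
    fails the deploy instead of producing a config with an empty URL.
    """
    values = os.environ if env is None else env
    return Template(template_text).substitute(values)


if __name__ == "__main__":
    snippet = (
        "api_url: '${SLACK_WEBHOOK_BAN_OPS}'\n"
        "title: '{{ .CommonLabels.alertname }}'\n"
    )
    # Placeholder URL for illustration only, not a real webhook.
    print(render_config(snippet, {"SLACK_WEBHOOK_BAN_OPS": "https://example.invalid/hook"}))
```

If any notification template uses Go template variables (for example `{{ $labels }}`), the `$` would need to be escaped as `$$` in the template file, or a different substitution tool chosen.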
Ban alert enrichment (optional sidecar)
When MLXBanSignalProd fires, a small webhook sidecar can call your CMDB API to tag tier=burn and pause queue routing for that profile_id. Keep Alertmanager dumb: it only forwards the alert, and the sidecar handles idempotent CMDB writes. Pair this with the webhook receiver HMAC pattern.
# Sidecar listens on /hooks/alertmanager
# Payload: Alertmanager webhook v4 JSON
# On ban alert: PATCH cmdb/profiles/{id} tier=burn, LPUSH mlx:dlq review job
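A minimal sketch of the sidecar's decision logic follows, kept separate from any HTTP server or client so it can be tested on its own. It assumes each alert carries a `profile_id` label (not confirmed by this recipe) and uses the alert `fingerprint` from the webhook payload as an idempotency key. The `plan_ban_actions` name and the shape of the returned action dicts are hypothetical; the actual CMDB and Redis calls are left to the caller.

```python
import json

BAN_ALERT = "MLXBanSignalProd"


def plan_ban_actions(payload, seen):
    """Turn an Alertmanager webhook payload into CMDB and queue actions.

    `seen` is a set of idempotency keys already handled; it is updated in
    place so that a repeated notification produces no new actions.
    """
    if payload.get("version") != "4":
        raise ValueError("unexpected webhook payload version")
    actions = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing" or labels.get("alertname") != BAN_ALERT:
            continue
        profile_id = labels.get("profile_id")  # assumption: label set by the alert rule
        key = alert.get("fingerprint") or profile_id
        if not profile_id or key in seen:
            continue
        seen.add(key)
        # Mirrors the comment block above: PATCH the profile, then queue a review job.
        actions.append({"method": "PATCH",
                        "path": f"cmdb/profiles/{profile_id}",
                        "body": {"tier": "burn"}})
        actions.append({"op": "LPUSH",
                        "key": "mlx:dlq",
                        "value": json.dumps({"job": "review", "profile_id": profile_id})})
    return actions
```

In a real deployment the `seen` set would need to live somewhere durable (for example in Redis with a TTL), since an in-memory set is lost on restart.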
Metrics dependency
Pool and DLQ alerts require Redis exporter sidecar metrics (mlx_pool_depth, mlx_dlq_depth). Ban signals come from worker instrumentation in the observability guide.
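For reference, the two gauges named above would reach Prometheus in the text exposition format. The sketch below only illustrates that format; it is not the exporter from the observability guide. The `render_gauges` helper is hypothetical, and real metrics would likely carry labels such as `platform` or `client_id`, which are omitted here.

```python
def render_gauges(depths):
    """Render a mapping of metric name to numeric value as Prometheus gauges."""
    lines = []
    for name in sorted(depths):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {depths[name]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only; a real exporter would read queue lengths from Redis.
    print(render_gauges({"mlx_pool_depth": 7, "mlx_dlq_depth": 0}), end="")
```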
Disclosure: MLX-MMO is affiliated with Multilogin.